Idea #21408
closedtest-provision-debian11 fails loading workflow Docker image
Description
This started after merging #21230 and #21394.
Running test CWL workflow INFO /usr/bin/cwl-runner 2.8.0.dev20240119143735, arvados-python-client 2.8.0.dev20240119143735, cwltool 3.1.20230601100705 INFO Resolved 'hasher-workflow.cwl' to 'file:///tmp/cluster_tests/hasher-workflow.cwl' INFO hasher-workflow.cwl:36:7: Unknown hint WorkReuse INFO hasher-workflow.cwl:50:7: Unknown hint WorkReuse INFO hasher-workflow.cwl:64:7: Unknown hint WorkReuse INFO Using cluster dbn11 (https://dbn11.local:443/) DEBUG ENTER upload_job_order 1705679832.0055475 DEBUG ENTER scandeps 1705679832.007659 DEBUG EXIT scandeps 1705679832.0088325 0.0011734962463378906 DEBUG ENTER collect uuids 1705679832.0089567 DEBUG EXIT collect uuids 1705679832.0090437 8.702278137207031e-05 DEBUG ENTER collect uploads 1705679832.0090969 DEBUG EXIT collect uploads 1705679832.0091777 8.082389831542969e-05 DEBUG ENTER fetch_uuids 1705679832.0092268 DEBUG EXIT fetch_uuids 1705679832.0092647 3.790855407714844e-05 DEBUG ENTER mapper 1705679832.010263 INFO Upload local files: "test.txt" INFO Uploaded to f55e750025853f5b8ccae3ca79240f65+54 (dbn11-4zz18-efp6b3h7f97ysul) DEBUG EXIT mapper 1705679832.0954368 0.0851738452911377 DEBUG ENTER collectloc 1705679832.0955882 DEBUG EXIT collectloc 1705679832.0957391 0.00015091896057128906 DEBUG ENTER setloc 1705679832.0959854 DEBUG EXIT setloc 1705679832.096145 0.0001595020294189453 DEBUG EXIT upload_job_order 1705679832.0962608 0.0907132625579834 INFO Uploading workflow dependencies DEBUG ENTER upload_workflow_deps 1705679832.096431 DEBUG ENTER upload_docker 1705679832.096509 INFO ['docker', 'pull', 'arvados/jobs:2.8.0.dev20240119143735'] 2.8.0.dev20240119143735: Pulling from arvados/jobs [`docker pull` progress omitted] Digest: sha256:f7991a0df2f3dd11b51a3b2aaba0ceb6af5c61195f5671daca6e963d4d8aa487 Status: Downloaded newer image for arvados/jobs:2.8.0.dev20240119143735 docker.io/arvados/jobs:2.8.0.dev20240119143735 INFO Uploading Docker image arvados/jobs:2.8.0.dev20240119143735 2024-01-19 15:57:30 arvados.arv_put[41618] INFO: Creating new cache file at /root/.cache/arvados/arv-put/f8be69c1db20b934481409a3f508ad24 2024-01-19 15:57:32 arvados.arv_put[41618] INFO: Collection saved as 'Docker image arvados jobs:2.8.0.dev20240119143735 sha256:b3d59' dbn11-4zz18-cky0zrtpcygez8q 2024-01-19 15:57:32 arvados.cwl-runner.metrics[41618] DEBUG: EXIT upload_docker 1705679852.4451454 20.348636388778687 (X-Request-Id: req-z32hupda8gmghtycmu3f) 2024-01-19 15:57:32 arvados.cwl-runner.metrics[41618] DEBUG: EXIT upload_workflow_deps 1705679852.445314 20.348882913589478 2024-01-19 15:57:32 cwltool[41618] ERROR: Workflow error: "filename 'b3d596985843bccad4c72223479cb9b8159cf6483fd0bfbe732ef152c10e7fcd.json' not found" Traceback (most recent call last): File "/usr/share/python3.9/dist/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados_cwl/arvdocker.py", line 141, in arv_docker_get_image arvados.commands.keepdocker.main(args, stdout=sys.stderr, install_sig_handlers=False, api=api_client) File "/usr/share/python3.9/dist/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados/commands/keepdocker.py", line 546, in main json_file = image_tar.extractfile(image_tar.getmember(json_filename)) File "/usr/lib/python3.9/tarfile.py", line 1790, in getmember raise KeyError("filename %r not found" % name) KeyError: "filename 'b3d596985843bccad4c72223479cb9b8159cf6483fd0bfbe732ef152c10e7fcd.json' not found" During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/share/python3.9/dist/python3-arvados-cwl-runner/lib/python3.9/site-packages/cwltool/main.py", line 1301, in main (out, status) = real_executor( File "/usr/share/python3.9/dist/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados_cwl/executor.py", line 695, in arv_executor merged_map = upload_workflow_deps(self, tool, runtimeContext) File "/usr/share/python3.9/dist/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados_cwl/runner.py", line 741, in upload_workflow_deps upload_docker(arvrunner, tool, runtimeContext) File "/usr/share/python3.9/dist/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados_cwl/runner.py", line 557, in upload_docker upload_docker(arvrunner, s.embedded_tool, runtimeContext) File "/usr/share/python3.9/dist/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados_cwl/runner.py", line 551, in upload_docker arvados_cwl.arvdocker.arv_docker_get_image(arvrunner.api, docker_req, True, runtimeContext) File "/usr/share/python3.9/dist/python3-arvados-cwl-runner/lib/python3.9/site-packages/arvados_cwl/arvdocker.py", line 157, in arv_docker_get_image cached_lookups[dockerRequirement["dockerImageId"]] = pdh File "schema_salad/sourceline.py", line 250, in __exit__ cwltool.errors.WorkflowException: "filename 'b3d596985843bccad4c72223479cb9b8159cf6483fd0bfbe732ef152c10e7fcd.json' not found" Build step 'Execute shell' marked build as failure
Files
Updated by Brett Smith 11 months ago
Updated by Brett Smith 11 months ago
The file in question does exist in the image. Did something go awry saving/loading the tar archive?
$ docker save 'arvados/jobs:2.8.0.dev20240119143735' | tar -t | grep -F 'b3d596985843bccad4c72223479cb9b8159cf6483fd0bfbe732ef152c10e7fcd.json'; echo $? b3d596985843bccad4c72223479cb9b8159cf6483fd0bfbe732ef152c10e7fcd.json 0
Updated by Brett Smith 11 months ago
Brett Smith wrote in #note-2:
The file in question does exist in the image. Did something go awry saving/loading the tar archive?
Doesn't seem like it, a rerun failed in the exact same way: test-provision-debian11: #584
Updated by Peter Amstutz 11 months ago
Brett Smith wrote in #note-3:
Brett Smith wrote in #note-2:
The file in question does exist in the image. Did something go awry saving/loading the tar archive?
Doesn't seem like it, a rerun failed in the exact same way: test-provision-debian11: #584
I can reproduce this locally, I just haven't been able to dig into it yet.
Updated by Peter Amstutz 11 months ago
root@05ac4094679e:/var/lib/arvados-arvbox/.cache/arvados/docker# tar tf './sha256:b3d596985843bccad4c72223479cb9b8159cf6483fd0bfbe732ef152c10e7fcd.tar' blobs/ blobs/sha256/ blobs/sha256/12903f2a4aeda35394dc110df288a667dad7276f5dcceb595ba0a202f3441c84 blobs/sha256/2694fb8f3a9db9e88d1d79fe2690d011385264be9ccea4430079e7c52e60f778 blobs/sha256/2988be0dc4d0d29d9d0bc466524cd83cacfaf80c5d1189396d71ac3b7a931ed9 blobs/sha256/2fa2daa972a9827f3f601e6ea83f721451b87bf1ee0a2c6f8c8bb8963fbb2b9e blobs/sha256/5b51bb2fc9b4a1d792a3957e67d1ad786d827a512a998005be7abc741ce7ed9d blobs/sha256/87795966100c75db8a99441a415f004e5f7b2f037ec19d77370a0a42bcb0848d blobs/sha256/9d0f57cd0d488542afca7714d11d8b25ecb09b16dbd18857e2ddc03f0a757388 blobs/sha256/b3d596985843bccad4c72223479cb9b8159cf6483fd0bfbe732ef152c10e7fcd blobs/sha256/cabf5b2ebcdc289c7757efc0fbcf983da6f23426d5b24e38edb9880ceac73c75 blobs/sha256/e02922dd07706d31258ea0743d75ee0d1f9aa59d9c618c9ab957d072c4628dcc blobs/sha256/e6cd39d4cea4051114c9bf1c61995efdb7b33a54c7b46b33b221f8371ae8c2b7 blobs/sha256/f88b3b7e08fcd32afc11571aee4747b7a9a36e3ea655fd8d729f14217eeeace3 blobs/sha256/fbfef1f52bf8a87e2b66659f24263926701e4eda4bfeb4aa2698344e50ea80d6 blobs/sha256/fc745bd5939c6cd55a206d0f45f99351501c3b874c2c0fe5fec245ff2233bfa1 index.json manifest.json oci-layout root@05ac4094679e:/var/lib/arvados-arvbox/.cache/arvados/docker# docker --version Docker version 25.0.0, build e758fe5
Updated by Peter Amstutz 11 months ago
https://docs.docker.com/engine/release-notes/25.0/
The docker image save tarball output is now OCI compliant. moby/moby#44598
The format inside the tarball changed. Arv-keepdocker looks inside the tarball to get the "created" timestamp. This is added to the Arvados metadata and used to resolve conflicts when there are two images with the same name, by choosing the one created more recently (independent of when the images were actually uploaded).
Either Keepdocker needs to be updated to handle OCI images, or we should remove the image inspection behavior altogether (I believe it already falls back to using the Arvados collection timestamp to resolve conflicts if there's no image_timestamp
).
Image handling in general is long overdue to be revisited, but that's out of scope for this ticket.
Updated by Tom Clegg 11 months ago
- Related to Bug #21417: Stop trying to read image timestamp from docker metadata in arv-keepdocker added
Updated by Brett Smith 11 months ago
- Target version changed from To be scheduled to Development 2024-02-14 sprint
- Status changed from New to Resolved
Fixed after merging #21417. test-provision-debian11: #588