Bug #22617
Compute node images don't ensure encrypted partitions because of cycle with docker.socket
Status: Closed
Description
Brett to add details, but this happened on tordo-xvhdp-kfkn39r3glw8n2y, and you can see it in tordo's a-d-c logs for 2025-02-26.
Updated by Tom Clegg about 2 months ago
{"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","InstanceTypes":"g4dnxlarge","PID":304509,"Priority":562948212891007462,"State":"Queued","level":"info","msg":"adding container to queue","time":"2025-02-26T02:49:01.451755748Z"} {"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"creating new instance","time":"2025-02-26T02:49:01.502047832Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process started","time":"2025-02-26T02:49:06.488994241Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process ended","time":"2025-02-26T02:50:17.610689582Z"} {"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","PID":304509,"Priority":562948212891007462,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-02-26T02:50:21.875754840Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process started","time":"2025-02-26T02:50:32.015323461Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process ended","time":"2025-02-26T02:51:57.243769580Z"} {"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","PID":304509,"Priority":562948212891007462,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-02-26T02:52:02.405996488Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process started","time":"2025-02-26T02:52:07.387843463Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process ended","time":"2025-02-26T02:53:17.477243009Z"} {"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","PID":304509,"Priority":562948212891007462,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-02-26T02:53:22.812441405Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process started","time":"2025-02-26T02:53:32.949572697Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process ended","time":"2025-02-26T02:54:37.260987523Z"} {"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","PID":304509,"Priority":562948212891007462,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-02-26T02:54:43.238247124Z"} 
{"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process started","time":"2025-02-26T02:54:53.375432657Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process ended","time":"2025-02-26T02:55:57.131335143Z"} {"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","PID":304509,"Priority":562948212891007462,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-02-26T02:56:03.683812785Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process started","time":"2025-02-26T02:56:13.824259732Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process ended","time":"2025-02-26T02:57:17.239692138Z"} {"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","PID":304509,"Priority":562948212891007462,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-02-26T02:57:24.124246532Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process started","time":"2025-02-26T02:57:34.258460911Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process ended","time":"2025-02-26T02:58:37.144604036Z"} {"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","PID":304509,"Priority":562948212891007462,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-02-26T02:58:44.560316352Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process started","time":"2025-02-26T02:58:54.701662518Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process ended","time":"2025-02-26T03:00:07.452990139Z"} {"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","PID":304509,"Priority":562948212891007462,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-02-26T03:00:15.063772714Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process started","time":"2025-02-26T03:00:25.200943323Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process 
ended","time":"2025-02-26T03:01:37.197851921Z"} {"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","PID":304509,"Priority":562948212891007462,"State":"Locked","level":"info","msg":"requeueing locked container because crunch-run exited","time":"2025-02-26T03:01:45.545731863Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process started","time":"2025-02-26T03:01:55.696853472Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"Reason":"state=Cancelled","level":"info","msg":"killing crunch-run process","time":"2025-02-26T03:03:06.014959745Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"Signal":15,"level":"info","msg":"sending signal","time":"2025-02-26T03:03:06.015084807Z"} {"Address":"10.253.254.242","ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","Instance":"i-0afbd98f698296a88","InstanceType":"g4dnxlarge","PID":304509,"level":"info","msg":"crunch-run process ended","time":"2025-02-26T03:03:06.035275902Z"} {"ClusterID":"tordo","ContainerUUID":"tordo-dz642-45mf7na7mgozajk","PID":304509,"State":"Cancelled","level":"info","msg":"dropping container from queue","time":"2025-02-26T03:03:16.047103405Z"}
Updated by Brett Smith 15 days ago
This happened again on pirca on 2025-04-07 running 3.1.1~rc2. It affected container pirca-dz642-cgjbiz1wnjwa9kn. This was one step in a larger workflow pirca-xvhdp-f68qx7rob3p65k0 where every other step apparently ran without issue.
When this happens the user experience is miserable. The process page in Workbench can't show any logs, just "Error: Failed to start container. Cancelled after exceeding 'Containers.MaxDispatchAttempts'". The larger workflow fails with messages like "cwltool: [step mark-duplicates] Output is missing expected field keep:723ccb12f21518b2fe936ec68d2d5927+2900/helper/bwamem-gatk-report-wf.cwl#mark-duplicates/dupbam", which sounds like a problem with your workflow, but it isn't: the field is missing because the step never ran. The misleading diagnosis stems from the presentation in a-c-r's logs:
2025-04-07T16:17:03.088631091Z INFO [container mark-duplicates_2] pirca-xvhdp-bu1zt795wlyxsr2 state is Committed
2025-04-07T16:18:38.905393926Z INFO [workflow bwamem-gatk-report_2] starting step mark-duplicates_2
2025-04-07T16:18:38.905530411Z INFO [step mark-duplicates_2] start
2025-04-07T16:26:02.964049781Z INFO [step mark-duplicates_2] completed success
2025-04-07T18:02:02.572908811Z INFO [container mark-duplicates_2] pirca-xvhdp-bu1zt795wlyxsr2 is Final
2025-04-07T18:02:02.647185846Z ERROR [step mark-duplicates] Output is missing expected field keep:723ccb12f21518b2fe936ec68d2d5927+2900/helper/bwamem-gatk-report-wf.cwl#mark-duplicates/dupbam
2025-04-07T18:02:02.729833977Z ERROR [step mark-duplicates] Output is missing expected field keep:723ccb12f21518b2fe936ec68d2d5927+2900/helper/bwamem-gatk-report-wf.cwl#mark-duplicates/dupmetrics
2025-04-07T18:02:02.861320386Z WARNING [step mark-duplicates] completed permanentFail
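For reference, the limit named in the Workbench error above is the cluster configuration setting Containers.MaxDispatchAttempts. A sketch of where it sits in the cluster config; the value shown is illustrative only, not necessarily what pirca or tordo actually use:

Clusters:
  zzzzz:
    Containers:
      # How many dispatch attempts a-d-c makes before cancelling the
      # container. Illustrative value only.
      MaxDispatchAttempts: 5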
Updated by Tom Clegg 15 days ago
It sounds like the problem is that the instance is unusable (so, not the container's fault). We could make a few improvements:
- a-d-c should give up on that instance / retry on a different one, not just retry the same instance until giving up
- a-d-c should have a mechanism for retrieving log messages (from journald?) for crunch-run attempts that didn't get as far as creating a log collection, and saving them to a log collection and/or a-d-c logs
- a-c-r should report "did not run" instead of "missing expected field"
Updated by Brett Smith 14 days ago
Tom Clegg wrote in #note-3:
It sounds like the problem is that the instance is unusable (so, not the container's fault). We could make a few improvements:
- a-d-c should give up on that instance / retry on a different one, not just retry the same instance until giving up
+1. Note that I don't think it should just do that: it should retry a "few" times on the same instance (maybe with a little backoff), then give up on the instance if there's no progress (see the sketch at the end of this note). We'd have to talk through how to reconcile that with the existing MaxDispatchAttempts configuration. But this would be the change that makes the most progress towards workflows succeeding.
- a-d-c should have a mechanism for retrieving log messages (from journald?) for crunch-run attempts that didn't get as far as creating a log collection, and saving them to a log collection and/or a-d-c logs
That's #21581.
- a-c-r should report "did not run" instead of "missing expected field"
+1. As a reader I don't know what to make of the fact that a-c-r logs "completed success" for a step that apparently never succeeds. Maybe it's logging that a submission or dispatch happened, but if so the message could be more specific.
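To make the first bullet concrete, here is a minimal sketch of the kind of per-instance bookkeeping being proposed: a few retries on the same instance with backoff, then giving up on that instance so the container is rescheduled elsewhere. Everything here (instanceRetry, maxAttemptsPerInstance, onCrunchRunExited) is hypothetical, not part of a-d-c's actual code, and it deliberately ignores how this would be reconciled with MaxDispatchAttempts:

// Hypothetical sketch of the retry policy discussed above. None of
// these names exist in arvados-dispatch-cloud; this only illustrates
// the idea of "retry a few times with backoff, then give up on the
// instance".
package dispatchsketch

import "time"

type instanceRetry struct {
	attempts int       // crunch-run attempts on this instance for this container
	nextTry  time.Time // earliest time to try again on this instance
}

const (
	maxAttemptsPerInstance = 3
	baseBackoff            = 10 * time.Second
)

// onCrunchRunExited is called when a crunch-run attempt ends without the
// container leaving the Locked state. It returns true if the dispatcher
// should stop using this instance and reschedule the container elsewhere.
func onCrunchRunExited(r *instanceRetry) (giveUpOnInstance bool) {
	r.attempts++
	if r.attempts >= maxAttemptsPerInstance {
		return true // no progress here; try a different instance
	}
	// Exponential backoff before the next attempt on the same instance.
	r.nextTry = time.Now().Add(baseBackoff << (r.attempts - 1))
	return false
}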
Updated by Brett Smith 14 days ago
I now have reason to suspect that this is not a transient error but a new bug in crunch-run 3.1.0 that causes it to crash while handling containers with some as-yet-undetermined property. #22763 documents it failing on the same step in a handful of similar-but-not-identical workflow runs. We have never seen that workflow succeed on crunch-run 3.1.0+. Note that when I initially filed this bug a month ago, it was on tordo, so it would've been running post-3.0.0 code. As seen in the logs for parent workflow tordo-xvhdp-kzwf8sdjmevkeht:
2025-02-26T00:39:48.533290016Z crunch-run 3.1.0~dev20250220185700 (go1.23.6) started
Updated by Brett Smith 14 days ago
- Related to Bug #22763: Figure out why WGS-processing chr19 crashes crunch-run 3.1.0 added
Updated by Brett Smith 13 days ago
- Target version set to Development 2025-04-16
- Assigned To set to Tom Clegg
- Subject changed from arvados-dispatch-cloud got in a tight loop of retrying crunch-run when it failed immediately to crunch-run 3.1.0 fails immediately when trying to start certain containers
Updated by Brett Smith 13 days ago
Tom pointed out this could be a Debian upgrade issue rather than a 3.1.0 issue. As it happens, tordo is currently running 3.1.0 everywhere. We have built a temporary compute node image for it based on Debian 11, and now the "3.1.0 on Debian 11" test workflow is tordo-xvhdp-uza14416sntuzy2.
Updated by Brett Smith 12 days ago
"Debian 11 on 3.1.0" test runs tordo-xvhdp-uza14416sntuzy2 and tordo-xvhdp-w0rfx0v5b5949tw both failed in a way that is very similar to the Debian 12 test runs, but not quite identical. They both got past the mark-duplicates step, but then a later gather-GVCFs step failed for exceeding MaxDispatchAttempts.
Updated by Brett Smith 12 days ago
I have the entire journal from this run attempt, but I don't want to post it publicly. The relevant excerpt:
Apr 10 08:49:47 dockerd[745]: time="2025-04-10T12:49:47.089168237Z" level=info msg="Layer sha256:5bcad8f8b858991e09c1ea59d2b1d8a1fcb424445493825972918b48c5170289 cleaned up"
Apr 10 08:49:47 dockerd[745]: time="2025-04-10T12:49:47.089960686Z" level=info msg="Layer sha256:9adaba72eca85f66961ee4fb5c0b646b078377439690fa94827bf35a5f0cfd5e cleaned up"
Apr 10 08:49:47 dockerd[745]: time="2025-04-10T12:49:47.090707511Z" level=info msg="Layer sha256:ccab9ec9909359830f26cd442a80b08ab27a8fd34881495da7a17ac224ceebf1 cleaned up"
Apr 10 08:49:47 dockerd[745]: time="2025-04-10T12:49:47.191874552Z" level=info msg="Layer sha256:739482a9723dbee6dbdad6b669090e96d5f57e2aa27c131015cce8969d0d4efa cleaned up"
Apr 10 08:49:48 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl 2025-04-10T12:49:48.166087313Z loaded image: response {"errorDetail":{"message":"write /usr/local/lib/R/site-library/BH/include/boost/archive/iterators/xml_escape.hpp: no space left on device"},"error":"write /usr/local/lib/R/site-library/BH/include/boost/archive/iterators/xml_escape.hpp: no space left on device"}
Apr 10 08:49:48 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl 2025-04-10T12:49:48.166160514Z Creating Docker container
Apr 10 08:49:48 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl 2025-04-10T12:49:48.169728752Z error in Run: While creating container: Error response from daemon: No such image: sha256:e7b5bddabcf740c7a382991098225f067134790b480049b0b8dd53bc38d8814b
Apr 10 08:49:48 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl 2025-04-10T12:49:48.206013218Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.tordo-dz642-oljtukc4zlwwwpl.1732696392/keep2513045995]
Apr 10 08:49:48 crunch-run[1136]: 2025-04-10T12:49:48.475241603Z fusermount: failed to unmount /tmp/crunch-run.tordo-dz642-oljtukc4zlwwwpl.1732696392/keep2513045995: Invalid argument
Apr 10 08:49:48 systemd[1]: tmp-crunch\x2drun.tordo\x2ddz642\x2doljtukc4zlwwwpl.1732696392-keep2513045995.mount: Succeeded.
Apr 10 08:49:48 systemd[837]: tmp-crunch\x2drun.tordo\x2ddz642\x2doljtukc4zlwwwpl.1732696392-keep2513045995.mount: Succeeded.
Apr 10 08:49:50 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl 2025-04-10T12:49:50.758562807Z crunch-run finished
Apr 10 08:49:50 crunch-run[1136]: tordo-dz642-oljtukc4zlwwwpl: While creating container: Error response from daemon: No such image: sha256:e7b5bddabcf740c7a382991098225f067134790b480049b0b8dd53bc38d8814b
Updated by Brett Smith 12 days ago
Apr 10 08:48:42 systemd[1]: docker.socket: Found ordering cycle on arvados-ensure-encrypted-partitions.service/start
Apr 10 08:48:42 systemd[1]: docker.socket: Found dependency on basic.target/start
Apr 10 08:48:42 systemd[1]: docker.socket: Found dependency on sockets.target/start
Apr 10 08:48:42 systemd[1]: docker.socket: Found dependency on docker.socket/start
Apr 10 08:48:42 systemd[1]: docker.socket: Job arvados-ensure-encrypted-partitions.service/start deleted to break ordering cycle starting with docker.socket/start
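For context, this is the generic systemd pattern where a service that keeps the default dependencies (and is therefore ordered after basic.target) also tries to order itself before docker.socket, which sockets.target (and through it basic.target) waits for. The following is a schematic illustration only, not the actual unit file shipped in the compute image:

# Schematic illustration of how the cycle above can arise; not the real
# arvados-ensure-encrypted-partitions.service.
[Unit]
Description=Set up encrypted scratch partitions (illustrative)
# Implicit via DefaultDependencies=yes: After=basic.target
# Declaring Before=docker.socket closes a loop, because docker.socket is
# pulled in by sockets.target, which basic.target is itself ordered after:
#   this service -> basic.target -> sockets.target -> docker.socket -> this service
Before=docker.socket docker.service

# systemd breaks such a cycle by deleting one job -- here the
# ensure-encrypted-partitions job -- so the partitions never get set up,
# which is exactly what the journal excerpt above shows.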
Updated by Brett Smith 12 days ago
- Assigned To changed from Tom Clegg to Brett Smith
- Status changed from New to In Progress
- Subject changed from crunch-run 3.1.0 fails immediately when trying to start certain containers to Compute node images don't ensure encrypted partitions because of cycle with docker.socket
Updating this bug to reflect the root cause of the failure. The way crunch-run reacts to a failure to load the Docker image is definitely not ideal; I will spin out a separate bug about that.
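For anyone following up on that crunch-run issue: the Docker daemon reports image-load failures (like the "no space left on device" above) inside the JSON message stream returned by the load call, not as an HTTP-level error, so a client that doesn't scan that stream goes on to "Creating Docker container" and only then hits "No such image". A minimal, illustrative sketch of the kind of check involved; this is not crunch-run's actual code, and the names (loadMessage, checkImageLoad) are made up:

// Sketch of scanning a Docker image-load response stream for errors.
// The daemon returns a sequence of JSON messages; a failed load shows
// up as an "error"/"errorDetail" message mid-stream rather than as an
// HTTP error, so callers must check for it before creating a container.
package dockersketch

import (
	"encoding/json"
	"fmt"
	"io"
)

type loadMessage struct {
	Stream      string `json:"stream"`
	Error       string `json:"error"`
	ErrorDetail *struct {
		Message string `json:"message"`
	} `json:"errorDetail"`
}

// checkImageLoad consumes the response body of an image load request and
// returns an error if the daemon reported a failure anywhere in the stream.
func checkImageLoad(body io.Reader) error {
	dec := json.NewDecoder(body)
	for {
		var msg loadMessage
		if err := dec.Decode(&msg); err == io.EOF {
			return nil // stream ended with no reported error
		} else if err != nil {
			return fmt.Errorf("reading image load response: %w", err)
		}
		if msg.Error != "" {
			return fmt.Errorf("image load failed: %s", msg.Error)
		}
	}
}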
Updated by Brett Smith 12 days ago
22617-fix-encrypted-partitions @ 0f2503823ab9842491da30980541917654513c59
tordo compute node image build: packer-build-compute-image: #322
WGS workflow now passing: tordo-xvhdp-qk0de7b3x8xvdwn
- All agreed upon points are implemented / addressed.
- The first commit fixes the cycle by simply removing the relationship with docker.socket. As noted in the commit message, the existing relationship with docker.service is sufficient. I am confident this was effective in the image build because tordo's a-d-c logs were previously reporting a higher-than-usual rate of broken cloud instances; these are the ones that likely got "fixed" by systemd choosing to remove docker.socket from the runtime target instead. Zero instances were reported broken during the latest pipeline run.
The second commit addresses #22763#note-14. It was possible to address this purely with Ansible variables, and I did that for at least one build, but having this logic in the default is closer to the behavior we want in different cases. You can tell this took because the image build log shows that installing the Arvados package pin was skipped.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- Yes, many, see thread
- Code is tested and passing, both automated and manual, what manual testing was done is described
- See above
- Documentation has been updated.
- N/A, pure bugfix
- Behaves appropriately at the intended scale (describe intended scale).
- No change in scale
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- N/A (no Ansible style guide)
Updated by Brett Smith 12 days ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|86c98d31c7810abed4e7c037c40aad2a1f3cf938.
Updated by Brett Smith 11 days ago
- Release set to 77
- Category changed from Crunch to Deployment