Bug #13022
crunch-run broken container loop
Description
https://workbench.9tee4.arvadosapi.com/container_requests/9tee4-xvhdp-vopb57pt6o9eij1#Log
Failed partway through initialization:
2018-02-01T20:05:03.402107528Z While attaching container stdout/stderr streams: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect: no such file or directory
2018-02-01T20:05:03.470730548Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-gobx4a24ihi8xpj.743593838/keep576772597]
Then it gets stuck in a loop trying to re-run the container:
2018-02-01T20:06:03.263329220Z Creating Docker container
2018-02-01T20:06:03.267277338Z While creating container: Error response from daemon: Conflict. The name "/9tee4-dz642-gobx4a24ihi8xpj" is already in use by container d2fd14fd8d99ff51fb31b489c285eb767a0309cc64d37317250ce5c0ee7b5802. You have to remove (or rename) that container to be able to reuse that name.
2018-02-01T20:06:03.345808678Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-gobx4a24ihi8xpj.248318477/keep062669320]
In addition, arv-mount apparently gets terminated (maybe by slurm doing killpg?) but the run directory is left in /tmp and there is a dangling mountpoint in mtab.
Looking at compute0.9tee4, I saw evidence (garbage in /tmp) that this has happened before.
Subtasks
Related issues
Associated revisions
History
#1
Updated by Peter Amstutz about 3 years ago
- Status changed from New to In Progress
#2
Updated by Peter Amstutz about 3 years ago
- Status changed from In Progress to New
#3
Updated by Peter Amstutz about 3 years ago
- Description updated (diff)
#4
Updated by Peter Amstutz about 3 years ago
- Description updated (diff)
#5
Updated by Tom Clegg about 3 years ago
- Assigned To set to Tom Clegg
#6
Updated by Tom Clegg about 3 years ago
End of slurm-2297.out on compute0.9tee4, whose temp dir was not removed:
9tee4-dz642-ml66hs2c3ook85z 2018-02-02T21:18:46.856481052Z Complete
9tee4-dz642-ml66hs2c3ook85z 2018-02-02T21:18:47.260490146Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-ml66hs2c3ook85z.721044901/keep566145728]
slurmstepd: error: *** JOB 2297 CANCELLED AT 2018-02-02T21:18:47 *** on compute0
slurmstepd: error: _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm/uid_0/job_2297/step_batch: Device or resource busy
Calling stopSignals() before CleanupDirs() means we abandon CleanupDirs() when crunch-dispatch-slurm sends a[nother] TERM signal after the container has exited.
AFAIK we always want to do an orderly shutdown no matter when we get SIGTERM, so the solution seems to be:
- remove stopSignals() entirely
- hold cStateLock in CommitLogs() to prevent the signal handler from using CrunchLog while CommitLogs is closing it and swapping it out for a new one
#7
Updated by Tom Clegg about 3 years ago
- Status changed from New to In Progress
#8
Updated by Tom Clegg about 3 years ago
13022-tmp-cleanup @ a6228b5228c807fcee897da58d18ee542e930d77
#9
Updated by Lucas Di Pentima about 3 years ago
LGTM
#10
Updated by Anonymous about 3 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|0fab8a581c4a711408150ed64ce909d9afda7829.
#11
Updated by Tom Clegg about 3 years ago
- Related to Bug #13095: when slurm murders a crunch2 job because it exceeds the memory limit, the container is left with a null `log` added
#12
Updated by Tom Morris over 2 years ago
- Release set to 17
Merge branch '13022-tmp-cleanup'
fixes #13022
Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tclegg@veritasgenetics.com>