crunch-run broken container loop
Failed partway through initialization:
2018-02-01T20:05:03.402107528Z While attaching container stdout/stderr streams: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect: no such file or directory
2018-02-01T20:05:03.470730548Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-gobx4a24ihi8xpj.743593838/keep576772597]
Then it gets stuck in a loop trying to re-run the container:
2018-02-01T20:06:03.263329220Z Creating Docker container
2018-02-01T20:06:03.267277338Z While creating container: Error response from daemon: Conflict. The name "/9tee4-dz642-gobx4a24ihi8xpj" is already in use by container d2fd14fd8d99ff51fb31b489c285eb767a0309cc64d37317250ce5c0ee7b5802. You have to remove (or rename) that container to be able to reuse that name.
2018-02-01T20:06:03.345808678Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-gobx4a24ihi8xpj.248318477/keep062669320]
In addition, arv-mount apparently gets terminated (maybe by slurm doing killpg?) but the run directory is left in /tmp and there is a dangling mountpoint in mtab.
Looking at compute0.9tee4, I saw evidence (garbage in /tmp) that this has happened before.
#6 Updated by Tom Clegg about 3 years ago
End of slurm-2297.out on compute0.9tee4, whose temp dir was not removed:
9tee4-dz642-ml66hs2c3ook85z 2018-02-02T21:18:46.856481052Z Complete
9tee4-dz642-ml66hs2c3ook85z 2018-02-02T21:18:47.260490146Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-ml66hs2c3ook85z.721044901/keep566145728]
slurmstepd: error: *** JOB 2297 CANCELLED AT 2018-02-02T21:18:47 *** on compute0
slurmstepd: error: _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm/uid_0/job_2297/step_batch: Device or resource busy
Calling stopSignals() before CleanupDirs() means we abandon CleanupDirs() when crunch-dispatch-slurm sends a[nother] TERM signal after the container has exited. AFAIK we always want to do an orderly shutdown no matter when we get SIGTERM, so the solution seems to be:
- remove stopSignals() entirely
- hold cStateLock in CommitLogs() to prevent the signal handler from using CrunchLog while CommitLogs is closing it and swapping it out for a new one