Bug #13022

crunch-run broken container loop

Added by Peter Amstutz over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: -
Target version: -
Start date: 02/05/2018
Due date: -
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

https://workbench.9tee4.arvadosapi.com/container_requests/9tee4-xvhdp-vopb57pt6o9eij1#Log

Failed partway through initialization:

2018-02-01T20:05:03.402107528Z While attaching container stdout/stderr streams: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect: no such file or directory
2018-02-01T20:05:03.470730548Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-gobx4a24ihi8xpj.743593838/keep576772597]

Then crunch-run gets stuck in a loop trying to re-run the container:

2018-02-01T20:06:03.263329220Z Creating Docker container
2018-02-01T20:06:03.267277338Z While creating container: Error response from daemon: Conflict. The name "/9tee4-dz642-gobx4a24ihi8xpj" is already in use by container d2fd14fd8d99ff51fb31b489c285eb767a0309cc64d37317250ce5c0ee7b5802. You have to remove (or rename) that container to be able to reuse that name.
2018-02-01T20:06:03.345808678Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-gobx4a24ihi8xpj.248318477/keep062669320] 

In addition, arv-mount apparently gets terminated (perhaps by slurm doing a killpg?), but the run directory is left behind in /tmp and a dangling mountpoint remains in mtab.

Looking at compute0.9tee4, I saw evidence (garbage in /tmp) that this has happened before.
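
For context, crunch-run creates the Docker container under a fixed name (the container UUID, "/9tee4-dz642-gobx4a24ihi8xpj" above), so once a create attempt partially succeeds, every retry collides with the leftover container. Below is a minimal sketch of one defensive pattern using the Docker Go SDK of that era (five-argument ContainerCreate); createContainer is a hypothetical helper for illustration only, not the fix that was merged (that was branch 13022-tmp-cleanup, below):

    package sketch

    import (
        "context"
        "fmt"

        "github.com/docker/docker/api/types"
        "github.com/docker/docker/api/types/container"
        "github.com/docker/docker/client"
    )

    // createContainer (hypothetical) force-removes any stale container
    // left behind by an earlier, partially failed attempt before creating
    // a new one under the same fixed name, so a retry does not fail with
    // "The name ... is already in use".
    func createContainer(ctx context.Context, cli *client.Client, name, image string) (string, error) {
        // A "not found" error just means there was nothing to clean up;
        // anything else is a real failure.
        err := cli.ContainerRemove(ctx, name, types.ContainerRemoveOptions{Force: true})
        if err != nil && !client.IsErrNotFound(err) {
            return "", fmt.Errorf("removing stale container %q: %v", name, err)
        }
        created, err := cli.ContainerCreate(ctx, &container.Config{Image: image}, nil, nil, name)
        if err != nil {
            return "", fmt.Errorf("creating container %q: %v", name, err)
        }
        return created.ID, nil
    }

Generating a fresh container name on each retry would also avoid the conflict, but would trade it for an accumulation of stale containers on the node.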


Subtasks

Task #13030: Review 13022-tmp-cleanup (Resolved, Tom Clegg)


Related issues

Related to Arvados - Bug #13095: when slurm murders a crunch2 job because it exceeds the memory limit, the container is left with a null `log` (Closed)

Associated revisions

Revision 0fab8a58
Added by Tom Clegg over 3 years ago

Merge branch '13022-tmp-cleanup'

fixes #13022

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Peter Amstutz over 3 years ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz over 3 years ago

  • Status changed from In Progress to New

#3 Updated by Peter Amstutz over 3 years ago

  • Description updated (diff)

#4 Updated by Peter Amstutz over 3 years ago

  • Description updated (diff)

#5 Updated by Tom Clegg over 3 years ago

  • Assigned To set to Tom Clegg

#6 Updated by Tom Clegg over 3 years ago

End of slurm-2297.out on compute0.9tee4, whose temp dir was not removed:

9tee4-dz642-ml66hs2c3ook85z 2018-02-02T21:18:46.856481052Z Complete
9tee4-dz642-ml66hs2c3ook85z 2018-02-02T21:18:47.260490146Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-ml66hs2c3ook85z.721044901/keep566145728]
slurmstepd: error: *** JOB 2297 CANCELLED AT 2018-02-02T21:18:47 *** on compute0
slurmstepd: error: _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm/uid_0/job_2297/step_batch: Device or resource busy

Calling stopSignals() before CleanupDirs() means we abandon CleanupDirs() when crunch-dispatch-slurm sends a[nother] TERM signal after the container has exited.

AFAIK we always want to do an orderly shutdown no matter when we get SIGTERM, so the solution seems to be (see the sketch after this list):
  1. remove stopSignals() entirely
  2. hold cStateLock in CommitLogs() to prevent the signal handler from using CrunchLog while CommitLogs is closing it and swapping it out for a new one
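
A minimal sketch of that shape, using hypothetical stand-in names (runner, handleSignals, commitLogs) rather than the real crunch-run code: the handler is installed once and never removed, and cStateLock serializes its log writes against the swap in commitLogs:

    package main

    import (
        "io/ioutil"
        "log"
        "os"
        "os/signal"
        "sync"
        "syscall"
        "time"
    )

    // runner is a stand-in for crunch-run's ContainerRunner. cStateLock
    // guards crunchLog so the signal handler and commitLogs never touch
    // it concurrently.
    type runner struct {
        cStateLock sync.Mutex
        crunchLog  *log.Logger
    }

    // handleSignals installs a handler that is never removed (item 1: no
    // stopSignals()), so SIGTERM always triggers an orderly shutdown,
    // even after the container has exited.
    func (r *runner) handleSignals(cleanup func()) {
        ch := make(chan os.Signal, 1)
        signal.Notify(ch, syscall.SIGTERM, syscall.SIGINT)
        go func() {
            s := <-ch
            r.cStateLock.Lock()
            r.crunchLog.Printf("caught %v; shutting down in an orderly way", s)
            r.cStateLock.Unlock()
            cleanup() // always unmount keep and remove temp dirs
            os.Exit(1)
        }()
    }

    // commitLogs closes out the current log and swaps in a new one while
    // holding cStateLock (item 2), so a TERM arriving mid-swap cannot
    // make the signal handler write to a half-closed logger.
    func (r *runner) commitLogs() {
        r.cStateLock.Lock()
        defer r.cStateLock.Unlock()
        r.crunchLog.Print("committing logs")
        r.crunchLog = log.New(ioutil.Discard, "", 0) // stand-in for the fresh logger
    }

    func main() {
        r := &runner{crunchLog: log.New(os.Stderr, "crunch-run ", log.LstdFlags)}
        r.handleSignals(func() { r.commitLogs() })
        time.Sleep(time.Second) // stand-in for running the container
        r.commitLogs()
    }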

#7 Updated by Tom Clegg over 3 years ago

  • Status changed from New to In Progress

#9 Updated by Lucas Di Pentima over 3 years ago

LGTM

#10 Updated by Anonymous over 3 years ago

  • Status changed from In Progress to Resolved

#11 Updated by Tom Clegg over 3 years ago

  • Related to Bug #13095: when slurm murders a crunch2 job because it exceeds the memory limit, the container is left with a null `log` added

#12 Updated by Tom Morris about 3 years ago

  • Release set to 17
