Bug #13022 (closed): crunch-run broken container loop

Added by Peter Amstutz about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: -
Target version: -
Story points: -
Release relationship: Auto

Description

https://workbench.9tee4.arvadosapi.com/container_requests/9tee4-xvhdp-vopb57pt6o9eij1#Log

Failed partway through initialization:

2018-02-01T20:05:03.402107528Z While attaching container stdout/stderr streams: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect: no such file or directory
2018-02-01T20:05:03.470730548Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-gobx4a24ihi8xpj.743593838/keep576772597]
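
As an illustration only (not crunch-run's actual code), the kind of fail-fast check that would catch this condition before container setup starts looks like the sketch below, using the Docker Go SDK; the function name and timeout are assumptions.

// Sketch only: refuse to start a container run if the Docker daemon is
// unreachable, instead of failing partway through initialization as above.
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/docker/docker/client"
)

func checkDockerDaemon(ctx context.Context) error {
    cli, err := client.NewClientWithOpts(client.FromEnv)
    if err != nil {
        return fmt.Errorf("creating Docker client: %w", err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()

    // Ping talks to the daemon's /_ping endpoint; if the daemon is not
    // running, it fails with the same "dial unix /var/run/docker.sock"
    // error shown in the log above.
    if _, err := cli.Ping(ctx); err != nil {
        return fmt.Errorf("cannot connect to the Docker daemon: %w", err)
    }
    return nil
}

func main() {
    if err := checkDockerDaemon(context.Background()); err != nil {
        log.Fatal(err) // do not attempt the container run at all
    }
    log.Print("Docker daemon is reachable")
}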

Then it gets stuck in a loop trying to re-run the container:

2018-02-01T20:06:03.263329220Z Creating Docker container
2018-02-01T20:06:03.267277338Z While creating container: Error response from daemon: Conflict. The name "/9tee4-dz642-gobx4a24ihi8xpj" is already in use by container d2fd14fd8d99ff51fb31b489c285eb767a0309cc64d37317250ce5c0ee7b5802. You have to remove (or rename) that container to be able to reuse that name.
2018-02-01T20:06:03.345808678Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-gobx4a24ihi8xpj.248318477/keep062669320] 
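
The loop happens because each retry tries to create a container under the same name (/9tee4-dz642-gobx4a24ihi8xpj) while the container from the failed attempt still exists. The daemon's message says that container has to be removed or renamed before the name can be reused; the sketch below shows a force-remove by name with the Docker Go SDK. This is illustrative only, not the fix that was merged for this ticket, and the helper name is an assumption.

// Sketch only: remove a stale container that still owns the name we want,
// as the "Conflict" error above demands. Not the change made for this bug.
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/docker/docker/api/types"
    "github.com/docker/docker/client"
)

func removeStaleContainer(ctx context.Context, cli *client.Client, name string) error {
    err := cli.ContainerRemove(ctx, name, types.ContainerRemoveOptions{
        Force:         true, // kill it first if it is somehow still running
        RemoveVolumes: true, // also drop anonymous volumes created with it
    })
    if err != nil && !client.IsErrNotFound(err) {
        return fmt.Errorf("removing stale container %q: %w", name, err)
    }
    return nil
}

func main() {
    cli, err := client.NewClientWithOpts(client.FromEnv)
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    // The name from the log above, with Docker's leading slash stripped.
    if err := removeStaleContainer(context.Background(), cli, "9tee4-dz642-gobx4a24ihi8xpj"); err != nil {
        log.Fatal(err)
    }
}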

In addition, arv-mount apparently gets terminated (maybe by slurm doing killpg?), but the run directory is left in /tmp and there is a dangling mountpoint in mtab.

Looking at compute0.9tee4, I saw evidence (garbage in /tmp) that this has happened before.
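
The log above already shows the cleanup command crunch-run uses (arv-mount --unmount-timeout=8 --unmount <rundir>/keep...). As a hedged sketch only, an orderly cleanup of the run directory would unmount each keep mountpoint and then delete the directory; the helper name, glob pattern, and hard-coded path below are assumptions, not actual crunch-run code.

// Sketch only: unmount every keep mountpoint under a crunch-run temp
// directory, then delete the directory, so nothing is left in /tmp and
// no dangling entry remains in mtab.
package main

import (
    "fmt"
    "log"
    "os"
    "os/exec"
    "path/filepath"
)

func cleanupRunDir(runDir string) error {
    mounts, err := filepath.Glob(filepath.Join(runDir, "keep*"))
    if err != nil {
        return err
    }
    for _, m := range mounts {
        // Same command and flags as in the log above.
        cmd := exec.Command("arv-mount", "--unmount-timeout=8", "--unmount", m)
        cmd.Stdout = os.Stderr
        cmd.Stderr = os.Stderr
        if err := cmd.Run(); err != nil {
            return fmt.Errorf("unmounting %s: %w", m, err)
        }
    }
    // Only after the mountpoints are gone is it safe to delete the directory.
    return os.RemoveAll(runDir)
}

func main() {
    // Hypothetical path; real run directories look like the one in the log.
    if err := cleanupRunDir("/tmp/crunch-run.9tee4-dz642-gobx4a24ihi8xpj.743593838"); err != nil {
        log.Fatal(err)
    }
}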


Subtasks 1 (0 open, 1 closed)

Task #13030: Review 13022-tmp-cleanup (Resolved, Tom Clegg, 02/05/2018)

Related issues

Related to Arvados - Bug #13095: when slurm murders a crunch2 job because it exceeds the memory limit, the container is left with a null `log` (Closed, Joshua Randall)
#1

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to In Progress
#2

Updated by Peter Amstutz about 6 years ago

  • Status changed from In Progress to New
#3

Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)
#4

Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)
#5

Updated by Tom Clegg about 6 years ago

  • Assigned To set to Tom Clegg
#6

Updated by Tom Clegg about 6 years ago

End of slurm-2297.out on compute0.9tee4, whose temp dir was not removed:

9tee4-dz642-ml66hs2c3ook85z 2018-02-02T21:18:46.856481052Z Complete
9tee4-dz642-ml66hs2c3ook85z 2018-02-02T21:18:47.260490146Z Running [arv-mount --unmount-timeout=8 --unmount /tmp/crunch-run.9tee4-dz642-ml66hs2c3ook85z.721044901/keep566145728]
slurmstepd: error: *** JOB 2297 CANCELLED AT 2018-02-02T21:18:47 *** on compute0
slurmstepd: error: _slurm_cgroup_destroy: problem deleting step cgroup path /sys/fs/cgroup/freezer/slurm/uid_0/job_2297/step_batch: Device or resource busy

Calling stopSignals() before CleanupDirs() means we abandon CleanupDirs() when crunch-dispatch-slurm sends a[nother] TERM signal after the container has exited.

AFAIK we always want to do an orderly shutdown no matter when we get SIGTERM, so the solution seems to be:
  1. remove stopSignals() entirely
  2. hold cStateLock in CommitLogs() to prevent the signal handler from using CrunchLog while CommitLogs is closing it and swapping it out for a new one (see the sketch just below)
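
A minimal sketch of that locking pattern, assuming a struct along the lines of crunch-run's ContainerRunner: the signal handler and CommitLogs() share cStateLock, so the logger can never be used while it is being closed and swapped. Only cStateLock, CrunchLog and CommitLogs are names taken from this comment; everything else is illustrative, not the actual crunch-run code.

// Sketch only: SIGTERM handling and the CommitLogs swap guarded by one mutex.
package main

import (
    "io"
    "log"
    "os"
    "os/signal"
    "sync"
    "syscall"
)

// ContainerRunner here is a stand-in; only the field and method names
// mentioned in the comment above are taken from the source.
type ContainerRunner struct {
    cStateLock sync.Mutex
    CrunchLog  *log.Logger
    logWriter  io.WriteCloser
}

// handleSignal takes cStateLock before touching CrunchLog, so it can never
// observe the logger mid-swap.
func (runner *ContainerRunner) handleSignal(sig os.Signal) {
    runner.cStateLock.Lock()
    defer runner.cStateLock.Unlock()
    runner.CrunchLog.Printf("caught signal %v; continuing orderly shutdown", sig)
}

// CommitLogs holds cStateLock for the whole close-and-swap, so the signal
// handler cannot write to a logger whose writer has just been closed.
func (runner *ContainerRunner) CommitLogs(newWriter io.WriteCloser) error {
    runner.cStateLock.Lock()
    defer runner.cStateLock.Unlock()

    if err := runner.logWriter.Close(); err != nil {
        return err
    }
    runner.logWriter = newWriter
    runner.CrunchLog = log.New(newWriter, "", log.LstdFlags)
    return nil
}

func main() {
    runner := &ContainerRunner{
        CrunchLog: log.New(os.Stderr, "", log.LstdFlags),
        logWriter: os.Stderr,
    }

    // Matching item 1 above: the handler stays installed for the whole run;
    // there is no stopSignals() call to uninstall it.
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGTERM)
    go func() {
        for sig := range sigs {
            runner.handleSignal(sig)
        }
    }()

    // ... run the container here, then commit logs to a new writer, e.g.
    // runner.CommitLogs(newLogWriter), before final cleanup.
}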
#7

Updated by Tom Clegg about 6 years ago

  • Status changed from New to In Progress
#9

Updated by Lucas Di Pentima about 6 years ago

LGTM

#10

Updated by Anonymous about 6 years ago

  • Status changed from In Progress to Resolved
#11

Updated by Tom Clegg about 6 years ago

  • Related to Bug #13095: when slurm murders a crunch2 job because it exceeds the memory limit, the container is left with a null `log` added
#12

Updated by Tom Morris almost 6 years ago

  • Release set to 17
