Bug #14358

closed

[crunch-run] Don't get stuck on ContainerWait

Added by Peter Amstutz over 5 years ago. Updated over 5 years ago.

Status: Duplicate
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

A stuck container can't be cancelled. The job got stuck (for reasons unknown) and the user attempted to cancel it. On the first attempt, removing the container returns a cleanup error. Subsequent attempts to remove the container then fail because the container is no longer present. A possible explanation is that this hits a Docker bug in which ContainerWait never gets a signal that the container has terminated.

Suggest that, similar to the "containerdGone" channel, there should be a channel that the stop() method signals if ContainerRemove() returns an error, so that the wait on ContainerWait can be abandoned.
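A rough sketch of that suggestion, not taken from the crunch-run source: every name here (`remover`, `runner`, `removeFailed`, `waitFinish`, `failingRemover`) is hypothetical, and the `ContainerRemove` signature is simplified from the real Docker client. It only illustrates the idea that a failed removal in stop() unblocks whatever is waiting for the container to finish.

```go
package main

import (
	"errors"
	"fmt"
)

// remover is a hypothetical stand-in for the part of the Docker client
// that stop() calls. The real ContainerRemove takes more arguments.
type remover interface {
	ContainerRemove(id string) error
}

// runner is a hypothetical, stripped-down container runner.
type runner struct {
	client       remover
	containerID  string
	removeFailed chan error // assumed new channel, signaled by stop()
}

func newRunner(c remover, id string) *runner {
	// Buffer of 1 so stop() never blocks if nobody is waiting yet.
	return &runner{client: c, containerID: id, removeFailed: make(chan error, 1)}
}

// stop tries to remove the container. If removal fails, it signals
// removeFailed without blocking; repeated failures are dropped while
// one is already pending.
func (r *runner) stop() {
	if err := r.client.ContainerRemove(r.containerID); err != nil {
		select {
		case r.removeFailed <- err:
		default:
		}
	}
}

// waitFinish returns when the container reports an exit code on waitC
// or when stop() reports a removal error, whichever comes first.
func (r *runner) waitFinish(waitC <-chan int) (int, error) {
	select {
	case code := <-waitC:
		return code, nil
	case err := <-r.removeFailed:
		return -1, fmt.Errorf("giving up on container wait: %w", err)
	}
}

// failingRemover simulates a daemon that cannot remove the container.
type failingRemover struct{}

func (failingRemover) ContainerRemove(id string) error {
	return errors.New("device or resource busy")
}

func main() {
	r := newRunner(failingRemover{}, "example-container-id")
	r.stop()
	// waitC never delivers, mimicking a ContainerWait that never returns.
	_, err := r.waitFinish(make(chan int))
	fmt.Println(err)
}
```

With this shape, the wait no longer depends solely on ContainerWait returning. Whether a removal error should be treated as fatal, or only after a "No such container" response, is a design choice left open here.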

Example log:

e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:51:26.850704287Z Starting Docker container id 'fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b'
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:51:27.543947488Z Waiting for container to finish
slurmstepd: error: *** JOB 430894 ON compute26 CANCELLED AT 2018-10-15T17:27:38 ***
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:27:38.118527362Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:27:38.118562263Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:01.130851892Z error removing container: Error response from daemon: Unable to remove filesystem for fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b: remove /tmp/docker/containers/fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b/shm: device or resource busy
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:38.215155424Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:38.215196326Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:38.215838654Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:29:38.092710107Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:29:38.092753409Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:29:38.093342235Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b

Files

syslog-20181016 (1.19 MB) Nico César, 10/16/2018 05:46 PM