Bug #14358

[crunch-run] Don't get stuck on ContainerWait

Added by Peter Amstutz about 2 months ago. Updated 1 day ago.

Status:
Duplicate
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

A stuck container can't be cancelled. The job got stuck (for reasons unknown) and the user attempted to cancel it. On the first delete attempt, removing the container returns a cleanup error. Subsequent attempts to delete the container then fail because the container is no longer present. A possible explanation is a Docker bug that prevents ContainerWait from being notified that the container has terminated.

Suggestion: similar to the "containerdGone" channel, add a channel that the stop() method signals when ContainerRemove() returns an error, so that crunch-run stops waiting on ContainerWait.

Example log:

e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:51:26.850704287Z Starting Docker container id 'fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b'
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:51:27.543947488Z Waiting for container to finish
slurmstepd: error: *** JOB 430894 ON compute26 CANCELLED AT 2018-10-15T17:27:38 ***
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:27:38.118527362Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:27:38.118562263Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:01.130851892Z error removing container: Error response from daemon: Unable to remove filesystem for fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b: remove /tmp/docker/containers/fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b/shm: device or resource busy
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:38.215155424Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:38.215196326Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:38.215838654Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:29:38.092710107Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:29:38.092753409Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:29:38.093342235Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b

syslog-20181016 (1.19 MB) Nico César, 10/16/2018 05:46 PM

History

#1 Updated by Peter Amstutz about 2 months ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz about 2 months ago

  • Status changed from In Progress to New

#3 Updated by Peter Amstutz about 2 months ago

  • Description updated (diff)

#5 Updated by Nico César about 2 months ago

-- Reboot --
Oct 14 16:46:17 compute-5h7fxhrtrbmlzwc-e51c5 systemd[1]: Stopped Docker Application Container Engine.
-- Subject: Unit docker.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has finished shutting down.
Oct 14 16:46:18 compute-5h7fxhrtrbmlzwc-e51c5 systemd[1]: Starting Docker Application Container Engine...
-- Subject: Unit docker.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has begun starting up.
Oct 14 16:46:19 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: Command "daemon" is deprecated, and will be removed in Docker 17.12. Please run `dockerd` directly.
Oct 14 16:46:21 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:21Z" level=warning msg="the \"-g / --graph\" flag is deprecated. Please use \"--data-root\" instead" 
Oct 14 16:46:21 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:21.626854905Z" level=info msg="libcontainerd: new containerd process, pid: 2623" 
Oct 14 16:46:22 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:22.726011706Z" level=info msg="Graph migration to content-addressability took 0.00 seconds" 
Oct 14 16:46:22 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:22.726559206Z" level=warning msg="Your kernel does not support swap memory limit" 
Oct 14 16:46:22 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:22.726847606Z" level=warning msg="Your kernel does not support cgroup rt period" 
Oct 14 16:46:22 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:22.727094806Z" level=warning msg="Your kernel does not support cgroup rt runtime" 
Oct 14 16:46:22 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:22.727319806Z" level=warning msg="Your kernel does not support cgroup blkio weight" 
Oct 14 16:46:22 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:22.727543806Z" level=warning msg="Your kernel does not support cgroup blkio weight_device" 
Oct 14 16:46:22 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:22.729090906Z" level=info msg="Loading containers: start." 
Oct 14 16:46:22 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:22.892980206Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address" 
Oct 14 16:46:22 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:22.939202806Z" level=info msg="Loading containers: done." 
Oct 14 16:46:23 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:23.109814506Z" level=info msg="Daemon has completed initialization" 
Oct 14 16:46:23 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:23.110256706Z" level=info msg="Docker daemon" commit=89658be graphdriver=overlay2 version=17.05.0-ce
Oct 14 16:46:23 compute-5h7fxhrtrbmlzwc-e51c5 systemd[1]: Started Docker Application Container Engine.
-- Subject: Unit docker.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit docker.service has finished starting up.
--
-- The start-up result is done.
Oct 14 16:46:23 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:23.133331806Z" level=info msg="API listen on /var/run/docker.sock" 
Oct 14 16:46:23 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-14T16:46:23.135229006Z" level=info msg="API listen on /var/run/docker.sock" 
Oct 14 16:50:49 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-14T16:50:49.964738565Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing" 
Oct 14 16:50:50 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-14T16:50:50.008549165Z" level=info msg="libcontainerd: new containerd process, pid: 3468" 
Oct 14 16:50:50 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-14T16:50:50.642549866Z" level=info msg="libcontainerd: new containerd process, pid: 3482" 
Oct 14 16:51:17 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-14T16:51:17.980321082Z" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap." 
Oct 15 16:03:51 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T16:03:51.374400579Z" level=error msg="Error running exec in container: rpc error: code = 2 desc = containerd: container not found" 
Oct 15 16:04:10 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T16:04:10.742079119Z" level=error msg="Error running exec in container: rpc error: code = 2 desc = containerd: container not found" 
Oct 15 16:05:08 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T16:05:08.844739012Z" level=error msg="Handler for GET /v1.29/containers/fad4decebdfe/logs returned error: configured logging driver does not support reading" 
Oct 15 16:05:31 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T16:05:31.190806860Z" level=error msg="Error running exec in container: rpc error: code = 2 desc = containerd: container not found" 
Oct 15 16:05:37 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T16:05:37.354718382Z" level=error msg="Error running exec in container: rpc error: code = 2 desc = containerd: container not found" 
Oct 15 17:27:38 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T17:27:38.121075473Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b: rpc error: code = 2 desc = containerd: container not found" 
Oct 15 17:27:48 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T17:27:48.121621949Z" level=info msg="Container fad4decebdfe failed to exit within 10 seconds of kill - trying direct SIGKILL" 
Oct 15 17:27:48 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T17:27:48.122768299Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b: rpc error: code = 2 desc = containerd: container not found" 
Oct 15 17:27:51 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T17:27:51.123253537Z" level=info msg="Container fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b failed to exit within 3 seconds of signal 15 - using the force" 
Oct 15 17:27:51 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T17:27:51.124254181Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b: rpc error: code = 2 desc = containerd: container not found" 
Oct 15 17:28:01 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T17:28:01.124759226Z" level=info msg="Container fad4decebdfe failed to exit within 10 seconds of kill - trying direct SIGKILL" 
Oct 15 17:28:01 compute26.e51c5.arvadosapi.com docker[2592]: time="2018-10-15T17:28:01.130375571Z" level=error msg="Handler for DELETE /v1.21/containers/fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b returned error: Unable to remove filesystem for fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b: remove /tmp/docker/containers/fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b/shm: device or resource busy" 

#6 Updated by Nico César about 2 months ago

2018/10/14 16:46:29 crunch-run 1.2.0 started
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:46:29.606691410Z crunch-run 1.2.0 started
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:46:29.606713610Z Executing container 'e51c5-dz642-kbfummr3ldql6hj'
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:46:29.606743410Z Executing on host 'compute26.e51c5.arvadosapi.com'
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:46:29.722527410Z Fetching Docker image from collection '0e2c58189fd35f203199f45bcc7e386e+1183'
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:46:29.737778210Z Using Docker image id 'sha256:09928be6a96dbc4f99620d02556a7369c25eaa0f70876ecb0fa761df154bbc34'
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:46:29.749469310Z Loading Docker image from keep
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:51:15.046677080Z Docker response: {"stream":"Loaded image ID: sha256:09928be6a96dbc4f99620d02556a7369c25eaa0f70876ecb0fa761df154bbc34\n"}
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:51:15.048043680Z Running [arv-mount --foreground --allow-other --read-write --crunchstat-interval=10 --file-cache 268435456 --mount-by-pdh by_id /tmp/crunch-run.e51c5-dz642-kbfummr3ldql6hj.286296978/keep142949833]
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:51:17.976343082Z Creating Docker container
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:51:26.602587987Z Attaching container streams
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:51:26.850704287Z Starting Docker container id 'fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b'
e51c5-dz642-kbfummr3ldql6hj 2018-10-14T16:51:27.543947488Z Waiting for container to finish
slurmstepd: error: *** JOB 430894 ON compute26 CANCELLED AT 2018-10-15T17:27:38 ***
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:27:38.118527362Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:27:38.118562263Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:01.130851892Z error removing container: Error response from daemon: Unable to remove filesystem for fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b: remove /tmp/docker/containers/fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b/shm: device or resource busy
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:38.215155424Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:38.215196326Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:28:38.215838654Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:29:38.092710107Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:29:38.092753409Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:29:38.093342235Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:30:38.261673690Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:30:38.261714892Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:30:38.262416222Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:31:38.174371468Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-15T17:31:38.174413170Z removing container
(...)
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:27:38.008769071Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:27:38.009521702Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:28:38.016345646Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:28:38.016406849Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:28:38.017243983Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:29:05.185177320Z Arv-mount exit error: signal: killed
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:29:05.185248223Z arv-mount exited while container is still running.  Stopping container.
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:29:05.185260723Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:29:05.186643880Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:29:37.989733938Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:29:37.989799141Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:29:37.990687377Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:30:37.991591884Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:30:37.991677687Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:30:37.992551923Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:31:37.997006842Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:31:37.997087846Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:31:37.997953181Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:32:38.087336744Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:32:38.087409047Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:32:38.088271882Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:33:37.994278981Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:33:37.994328083Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:33:37.995158117Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:34:37.989349977Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:34:37.989410279Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:34:37.990157310Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:35:37.989128403Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:35:37.989167905Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:35:37.990140644Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:36:37.990408220Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:36:37.990489924Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:36:37.991114849Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:37:37.988427926Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:37:37.988479128Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:37:37.989243360Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:38:37.991012834Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:38:37.991062836Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:38:37.991894670Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:39:38.044821716Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:39:38.044862218Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:39:38.045774655Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:40:38.024908868Z caught signal: terminated
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:40:38.024982871Z removing container
e51c5-dz642-kbfummr3ldql6hj 2018-10-16T17:40:38.026296824Z error removing container: Error: No such container: fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b

#7 Updated by Nico César about 2 months ago

Attaching syslog about this

#8 Updated by Nico César about 2 months ago

syslog-20181016-Oct 15 17:26:01 compute-5h7fxhrtrbmlzwc-e51c5 CRON[12874]: (root) CMD (/bin/bash -c 'run-parts /usr/local/share/arvados-compute-ping-controller.d; source /etc/profile.d/rvm.sh && /usr/local/bin/arvados-compute-ping-controller.rb quiet')
syslog-20181016-Oct 15 17:26:01 compute-5h7fxhrtrbmlzwc-e51c5 arvados-compute-ping[12966]: Last ping at 2018-10-15T17:26:01.619318000Z
syslog-20181016-Oct 15 17:27:01 compute-5h7fxhrtrbmlzwc-e51c5 CRON[13884]: (root) CMD (/bin/bash -c 'run-parts /usr/local/share/arvados-compute-ping-controller.d; source /etc/profile.d/rvm.sh && /usr/local/bin/arvados-compute-ping-controller.rb quiet')
syslog-20181016-Oct 15 17:27:01 compute-5h7fxhrtrbmlzwc-e51c5 arvados-compute-ping[13994]: Last ping at 2018-10-15T17:27:01.946038000Z
syslog-20181016:Oct 15 17:27:38 compute-5h7fxhrtrbmlzwc-e51c5 slurmstepd[2889]: error: *** JOB 430894 ON compute26 CANCELLED AT 2018-10-15T17:27:38 ***
syslog-20181016-Oct 15 17:27:38 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-15T17:27:38.121075473Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b: rpc error: code = 2 desc = containerd: container not found" 
syslog-20181016-Oct 15 17:27:48 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-15T17:27:48.121621949Z" level=info msg="Container fad4decebdfe failed to exit within 10 seconds of kill - trying direct SIGKILL" 
syslog-20181016-Oct 15 17:27:48 compute-5h7fxhrtrbmlzwc-e51c5 docker[2592]: time="2018-10-15T17:27:48.122768299Z" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container fad4decebdfe0009f2fc3d85dca3aa6e6d60120056e7f86ee724f266ee91610b: rpc error: code = 2 desc = containerd: container not found" 

#10 Updated by Peter Amstutz about 1 month ago

  • Status changed from New to Duplicate

#11 Updated by Tom Morris 1 day ago

  • Target version deleted (To Be Groomed)
