[Node Manager] a node filled up its /tmp which killed slurm and docker, node unusable. Node manager did not kill the node.
A job on (then) compute0 filled up all of /tmp by writing .sam files (in the wrong place?):
rootfs 10188088 2603204 7044316 27% /
udev 10240 0 10240 0% /dev
tmpfs 6197672 384308 5813364 7% /run
/dev/disk/by-uuid/87239a97-c5a4-4a87-b545-606ffbff926c 10188088 2603204 7044316 27% /
tmpfs 5120 0 5120 0% /run/lock
tmpfs 12395320 0 12395320 0% /run/shm
none 262144 32 262112 1% /var/tmp
cgroup 30988348 0 30988348 0% /sys/fs/cgroup
/dev/mapper/tmp 393021956 393021936 20 100% /tmp

compute0.qr1hi:/# du -sh /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1
367G /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1
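For illustration only, here is a minimal sketch of how a node-side check might flag a full filesystem from df-style rows like the ones above. The function name `full_filesystems`, the 95% threshold, and the idea of running such a check at all are assumptions for this example; nothing like it is known to exist in Node Manager or on the compute nodes.

```python
# Hypothetical helper: not part of Node Manager or Arvados.
# Rows copied from the df output in this report.
DF_OUTPUT = """\
rootfs 10188088 2603204 7044316 27% /
tmpfs 6197672 384308 5813364 7% /run
none 262144 32 262112 1% /var/tmp
/dev/mapper/tmp 393021956 393021936 20 100% /tmp
"""


def full_filesystems(df_rows, threshold_pct=95):
    """Return mount points whose Use% column is at or above threshold_pct."""
    full = []
    for row in df_rows.strip().splitlines():
        fields = row.split()
        if len(fields) != 6:
            continue  # skip header lines or anything that is not a data row
        use_pct = fields[4]
        if not (use_pct.endswith("%") and use_pct[:-1].isdigit()):
            continue  # Use% column is missing or malformed
        if int(use_pct[:-1]) >= threshold_pct:
            full.append(fields[5])
    return full


if __name__ == "__main__":
    print(full_filesystems(DF_OUTPUT))
```

With the rows above, only /tmp is reported, which matches what happened on this node.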
This caused slurm to die:
compute0.qr1hi:/var/log# tail daemon-20150814
2015-08-14T20:52:02+00:00 Last ping at 2015-08-14T20:52:02.059218000Z
2015-08-14T21:11:32+00:00 Node configuration differs from hardware
2015-08-14T21:58:39+00:00 : done with job
2015-08-14T21:58:40+00:00 lllp_distribution jobid  implicit auto binding: sockets, dist 1
2015-08-14T21:58:40+00:00 _task_layout_lllp_cyclic
2015-08-14T21:58:40+00:00 _lllp_generate_cpu_bind jobid : mask_cpu, 0xFFFF
2015-08-14T21:58:40+00:00 launch task 6157.7 request from email@example.com (port 54682)
2015-08-14T21:58:40+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
Docker also died:
compute0.qr1hi:/var/log# docker ps
FATA Get http:///var/run/docker.sock/v1.18/containers/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?
Node manager did not kill off this node. It should have, because it was no longer functional.
The effect was that a node was up, a diagnostic job was waiting, and node manager didn't spin up more nodes because there was a node already.
Only when a second job was queued did node manager bring up a new node (now also called compute0 - so slurm/api realized the other one was dead!).
I killed the offending node.
#3 Updated by Brett Smith almost 4 years ago
Remember that in the initial development of Node Manager, we decided it was best to take the conservative approach of only shutting down nodes that assert that they're idle. If nodes are in "weird" states, the expectation is that they might still be doing compute work locally, even if they're having trouble talking to SLURM or Arvados, so shutting them down risks losing the work. Remember also that Node Manager only knows what Arvados and the cloud tell it. It doesn't have any direct insight into the state of the compute node.
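As a rough illustration of the policy described above, the shutdown decision can be sketched as follows. This is not Node Manager's actual code: `NodeRecord`, `slurm_state`, and `eligible_for_shutdown` are hypothetical names, and the sketch assumes the only input is the node state reported through Arvados.

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    """Hypothetical view of a node as seen through Arvados, not the node itself."""
    name: str
    slurm_state: str  # e.g. "idle", "alloc", "down"


def eligible_for_shutdown(node):
    """Conservative rule: only a node that asserts it is idle may be shut down.

    Any other state ("alloc", "down", or something unexpected) is treated as
    possibly still doing compute work, so the node is left alone.
    """
    return node.slurm_state == "idle"


if __name__ == "__main__":
    for state in ("idle", "alloc", "down"):
        node = NodeRecord(name="compute0", slurm_state=state)
        print(state, eligible_for_shutdown(node))
```

Under this rule, a node reported as alloc and later as down is never eligible for shutdown, which is consistent with the behavior reported in this ticket.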
Looking at the crunch-dispatch logs, it marked compute0 in the SLURM alloc state starting at 21:19:04, then down at 00:01:20. What state did you want Node Manager to see and respond to?