Bug #6996
[Node Manager] a node filled up its /tmp which killed slurm and docker, node unusable. Node manager did not kill the node.
Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -
Description
A job on (then) compute0 filled up all of /tmp by writing .sam files (in the wrong place?):
rootfs                                                  10188088   2603204    7044316  27% /
udev                                                       10240         0      10240   0% /dev
tmpfs                                                    6197672    384308    5813364   7% /run
/dev/disk/by-uuid/87239a97-c5a4-4a87-b545-606ffbff926c  10188088   2603204    7044316  27% /
tmpfs                                                       5120         0       5120   0% /run/lock
tmpfs                                                   12395320         0   12395320   0% /run/shm
none                                                      262144        32     262112   1% /var/tmp
cgroup                                                  30988348         0   30988348   0% /sys/fs/cgroup
/dev/mapper/tmp                                        393021956 393021936         20 100% /tmp
compute0.qr1hi:/# du -sh /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1
367G    /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1
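The /tmp filesystem hit 100% before anything noticed. A minimal disk-usage check along these lines (a hypothetical sketch, not part of the actual node manager code; the function name and threshold are made up) could have flagged the node before it got there:

```python
import shutil

def tmp_nearly_full(path="/tmp", max_used_fraction=0.95):
    """Return True when the filesystem holding `path` is nearly full.

    Hypothetical monitoring check: a threshold like 95% would have
    fired on compute0 well before /dev/mapper/tmp reached 100%.
    """
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= max_used_fraction
```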
This caused slurm to die:
compute0.qr1hi:/var/log# tail daemon-20150814
2015-08-14T20:52:02+00:00 Last ping at 2015-08-14T20:52:02.059218000Z
2015-08-14T21:11:32+00:00 Node configuration differs from hardware
2015-08-14T21:58:39+00:00 [26637]: done with job
2015-08-14T21:58:40+00:00 lllp_distribution jobid [6157] implicit auto binding: sockets, dist 1
2015-08-14T21:58:40+00:00 _task_layout_lllp_cyclic
2015-08-14T21:58:40+00:00 _lllp_generate_cpu_bind jobid [6157]: mask_cpu, 0xFFFF
2015-08-14T21:58:40+00:00 launch task 6157.7 request from 4005.4005@10.23.153.113 (port 54682)
2015-08-14T21:58:40+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
Docker also died:
compute0.qr1hi:/var/log# docker ps
FATA[0000] Get http:///var/run/docker.sock/v1.18/containers/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?
Node manager did not kill off this node. It should have, because it was no longer functional.
The effect was that the node stayed up, a diagnostic job sat waiting, and node manager didn't spin up more nodes because a node was already there.
Only when a second job was queued did node manager bring up a new node (also named compute0 - so slurm/api had realized the other one was dead!).
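A node in this state is detectable: the docker socket is gone and slurm considers the node unusable. A heuristic "is this node broken?" check could look like the sketch below (hypothetical; the function name and thresholds are assumptions, not the node manager's actual logic - only the docker socket path and `sinfo` are standard):

```python
import os
import subprocess

def node_is_broken(nodename, docker_sock="/var/run/docker.sock"):
    """Hypothetical health check for a slurm compute node.

    Treat the node as broken when the docker daemon socket has
    disappeared (as in the `docker ps` failure above) or slurm
    reports the node down/drained.
    """
    # Docker died and its socket vanished on compute0.
    if not os.path.exists(docker_sock):
        return True
    # Ask slurm for the node's state; sinfo -o %T prints it bare.
    try:
        state = subprocess.check_output(
            ["sinfo", "--noheader", "-o", "%T", "-n", nodename],
            text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        return True  # can't even query slurm: treat as broken
    return state in ("down", "down*", "drained", "fail")
```

With a check like this feeding node manager's shutdown decision, a node that can no longer run jobs would be killed instead of blocking the queue.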
I killed the offending node.