Bug #6996
Updated by Ward Vandewege over 8 years ago
A job on (what was then) compute0 filled up all of /tmp by writing .sam files (in the wrong place?):
<pre>
Filesystem 1K-blocks Used Available Use% Mounted on
rootfs 10188088 2603204 7044316 27% /
udev 10240 0 10240 0% /dev
tmpfs 6197672 384308 5813364 7% /run
/dev/disk/by-uuid/87239a97-c5a4-4a87-b545-606ffbff926c 10188088 2603204 7044316 27% /
tmpfs 5120 0 5120 0% /run/lock
tmpfs 12395320 0 12395320 0% /run/shm
none 262144 32 262112 1% /var/tmp
cgroup 30988348 0 30988348 0% /sys/fs/cgroup
/dev/mapper/tmp 393021956 393021936 20 100% /tmp
</pre>
<pre>
compute0.qr1hi:/# du -sh /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1
367G /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1
</pre>
This caused slurm to die:
<pre>
compute0.qr1hi:/var/log# tail daemon-20150814
2015-08-14T20:52:02+00:00 Last ping at 2015-08-14T20:52:02.059218000Z
2015-08-14T21:11:32+00:00 Node configuration differs from hardware
2015-08-14T21:58:39+00:00 [26637]: done with job
2015-08-14T21:58:40+00:00 lllp_distribution jobid [6157] implicit auto binding: sockets, dist 1
2015-08-14T21:58:40+00:00 _task_layout_lllp_cyclic
2015-08-14T21:58:40+00:00 _lllp_generate_cpu_bind jobid [6157]: mask_cpu, 0xFFFF
2015-08-14T21:58:40+00:00 launch task 6157.7 request from 4005.4005@10.23.153.113 (port 54682)
2015-08-14T21:58:40+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
</pre>
Docker also died:
<pre>
compute0.qr1hi:/var/log# docker ps
FATA[0000] Get http:///var/run/docker.sock/v1.18/containers/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?
</pre>
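A dead daemon shows up as a missing or unconnectable control socket, which is exactly what the "dial unix /var/run/docker.sock" error above is reporting. A minimal probe sketch (the socket path is taken from the error message; the function name and timeout are illustrative, not part of any existing tooling):
<pre>
import socket

def docker_daemon_alive(sock_path="/var/run/docker.sock", timeout=5):
    """Return True if something accepts connections on the Docker
    control socket; a dead daemon shows up as a missing socket file
    or a refused connection."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(sock_path)
        return True
    except (socket.error, OSError):
        return False
    finally:
        s.close()
</pre>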
Node manager did not kill off this node. It should have, because it was no longer functional.
The effect was that a node was up, a diagnostic job was waiting, and node manager didn't spin up more nodes because a node was already there.
Only when a second job was queued did node manager bring up a new node (now also called compute0 - so slurm/api realized the other one was dead!).
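A periodic health check along these lines could have caught all of this: if /tmp is (nearly) full, or slurm already reports the node as down/drained, shut the node down instead of counting it as available. This is only a sketch under stated assumptions (the 1 GiB threshold, the function names, and shelling out to sinfo are all illustrative, not the actual node manager logic):
<pre>
import os
import subprocess

TMP_MIN_FREE_BYTES = 1 << 30  # assumption: demand at least 1 GiB free on /tmp

def tmp_has_space(path="/tmp", min_free=TMP_MIN_FREE_BYTES):
    # statvfs on the scratch filesystem; it hitting 100% is what broke
    # slurmd (the cred_state writes) and docker above.
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize >= min_free

def slurm_says_node_ok(node):
    # Ask slurm for the node's short state (%t); down/drain/fail means
    # the scheduler already considers the node unusable.
    state = subprocess.check_output(
        ["sinfo", "-h", "-n", node, "-o", "%t"]).strip()
    return not any(bad in state for bad in (b"down", b"drain", b"fail"))

def node_is_healthy(node):
    return tmp_has_space() and slurm_says_node_ok(node)
</pre>
Combined with the Docker socket probe above, any failing check would be grounds for node manager to kill the node rather than leaving it idle but broken.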
I killed the offending node.