Bug #6996

[Node Manager] a node filled up its /tmp which killed slurm and docker, node unusable. Node manager did not kill the node.

Added by Ward Vandewege over 8 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

A job on (then) compute0 filled up all of /tmp by writing .sam files (in the wrong place?):

Filesystem                                             1K-blocks      Used Available Use% Mounted on
rootfs                                                  10188088   2603204   7044316  27% /
udev                                                       10240         0     10240   0% /dev
tmpfs                                                    6197672    384308   5813364   7% /run
/dev/disk/by-uuid/87239a97-c5a4-4a87-b545-606ffbff926c  10188088   2603204   7044316  27% /
tmpfs                                                       5120         0      5120   0% /run/lock
tmpfs                                                   12395320         0  12395320   0% /run/shm
none                                                      262144        32    262112   1% /var/tmp
cgroup                                                  30988348         0  30988348   0% /sys/fs/cgroup
/dev/mapper/tmp                                        393021956 393021936        20 100% /tmp
compute0.qr1hi:/# du -sh /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1
367G    /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1
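
A node could catch this condition before slurmd and docker start failing. As a rough illustration only (this is not node manager's actual code; the helper name tmp_nearly_full and the 95% threshold are assumptions), a periodic Python check of the /tmp filesystem could look like:

import shutil

def tmp_nearly_full(path="/tmp", max_used_fraction=0.95):
    # shutil.disk_usage reports (total, used, free) in bytes for the
    # filesystem that holds `path`.
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= max_used_fraction

if __name__ == "__main__":
    if tmp_nearly_full():
        print("/tmp is nearly full; slurmd and docker writes will start failing")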

This caused slurm to die:

compute0.qr1hi:/var/log# tail daemon-20150814
2015-08-14T20:52:02+00:00 Last ping at 2015-08-14T20:52:02.059218000Z
2015-08-14T21:11:32+00:00 Node configuration differs from hardware
2015-08-14T21:58:39+00:00 [26637]: done with job
2015-08-14T21:58:40+00:00 lllp_distribution jobid [6157] implicit auto binding: sockets, dist 1
2015-08-14T21:58:40+00:00 _task_layout_lllp_cyclic 
2015-08-14T21:58:40+00:00 _lllp_generate_cpu_bind jobid [6157]: mask_cpu, 0xFFFF
2015-08-14T21:58:40+00:00 launch task 6157.7 request from 4005.4005@10.23.153.113 (port 54682)
2015-08-14T21:58:40+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device

Docker also died:

compute0.qr1hi:/var/log# docker ps
FATA[0000] Get http:///var/run/docker.sock/v1.18/containers/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS? 
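
The dead daemon is detectable from the node itself. As a hedged sketch (not how node manager actually probes Docker; it assumes the default socket path and the /_ping endpoint of the Docker Remote API):

import socket

DOCKER_SOCKET = "/var/run/docker.sock"  # default Docker daemon socket

def docker_responding(timeout=5):
    # Ask the daemon for /_ping over its Unix socket; any OSError
    # (socket missing, connection refused, timeout) means it is down,
    # which matches the "no such file or directory" error above.
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(DOCKER_SOCKET)
            sock.sendall(b"GET /_ping HTTP/1.0\r\nHost: docker\r\n\r\n")
            return b"200 OK" in sock.recv(1024)
    except OSError:
        return False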

Node manager did not kill off this node. It should have, because it was no longer functional.
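
For illustration only (node manager's real shutdown logic is not reproduced here), a brokenness check could combine the probes sketched above with slurm's own view of the node; node_is_broken and the sinfo state list are assumptions, not Arvados code:

import subprocess

def slurm_node_down(nodename):
    # Ask sinfo for the node's compact state (%t); treat a failed or
    # hung sinfo call the same as a down node.
    try:
        state = subprocess.check_output(
            ["sinfo", "-h", "-n", nodename, "-o", "%t"],
            timeout=10).decode().strip()
    except (subprocess.SubprocessError, OSError):
        return True
    return state in ("down", "down*", "drain", "drng", "fail")

def node_is_broken(nodename):
    # Reuses tmp_nearly_full() and docker_responding() from the sketches
    # above: a node that cannot write to /tmp, run containers, or accept
    # slurm jobs should be shut down rather than counted as available.
    return (tmp_nearly_full()
            or not docker_responding()
            or slurm_node_down(nodename))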

The effect was that a node was up, a diagnostic job was waiting, and node manager didn't spin up more nodes because a node was already there.

Only when a second job was queued did node manager bring up a new node (now also called compute0 - so slurm/api realized the other one was dead!).

I killed the offending node.
