Bug #6996

[Node Manager] a node filled up its /tmp which killed slurm and docker, node unusable. Node manager did not kill the node.

Added by Ward Vandewege about 4 years ago. Updated about 4 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

A job on the node then named compute0 filled up all of /tmp by writing .sam files (possibly in the wrong place):

rootfs                                                  10188088   2603204   7044316  27% /
udev                                                       10240         0     10240   0% /dev
tmpfs                                                    6197672    384308   5813364   7% /run
/dev/disk/by-uuid/87239a97-c5a4-4a87-b545-606ffbff926c  10188088   2603204   7044316  27% /
tmpfs                                                       5120         0      5120   0% /run/lock
tmpfs                                                   12395320         0  12395320   0% /run/shm
none                                                      262144        32    262112   1% /var/tmp
cgroup                                                  30988348         0  30988348   0% /sys/fs/cgroup
/dev/mapper/tmp                                        393021956 393021936        20 100% /tmp
compute0.qr1hi:/# du -sh /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1
367G    /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1
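For reference, a simple free-space check like the sketch below would have flagged this condition long before slurmd hit ENOSPC (at the end, /tmp had 20 1K-blocks free out of ~375 GB). This is a hypothetical monitoring helper using only the Python standard library, not part of Node Manager or any Arvados component:

```python
import shutil

def nearly_full(free_bytes, total_bytes, min_free_fraction=0.05):
    # True when less than min_free_fraction (default 5%) of the
    # filesystem's capacity remains free.
    return free_bytes / total_bytes < min_free_fraction

def tmp_nearly_full(path="/tmp"):
    # Check the filesystem that actually holds `path`.
    usage = shutil.disk_usage(path)
    return nearly_full(usage.free, usage.total)
```

Fed the numbers from the df output above (20 free of 393021956 blocks), nearly_full returns True; for the root filesystem (7044316 free of 10188088) it returns False.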

This caused slurm to die:

compute0.qr1hi:/var/log# tail daemon-20150814
2015-08-14T20:52:02+00:00 Last ping at 2015-08-14T20:52:02.059218000Z
2015-08-14T21:11:32+00:00 Node configuration differs from hardware
2015-08-14T21:58:39+00:00 [26637]: done with job
2015-08-14T21:58:40+00:00 lllp_distribution jobid [6157] implicit auto binding: sockets, dist 1
2015-08-14T21:58:40+00:00 _task_layout_lllp_cyclic 
2015-08-14T21:58:40+00:00 _lllp_generate_cpu_bind jobid [6157]: mask_cpu, 0xFFFF
2015-08-14T21:58:40+00:00 launch task 6157.7 request from 4005.4005@10.23.153.113 (port 54682)
2015-08-14T21:58:40+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device
2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device

Docker also died:

compute0.qr1hi:/var/log# docker ps
FATA[0000] Get http:///var/run/docker.sock/v1.18/containers/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS? 

Node manager did not kill off this node. It should have, because it was no longer functional.

The effect was that a node was up, a diagnostic job was waiting, and node manager didn't spin up more nodes because a node already existed.

Only when a second job was queued did node manager bring up a new node (now also called compute0 - so slurm/api realized the other one was dead!).
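One way to detect this kind of dead-but-present node would be to poll slurm for the node's state and treat "down"/"drained" states as shutdown candidates. The sketch below is speculative, not how Node Manager is actually implemented; the state list and the sinfo-based polling are assumptions for illustration:

```python
import subprocess

# Compact slurm state strings that suggest the node is no longer
# doing useful work (assumption: this set is illustrative, not
# Node Manager's actual policy).
BAD_STATES = {"down", "down*", "drain", "drained", "fail"}

def slurm_node_state(nodename):
    # "%t" asks sinfo for the node's compact state string,
    # e.g. "idle", "alloc", or "down*".
    out = subprocess.run(
        ["sinfo", "-h", "-n", nodename, "-o", "%t"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def should_shut_down(state):
    return state in BAD_STATES
```

As Brett notes in comment #3 below, the conservative design only shuts down nodes asserting they are idle; acting on "down" states like this trades the risk of losing in-flight local work for faster recovery from wedged nodes.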

I killed the offending node.


History

#1 Updated by Ward Vandewege about 4 years ago

  • Description updated (diff)

#2 Updated by Ward Vandewege about 4 years ago

  • Subject changed from [Node Manager] a node filled up its /tmp which killed slurm. Node manager did not kill the node. to [Node Manager] a node filled up its /tmp which killed slurm and docker, node unusable. Node manager did not kill the node.

#3 Updated by Brett Smith about 4 years ago

Remember that in the initial development of Node Manager, we decided it was best to take the conservative approach of only shutting down nodes that assert that they're idle. If nodes are in "weird" states, the expectation is that they might still be doing compute work locally, even if they're having trouble talking to SLURM or Arvados, so shutting them down risks losing the work. Remember also that Node Manager only knows what Arvados and the cloud tell it. It doesn't have any direct insight into the state of the compute node.

Looking at the crunch-dispatch logs, it marked compute0 in the SLURM alloc state starting at 21:19:04, then down at 00:01:20. What state did you want Node Manager to see and respond to?

#4 Updated by Brett Smith about 4 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints
