Bug #6996

Updated by Ward Vandewege over 8 years ago

A job on (what was then) compute0 filled up all of /tmp by writing .sam files (possibly in the wrong place):

 <pre> 
 rootfs                                                    10188088     2603204     7044316    27% / 
 udev                                                         10240           0       10240     0% /dev 
 tmpfs                                                      6197672      384308     5813364     7% /run 
 /dev/disk/by-uuid/87239a97-c5a4-4a87-b545-606ffbff926c    10188088     2603204     7044316    27% / 
 tmpfs                                                         5120           0        5120     0% /run/lock 
 tmpfs                                                     12395320           0    12395320     0% /run/shm 
 none                                                        262144          32      262112     1% /var/tmp 
 cgroup                                                    30988348           0    30988348     0% /sys/fs/cgroup 
 /dev/mapper/tmp                                          393021956 393021936          20 100% /tmp 
 </pre> 

 <pre> 
 compute0.qr1hi:/# du -sh /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1 
 367G 	 /tmp/docker/vfs/dir/1f6b021d8440b037bbd34ceccf16a9f7770e6173e8fc1de9d6136d29140253dc/crunch-job-task-work/compute0.1 
 </pre> 
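For reference, a quick way to spot which container scratch directory is consuming /tmp (a sketch; the path layout assumes docker's vfs graph driver, as on this node):

<pre>
# Show the largest per-task scratch dirs under docker's vfs storage,
# human-readable sizes, biggest last.
du -sh /tmp/docker/vfs/dir/*/crunch-job-task-work/* 2>/dev/null | sort -h | tail -5
</pre>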

 This caused slurm to die: 

 <pre> 
 compute0.qr1hi:/var/log# tail daemon-20150814 
 2015-08-14T20:52:02+00:00 Last ping at 2015-08-14T20:52:02.059218000Z 
 2015-08-14T21:11:32+00:00 Node configuration differs from hardware 
 2015-08-14T21:58:39+00:00 [26637]: done with job 
 2015-08-14T21:58:40+00:00 lllp_distribution jobid [6157] implicit auto binding: sockets, dist 1 
 2015-08-14T21:58:40+00:00 _task_layout_lllp_cyclic  
 2015-08-14T21:58:40+00:00 _lllp_generate_cpu_bind jobid [6157]: mask_cpu, 0xFFFF 
 2015-08-14T21:58:40+00:00 launch task 6157.7 request from 4005.4005@10.23.153.113 (port 54682) 
 2015-08-14T21:58:40+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device 
 2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device 
 2015-08-14T21:58:49+00:00 error: write /tmp/slurmd/cred_state.new error No space left on device 
 </pre> 

 Docker also died: 

 <pre> 
 compute0.qr1hi:/var/log# docker ps 
 FATA[0000] Get http:///var/run/docker.sock/v1.18/containers/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?  
 </pre> 

 Node manager did not kill off this node. It should have, because it was no longer functional. 
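A minimal liveness probe along these lines could have flagged the node as broken; this is only a sketch of the idea, not something node manager currently runs, and the 1 GiB free-space threshold is an arbitrary assumption:

<pre>
#!/bin/sh
# Sketch of a per-node health check: fail if the docker daemon is
# unreachable or /tmp has (almost) no free space.
docker ps >/dev/null 2>&1 || { echo "docker unresponsive"; exit 1; }
avail_kb=$(df -k /tmp | awk 'NR==2 {print $4}')
[ "$avail_kb" -gt 1048576 ] || { echo "/tmp nearly full"; exit 1; }  # require >1 GiB free
echo "node healthy"
</pre>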

The effect was that a node was up, a diagnostic job was waiting, and node manager didn't spin up more nodes because a node was already available.

 Only when a second job was queued did node manager bring up a new node (now also called compute0 - so slurm/api realized the other one was dead!). 

 I killed the offending node. 
