Bug #5425
Updated by Peter Amstutz over 9 years ago
I have seen repeated SLURM node failures when re-running Sally's killer job of doom. We should find out why.

https://workbench.qr1hi.arvadosapi.com/jobs/qr1hi-8i9sb-7jc0nde0tqv3u6y
https://workbench.qr1hi.arvadosapi.com/jobs/qr1hi-8i9sb-exy86qos9p66mbb
https://workbench.qr1hi.arvadosapi.com/jobs/qr1hi-8i9sb-5v3jd4d1va26a0p

Example log:

<pre>
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 29 stderr crunchstat: mem 3624960 cache 25346048 swap 5014 pgmajfault 214380544 rss
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 29 stderr crunchstat: cpu 442.5000 user 11.4500 sys 8 cpus -- interval 10.1232 seconds 4.7900 user 0.0400 sys
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 29 stderr crunchstat: blkio:202:32 0 write 37167104 read -- interval 10.1231 seconds 0 write 446464 read
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 29 stderr crunchstat: blkio:202:16 0 write 21766144 read -- interval 10.1231 seconds 0 write 0 read
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 29 stderr crunchstat: net:eth0 201676172 tx 1899475 rx -- interval 10.1232 seconds 0 tx 0 rx
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 26 stderr crunchstat: mem 3215360 cache 64135168 swap 9609 pgmajfault 655527936 rss
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 26 stderr crunchstat: cpu 1021.2800 user 34.5800 sys 8 cpus -- interval 10.0002 seconds 3.3500 user 1.2500 sys
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 26 stderr crunchstat: blkio:202:32 0 write 71942144 read -- interval 10.0002 seconds 0 write 4063232 read
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 26 stderr crunchstat: blkio:202:16 0 write 31727616 read -- interval 10.0002 seconds 0 write 0 read
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 26 stderr crunchstat: net:eth0 1291918100 tx 4381065 rx -- interval 10.0001 seconds 283839002 tx 648240 rx
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 24 stderr crunchstat: mem 3481600 cache 68702208 swap 31962 pgmajfault 525033472 rss
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 24 stderr crunchstat: cpu 1408.4100 user 49.9600 sys 8 cpus -- interval 9.9997 seconds 6.1600 user 0.1400 sys
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 24 stderr crunchstat: blkio:202:16 0 write 4444160 read -- interval 9.9996 seconds 0 write 77824 read
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 24 stderr crunchstat: blkio:202:32 0 write 288575488 read -- interval 9.9996 seconds 0 write 176128 read
2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 24 stderr crunchstat: net:eth0 1612714910 tx 5267991 rx -- interval 9.9996 seconds 0 tx 0 rx
2015-03-09_14:30:24 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 30 stderr crunchstat: mem 3399680 cache 25018368 swap 5972 pgmajfault 275226624 rss
2015-03-09_14:30:24 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 30 stderr crunchstat: cpu 517.9000 user 18.2500 sys 8 cpus -- interval 10.4004 seconds 5.3300 user 0.0800 sys
2015-03-09_14:30:24 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 30 stderr crunchstat: blkio:202:32 0 write 42160128 read -- interval 10.4003 seconds 0 write 0 read
2015-03-09_14:30:24 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 30 stderr crunchstat: blkio:202:16 0 write 770048 read -- interval 10.4003 seconds 0 write 0 read
2015-03-09_14:30:24 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 30 stderr crunchstat: net:eth0 403174439 tx 2484773 rx -- interval 10.4003 seconds 0 tx 0 rx
2015-03-09_14:30:25 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 31 stderr crunchstat: mem 3248128 cache 34291712 swap 1019 pgmajfault 243732480 rss
2015-03-09_14:30:25 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 31 stderr crunchstat: cpu 382.0000 user 13.2300 sys 8 cpus -- interval 10.0001 seconds 4.0700 user 0.0500 sys
2015-03-09_14:30:25 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 31 stderr crunchstat: blkio:202:32 0 write 7876608 read -- interval 10.0001 seconds 0 write 212992 read
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 33 stderr srun: error: Node failure on compute19
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 backing off node compute19 for 60 seconds
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 32 stderr srun: error: Node failure on compute19
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 backing off node compute19 for 60 seconds
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 26 stderr srun: error: Node failure on compute19
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 backing off node compute19 for 60 seconds
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 34 stderr srun: error: Node failure on compute19
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 backing off node compute19 for 60 seconds
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 30 stderr srun: error: Node failure on compute19
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 backing off node compute19 for 60 seconds
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 24 stderr srun: error: Node failure on compute19
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 backing off node compute19 for 60 seconds
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 31 stderr srun: error: Node failure on compute19
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 backing off node compute19 for 60 seconds
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 29 stderr srun: error: Node failure on compute19
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 backing off node compute19 for 60 seconds
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 24 child 1486 on compute19.2 exit 0 success=
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 24 failure (#1, temporary ) after 3006 seconds
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 24 task output (0 bytes):
2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 Every node has failed -- giving up on this round
</pre>
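
To compare the three runs, it may help to pull the memory figures and the timing of the failure out of each job log. The following is a minimal sketch, not part of Arvados: it assumes each log line has the shape `<timestamp> <job uuid> <pid> [<task number>] <message>` and that crunchstat prints each value before its label (for example `214380544 rss`), as the log above appears to. The names `parse_mem` and `summarize` are hypothetical helpers introduced for illustration.

```python
import re
from datetime import datetime

# Assumed line shape: <timestamp> <job uuid> <pid> [<task number>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}) "
    r"(?P<job>\S+) (?P<pid>\d+) (?:(?P<task>\d+) )?(?P<msg>.*)$"
)
MEM_PREFIX = "stderr crunchstat: mem "


def parse_mem(msg):
    """Return {label: value} for a crunchstat mem message, else None.

    Assumes every number is followed by its label ("214380544 rss").
    """
    if not msg.startswith(MEM_PREFIX):
        return None
    tokens = msg[len(MEM_PREFIX):].split()
    return {label: int(value) for value, label in zip(tokens[0::2], tokens[1::2])}


def summarize(lines):
    """Collect peak rss per task and the gap before the first node failure."""
    peak_rss = {}
    last_stat = None      # last crunchstat sample seen before any failure
    first_failure = None  # first "Node failure" message
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not fit the assumed shape
        ts = datetime.strptime(m["ts"], "%Y-%m-%d_%H:%M:%S")
        msg = m["msg"]
        if "Node failure on" in msg and first_failure is None:
            first_failure = ts
        if "crunchstat:" in msg and first_failure is None:
            last_stat = ts
        mem = parse_mem(msg)
        if mem is not None and "rss" in mem and m["task"] is not None:
            task = int(m["task"])
            peak_rss[task] = max(peak_rss.get(task, 0), mem["rss"])
    gap = None
    if last_stat is not None and first_failure is not None:
        gap = (first_failure - last_stat).total_seconds()
    return {"peak_rss": peak_rss, "seconds_silent_before_failure": gap}


# A few lines copied from the example log above.
SAMPLE = [
    "2015-03-09_14:30:23 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 26 stderr crunchstat: mem 3215360 cache 64135168 swap 9609 pgmajfault 655527936 rss",
    "2015-03-09_14:30:25 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 31 stderr crunchstat: mem 3248128 cache 34291712 swap 1019 pgmajfault 243732480 rss",
    "2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 33 stderr srun: error: Node failure on compute19",
    "2015-03-09_14:34:59 qr1hi-8i9sb-7jc0nde0tqv3u6y 18871 backing off node compute19 for 60 seconds",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In the excerpt above, the last crunchstat sample is at 14:30:25 and the first node failure is reported at 14:34:59, so the node was silent for several minutes before SLURM gave up on it. Running the same summary over all three job logs would show whether that pattern repeats.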