Bug #5943

Updated by Bryan Cosca almost 9 years ago

Job qr1hi-d1hrv-bgj6vtddg9cmyyt looks like it failed because nodes were down: 

 2015-05-07_18:33:05 salloc: error: Failed to allocate resources: Required node not available (down or drained) 
 2015-05-07_18:34:44 salloc: Granted job allocation 1226 
 2015-05-07_18:34:46 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    running from /usr/local/arvados/src/sdk/cli/bin/crunch-job with arvados-cli Gem version(s) 0.1.20150205181653, 0.1.20150128223752, 0.1.20150121183928, 0.1.20141209151444, 0.1.20141014201516, 0.1.20140919104705, 0.1.20140905165259, 0.1.20140827170424, 0.1.20140825141611, 0.1.20140812162850, 0.1.20140708213257, 0.1.20140707162447, 0.1.20140630151639, 0.1.20140513131358, 0.1.20140513101345, 0.1.20140414145041 
 2015-05-07_18:34:46 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    check slurm allocation 
 2015-05-07_18:34:46 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    node compute26 - 1 slots 
 2015-05-07_18:34:46 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    start 
 2015-05-07_18:34:46 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    Clean work dirs 
 2015-05-07_18:34:46 starting: ['srun','--nodelist=compute26','-D','/tmp','bash','-ec','mount -t fuse,fuse.keep | awk \'($3 ~ /\\ykeep\\y/){print $3}\' | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid'] 
 2015-05-07_18:34:48 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    Cleanup command exited 0 
 2015-05-07_18:34:48 starting: ['srun','--nodelist=compute26','/bin/sh','-ec',' if ! /usr/bin/docker.io images -q --no-trunc --all | grep -qxF f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8; then       arv-get 256f21bb3abfcd8e08a893886bf3e7c0\\+5082\\/f4eafaf1e2d738e0f8d947feb725b5945f0219c5c4956eec6e164a0788abbab8\\.tar | /usr/bin/docker.io load fi '] 
 2015-05-07_18:38:31 starting: ['srun','--nodelist=compute26','/bin/sh','-ec','/usr/bin/docker.io run --help | grep -qe --memory-swap='] 
 2015-05-07_18:38:31 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    Packing Arvados SDK version f0fe7273c1851cb93e9edd58c0b60d3590b222ed for installation 
 2015-05-07_18:38:32 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    Looking for version 83a9390a05bbffc2e4ea95dd693af3ab3547fa12 from repository arvados 
 2015-05-07_18:38:32 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    Using local repository '/var/lib/arvados/internal.git' 
 2015-05-07_18:38:32 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    Version 83a9390a05bbffc2e4ea95dd693af3ab3547fa12 is commit 83a9390a05bbffc2e4ea95dd693af3ab3547fa12 
 2015-05-07_18:38:32 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    Run install script on all workers 
 2015-05-07_18:38:32 starting: ['srun','--nodelist=compute26','-D','/tmp','--job-name=qr1hi-8i9sb-v2ojzh70vxl2xhh','sh','-c','mkdir -p /tmp/crunch-job/opt && cd /tmp/crunch-job && perl -'] 
 2015-05-07_18:38:33 srun: error: compute26: task 0: Exited with exit code 141 
 2015-05-07_18:38:33 qr1hi-8i9sb-v2ojzh70vxl2xhh 8317    Install script exited 141 
 2015-05-07_18:38:33 salloc: Relinquishing job allocation 1226 


 This might also help: 

 I also thought it was weird that the modified_by_user_uuid changed from qr1hi-tpzed-vm0nd4e7a013f8e to qr1hi-tpzed-000000000000000 

 I should also note that it would be really helpful if this job re-ran automatically when a node goes down.
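
 Something along these lines could work (just a sketch, nothing that exists in Arvados today; the `retry` helper name and attempt count are made up):

```shell
# retry N CMD...: run CMD until it succeeds, up to N attempts,
# pausing briefly between tries. A hypothetical wrapper; the real
# fix would presumably live in crunch-dispatch, not a shell script.
retry() {
  max=$1; shift
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1            # give up after N failed attempts
    fi
    attempt=$((attempt + 1))
    sleep 1               # back off before re-trying
  done
  return 0
}
```

 e.g. wrapping the allocation step as `retry 3 salloc ...` would re-request resources a few times instead of failing the whole job on the first "Required node not available (down or drained)".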
