Bug #4410

Updated by Brett Smith about 10 years ago

crunch-job currently has logic to detect node failures and retry the task that was running when one happens. However, at least sometimes, this strategy is doomed to fail: for certain kinds of node failures, SLURM automatically revokes the job allocation, meaning there's no way to retry the task within it.

This just happened to qr1hi-8i9sb-3o4sdx3ekdmck42. The job was writing progress normally, then there was radio silence for five minutes, then the following appeared in the crunch-dispatch logs:

 <pre>2014-11-04_11:16:29.02602 qr1hi-8i9sb-3o4sdx3ekdmck42 ! salloc: Job allocation 2164 has been revoked. 
 2014-11-04_11:16:29.02607 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: error: Node failure on compute18 
 2014-11-04_11:16:29.02612 qr1hi-8i9sb-3o4sdx3ekdmck42 10454    backing off node compute18 for 60 seconds 
 2014-11-04_11:16:29.02616 qr1hi-8i9sb-3o4sdx3ekdmck42 ! salloc: error: Node failure on compute18 
 2014-11-04_11:16:29.02621 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 21038 on compute18.1 exit 0 success= 
 2014-11-04_11:16:29.02626 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 failure (#1, temporary ) after 1635 seconds 
 2014-11-04_11:16:29.02631 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 output 
 2014-11-04_11:16:29.02635 qr1hi-8i9sb-3o4sdx3ekdmck42 10454    Every node has failed -- giving up on this round 
 2014-11-04_11:16:29.02639 qr1hi-8i9sb-3o4sdx3ekdmck42 10454    wait for last 0 children to finish 
 2014-11-04_11:16:29.59793 qr1hi-8i9sb-3o4sdx3ekdmck42 10454    status: 9 done, 0 running, 2 todo 
 2014-11-04_11:16:29.59797 qr1hi-8i9sb-3o4sdx3ekdmck42 10454    start level 1 
 2014-11-04_11:16:29.59798 qr1hi-8i9sb-3o4sdx3ekdmck42 10454    status: 9 done, 0 running, 2 todo 
 2014-11-04_11:16:29.59800 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 job_task qr1hi-ot0gb-sjnzw4i7fyf64o5 
 2014-11-04_11:16:29.59802 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 29199 started on compute18.1 
 2014-11-04_11:16:29.59804 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: error: SLURM job 2164 has expired. 
 2014-11-04_11:16:29.59805 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job. 
 2014-11-04_11:16:29.59807 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 29199 on compute18.1 exit 1 success=false 
 2014-11-04_11:16:29.59811 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 failure (#2, temporary ) after 0 seconds 
 2014-11-04_11:16:29.59813 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 output 
 2014-11-04_11:16:29.59815 qr1hi-8i9sb-3o4sdx3ekdmck42 10454    Every node has failed -- giving up on this round 
 2014-11-04_11:16:29.59817 qr1hi-8i9sb-3o4sdx3ekdmck42 10454    wait for last 0 children to finish 
 2014-11-04_11:16:30.29339 qr1hi-8i9sb-3o4sdx3ekdmck42 10454    status: 9 done, 0 running, 2 todo 
 2014-11-04_11:16:30.29343 qr1hi-8i9sb-3o4sdx3ekdmck42 10454    release job allocation 
 2014-11-04_11:16:30.29349 qr1hi-8i9sb-3o4sdx3ekdmck42 ! scancel: error: Kill job error on job id 2164: Invalid job id specified 
 </pre> 
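
The failure mode is visible in the stderr above: once salloc reports the allocation revoked, the subsequent srun attempt fails with "SLURM job 2164 has expired", and even the final scancel gets "Invalid job id specified". Below is a minimal sketch of recognizing these messages so the retry logic can stop attempting reruns inside a dead allocation. It is illustrative Python (crunch-job itself is Perl), and the pattern set is an assumption drawn from the logs above.

<pre>import re

# stderr patterns that mean the SLURM allocation itself is gone, so
# retrying a task inside the same allocation cannot succeed.  This
# list is an assumption based on the messages captured above.
ALLOCATION_LOST = [
    re.compile(r"Job allocation \d+ has been revoked"),
    re.compile(r"SLURM job \d+ has expired"),
    re.compile(r"Kill job error on job id \d+: Invalid job id specified"),
]

def allocation_lost(stderr_line):
    """True if a salloc/srun/scancel stderr line shows the allocation
    has been revoked or expired, rather than an ordinary task error."""
    return any(p.search(stderr_line) for p in ALLOCATION_LOST)
</pre>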

Note that crunch-job takes a different branch when the job runs parallel tasks. In that case the root problem seems to be the same, but crunch-job sees the multiple consecutive failures and writes the node off completely. Here are more crunch-dispatch logs, starting about 5m30s after the last crunchstat log:

 <pre>2014-11-05_22:58:09.32582 qr1hi-8i9sb-v6drlfmnikkkatu ! salloc: Job allocation 2204 has been revoked. 
 2014-11-05_22:58:09.35447 qr1hi-8i9sb-v6drlfmnikkkatu ! salloc: error: Node failure on compute3 
 2014-11-05_22:58:09.42573 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 stderr srun: error: Node failure on compute3 
 2014-11-05_22:58:09.42580 qr1hi-8i9sb-v6drlfmnikkkatu 12516    backing off node compute3 for 60 seconds 
 2014-11-05_22:58:09.42584 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 stderr srun: error: Node failure on compute3 
 2014-11-05_22:58:09.42588 qr1hi-8i9sb-v6drlfmnikkkatu 12516    backing off node compute3 for 60 seconds 
 2014-11-05_22:58:09.42593 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 stderr srun: error: Node failure on compute3 
 2014-11-05_22:58:09.42597 qr1hi-8i9sb-v6drlfmnikkkatu 12516    backing off node compute3 for 60 seconds 
 2014-11-05_22:58:09.42601 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 stderr srun: error: Node failure on compute3 
 2014-11-05_22:58:09.42607 qr1hi-8i9sb-v6drlfmnikkkatu 12516    backing off node compute3 for 60 seconds 
 2014-11-05_22:58:09.42611 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 stderr srun: error: Node failure on compute3 
 2014-11-05_22:58:09.42615 qr1hi-8i9sb-v6drlfmnikkkatu 12516    backing off node compute3 for 60 seconds 
 2014-11-05_22:58:09.42619 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 stderr srun: error: Node failure on compute3 
 2014-11-05_22:58:09.42623 qr1hi-8i9sb-v6drlfmnikkkatu 12516    backing off node compute3 for 60 seconds 
 2014-11-05_22:58:09.42628 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 stderr srun: error: Node failure on compute3 
 2014-11-05_22:58:09.42632 qr1hi-8i9sb-v6drlfmnikkkatu 12516    backing off node compute3 for 60 seconds 
 2014-11-05_22:58:09.42637 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 stderr srun: error: Node failure on compute3 
 2014-11-05_22:58:09.42642 qr1hi-8i9sb-v6drlfmnikkkatu 12516    backing off node compute3 for 60 seconds 
 2014-11-05_22:58:09.47930 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 child 24108 on compute3.1 exit 0 success= 
 2014-11-05_22:58:09.66058 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 failure (#1, temporary ) after 1036 seconds 
 2014-11-05_22:58:10.14412 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 output 
 2014-11-05_22:58:10.14423 qr1hi-8i9sb-v6drlfmnikkkatu 12516    Every node has failed -- giving up on this round 
 2014-11-05_22:58:10.14431 qr1hi-8i9sb-v6drlfmnikkkatu 12516    wait for last 7 children to finish 
 2014-11-05_22:58:10.14440 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 child 24117 on compute3.2 exit 0 success= 
 2014-11-05_22:58:10.14447 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 failure (#1, temporary ) after 1036 seconds 
 2014-11-05_22:58:10.21059 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 output 
 2014-11-05_22:58:10.28216 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 child 24126 on compute3.3 exit 0 success= 
 2014-11-05_22:58:10.51532 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 failure (#1, temporary ) after 1037 seconds 
 2014-11-05_22:58:10.64624 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 output 
 2014-11-05_22:58:10.69064 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 child 24139 on compute3.4 exit 0 success= 
 2014-11-05_22:58:10.99143 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 failure (#1, temporary ) after 1037 seconds 
 2014-11-05_22:58:10.99241 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 output 
 2014-11-05_22:58:11.03886 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 child 24156 on compute3.5 exit 0 success= 
 2014-11-05_22:58:11.22759 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 failure (#1, temporary ) after 1037 seconds 
 2014-11-05_22:58:11.30891 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 output 
 2014-11-05_22:58:11.35442 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 child 24171 on compute3.6 exit 0 success= 
 2014-11-05_22:58:11.58957 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 failure (#1, temporary ) after 1037 seconds 
 2014-11-05_22:58:11.69667 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 output 
 2014-11-05_22:58:12.09560 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 child 24181 on compute3.7 exit 0 success= 
 2014-11-05_22:58:12.09567 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 failure (#1, temporary ) after 1037 seconds 
 2014-11-05_22:58:12.17547 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 output 
 2014-11-05_22:58:12.23033 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 child 24190 on compute3.8 exit 0 success= 
 2014-11-05_22:58:12.46902 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 failure (#1, temporary ) after 1038 seconds 
 2014-11-05_22:58:12.58698 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 output 
 2014-11-05_22:58:12.75142 qr1hi-8i9sb-v6drlfmnikkkatu 12516    status: 1 done, 0 running, 12 todo 
 2014-11-05_22:58:12.75144 qr1hi-8i9sb-v6drlfmnikkkatu 12516    stop because 8 tasks failed and none succeeded 
 2014-11-05_22:58:13.12727 qr1hi-8i9sb-v6drlfmnikkkatu 12516    release job allocation 
 2014-11-05_22:58:13.12735 qr1hi-8i9sb-v6drlfmnikkkatu ! scancel: error: Kill job error on job id 2204: Invalid job id specified 
 </pre> 
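
In the parallel case all eight tasks on compute3 fail at once, so the per-node bookkeeping writes the node off instead of merely backing off. A minimal sketch of that kind of bookkeeping, in illustrative Python rather than crunch-job's actual Perl: the 60-second backoff comes from the log, while the write-off threshold here is hypothetical.

<pre>import time
from collections import defaultdict

BACKOFF_SECONDS = 60   # matches "backing off node ... for 60 seconds" above
WRITE_OFF_AFTER = 3    # hypothetical consecutive-failure threshold

fail_count = defaultdict(int)    # node -> consecutive failure count
hold_until = defaultdict(float)  # node -> epoch time it may be reused
dead_nodes = set()               # nodes written off for this round

def record_failure(node):
    fail_count[node] += 1
    if fail_count[node] >= WRITE_OFF_AFTER:
        dead_nodes.add(node)     # too many consecutive failures: write it off
    else:
        hold_until[node] = time.time() + BACKOFF_SECONDS

def record_success(node):
    fail_count[node] = 0         # only consecutive failures count

def node_usable(node):
    return node not in dead_nodes and time.time() >= hold_until[node]
</pre>

Once every node lands in dead_nodes there is nothing left to schedule on, which is exactly the "Every node has failed -- giving up on this round" state in both logs.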

In cases like this, crunch-job should exit with a tempfail status to signal to crunch-dispatch that the job can be retried once it acquires a new SLURM allocation.
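
A minimal sketch of that exit path, assuming EX_TEMPFAIL (75) from sysexits.h as the retryable status; whether crunch-dispatch should key off that exact value is an open choice:

<pre>import sys

EX_TEMPFAIL = 75  # from sysexits.h; the exact value crunch-dispatch
                  # recognizes as "retry me" is an assumption here

def finish(succeeded, allocation_lost):
    if succeeded:
        sys.exit(0)
    if allocation_lost:
        # The allocation died, not the job itself: tell crunch-dispatch
        # to re-queue the job under a fresh SLURM allocation.
        sys.exit(EX_TEMPFAIL)
    sys.exit(1)  # genuine job failure; do not retry
</pre>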
