Bug #4410
Updated by Brett Smith about 10 years ago
crunch-job currently has logic to detect node failures and retry the task that was running when they happen. At least sometimes, though, this strategy is doomed to fail: for certain kinds of node failures, SLURM automatically revokes the job allocation, so there is no way to retry the task under it.
This just happened to qr1hi-8i9sb-3o4sdx3ekdmck42. The job was writing progress normally, then went silent for five minutes, and then this appeared in the crunch-dispatch logs:
<pre>2014-11-04_11:16:29.02602 qr1hi-8i9sb-3o4sdx3ekdmck42 ! salloc: Job allocation 2164 has been revoked.
2014-11-04_11:16:29.02607 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: error: Node failure on compute18
2014-11-04_11:16:29.02612 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 backing off node compute18 for 60 seconds
2014-11-04_11:16:29.02616 qr1hi-8i9sb-3o4sdx3ekdmck42 ! salloc: error: Node failure on compute18
2014-11-04_11:16:29.02621 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 21038 on compute18.1 exit 0 success=
2014-11-04_11:16:29.02626 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 failure (#1, temporary ) after 1635 seconds
2014-11-04_11:16:29.02631 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 output
2014-11-04_11:16:29.02635 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 Every node has failed -- giving up on this round
2014-11-04_11:16:29.02639 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 wait for last 0 children to finish
2014-11-04_11:16:29.59793 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 status: 9 done, 0 running, 2 todo
2014-11-04_11:16:29.59797 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 start level 1
2014-11-04_11:16:29.59798 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 status: 9 done, 0 running, 2 todo
2014-11-04_11:16:29.59800 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 job_task qr1hi-ot0gb-sjnzw4i7fyf64o5
2014-11-04_11:16:29.59802 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 29199 started on compute18.1
2014-11-04_11:16:29.59804 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: error: SLURM job 2164 has expired.
2014-11-04_11:16:29.59805 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2014-11-04_11:16:29.59807 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 29199 on compute18.1 exit 1 success=false
2014-11-04_11:16:29.59811 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 failure (#2, temporary ) after 0 seconds
2014-11-04_11:16:29.59813 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 output
2014-11-04_11:16:29.59815 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 Every node has failed -- giving up on this round
2014-11-04_11:16:29.59817 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 wait for last 0 children to finish
2014-11-04_11:16:30.29339 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 status: 9 done, 0 running, 2 todo
2014-11-04_11:16:30.29343 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 release job allocation
2014-11-04_11:16:30.29349 qr1hi-8i9sb-3o4sdx3ekdmck42 ! scancel: error: Kill job error on job id 2164: Invalid job id specified
</pre>
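For reference, the signal that distinguishes this case from an ordinary node failure is the pair of messages in the log above ("Job allocation ... has been revoked" and "SLURM job ... has expired"), not the node-failure message itself. Below is a minimal sketch of telling them apart, written in Python purely for illustration; the function name and patterns are assumptions for this write-up, not crunch-job's actual code.
<pre>import re

# Patterns lifted from the srun/salloc messages in the log above.
REVOKED_RE = re.compile(r"Job allocation \d+ has been revoked"
                        r"|SLURM job \d+ has expired")
NODE_FAIL_RE = re.compile(r"Node failure on (\S+)")

def classify_srun_stderr(line):
    """Decide whether an srun stderr line means the whole allocation is
    gone (retrying inside it is pointless) or a single node failed
    (back off that node and retry elsewhere)."""
    if REVOKED_RE.search(line):
        return ("allocation_revoked", None)
    m = NODE_FAIL_RE.search(line)
    if m:
        return ("node_failure", m.group(1))
    return ("other", None)
</pre>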
Note that crunch-job takes a different branch if the job runs parallel tasks. In that case the root problem appears to be the same, but crunch-job sees every task on the node fail at once, writes off the node, and stops the job outright ("8 tasks failed and none succeeded"). More crunch-dispatch logs, starting about 5m30s after the last crunchstat entry:
<pre>2014-11-05_22:58:09.32582 qr1hi-8i9sb-v6drlfmnikkkatu ! salloc: Job allocation 2204 has been revoked.
2014-11-05_22:58:09.35447 qr1hi-8i9sb-v6drlfmnikkkatu ! salloc: error: Node failure on compute3
2014-11-05_22:58:09.42573 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42580 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42584 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42588 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42593 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42597 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42601 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42607 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42611 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42615 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42619 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42623 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42628 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42632 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42637 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42642 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.47930 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 child 24108 on compute3.1 exit 0 success=
2014-11-05_22:58:09.66058 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 failure (#1, temporary ) after 1036 seconds
2014-11-05_22:58:10.14412 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 output
2014-11-05_22:58:10.14423 qr1hi-8i9sb-v6drlfmnikkkatu 12516 Every node has failed -- giving up on this round
2014-11-05_22:58:10.14431 qr1hi-8i9sb-v6drlfmnikkkatu 12516 wait for last 7 children to finish
2014-11-05_22:58:10.14440 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 child 24117 on compute3.2 exit 0 success=
2014-11-05_22:58:10.14447 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 failure (#1, temporary ) after 1036 seconds
2014-11-05_22:58:10.21059 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 output
2014-11-05_22:58:10.28216 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 child 24126 on compute3.3 exit 0 success=
2014-11-05_22:58:10.51532 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:10.64624 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 output
2014-11-05_22:58:10.69064 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 child 24139 on compute3.4 exit 0 success=
2014-11-05_22:58:10.99143 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:10.99241 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 output
2014-11-05_22:58:11.03886 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 child 24156 on compute3.5 exit 0 success=
2014-11-05_22:58:11.22759 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:11.30891 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 output
2014-11-05_22:58:11.35442 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 child 24171 on compute3.6 exit 0 success=
2014-11-05_22:58:11.58957 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:11.69667 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 output
2014-11-05_22:58:12.09560 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 child 24181 on compute3.7 exit 0 success=
2014-11-05_22:58:12.09567 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:12.17547 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 output
2014-11-05_22:58:12.23033 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 child 24190 on compute3.8 exit 0 success=
2014-11-05_22:58:12.46902 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 failure (#1, temporary ) after 1038 seconds
2014-11-05_22:58:12.58698 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 output
2014-11-05_22:58:12.75142 qr1hi-8i9sb-v6drlfmnikkkatu 12516 status: 1 done, 0 running, 12 todo
2014-11-05_22:58:12.75144 qr1hi-8i9sb-v6drlfmnikkkatu 12516 stop because 8 tasks failed and none succeeded
2014-11-05_22:58:13.12727 qr1hi-8i9sb-v6drlfmnikkkatu 12516 release job allocation
2014-11-05_22:58:13.12735 qr1hi-8i9sb-v6drlfmnikkkatu ! scancel: error: Kill job error on job id 2204: Invalid job id specified
</pre>
In cases like this (in either branch), crunch-job should exit with a tempfail status to signal to crunch-dispatch that the job can be retried once it acquires a new SLURM allocation.
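One possible shape for that, sketched in Python purely for illustration: the exit code, the function names, and the dispatcher hook below are all assumptions for this sketch, not the existing crunch-job or crunch-dispatch interfaces.
<pre>import sys

EX_TEMPFAIL = 75  # sysexits.h convention, chosen here only as an example code

def end_of_round(allocation_revoked, tasks_remaining):
    """Hypothetical end-of-round hook on the crunch-job side: if SLURM
    revoked the allocation out from under us and work remains, exit
    tempfail instead of marking the job permanently failed."""
    if allocation_revoked and tasks_remaining > 0:
        print("SLURM allocation revoked; exiting tempfail so the job "
              "can be retried under a new allocation", file=sys.stderr)
        sys.exit(EX_TEMPFAIL)

def handle_crunch_job_exit(status, requeue_job, fail_job):
    """Hypothetical crunch-dispatch side: interpret the exit status."""
    if status == EX_TEMPFAIL:
        requeue_job()   # acquire a fresh allocation and run crunch-job again
    elif status != 0:
        fail_job()      # treat everything else as a genuine failure
</pre>
Whatever exit-code convention crunch-dispatch already recognizes would apply; the point is only that "allocation revoked" propagates as a retryable condition rather than a job failure.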