Bug #4410
Updated by Brett Smith over 9 years ago
h2. Change to implement
(Draft) approach to the "retry" vs. "state machine" problem:
* crunch-job exits EX_RETRY_UNLOCKED if it gets a SLURM-related TEMPFAIL from task 0. This must be different from EX_TEMPFAIL so crunch-dispatch can distinguish "I got the lock, retry at will" from "job already locked by another dispatch/job process".
* crunch-dispatch retries the job (at most N times) if crunch-job exits EX_RETRY_UNLOCKED; a rough sketch of this handling follows the caveat below.
Caveat:
* This won't be able to fail over to a different crunch-dispatch process. That would require an "unlock and put back in queue" API plus a new job attribute to track the number of attempts.
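A minimal sketch of the dispatch-side handling described above, written in Python purely for illustration (crunch-dispatch itself is Ruby); the numeric value of EX_RETRY_UNLOCKED and the retry limit are assumptions, not settled choices:
<pre>
import subprocess

EX_TEMPFAIL = 75        # standard sysexits.h "temporary failure" code
EX_RETRY_UNLOCKED = 93  # assumed value; anything distinct from EX_TEMPFAIL works
MAX_RETRIES = 3         # "at most N times"; the value of N is not decided here

def dispatch(job_uuid, crunch_job_cmd):
    """Run crunch-job for one job, retrying only on EX_RETRY_UNLOCKED."""
    for attempt in range(1, MAX_RETRIES + 1):
        print("%s: attempt %d" % (job_uuid, attempt))
        status = subprocess.call(crunch_job_cmd)
        if status == EX_RETRY_UNLOCKED:
            # crunch-job held the lock but hit a SLURM-level temporary
            # failure; this same dispatcher can safely re-run the job.
            continue
        # EX_TEMPFAIL means "already locked by another dispatch/job process",
        # so it falls through here and is NOT retried by this dispatcher.
        return status
    return EX_RETRY_UNLOCKED  # retries exhausted
</pre>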
Logs:
* The easy answer is to overwrite attempt 1's log manifest with attempt 2's log manifest, but this would be confusing: the event bus would disagree with the "log" attribute, and in the long term the earlier attempts' logs would be lost.
* Possible solution: in crunch-job, instead of blindly overwriting the log attribute, read the existing log manifest from the API server, concatenate the new log manifest to it, and save the resulting combined manifest. This will involve updating @log_writer_*@ to use @arv-put --manifest@.
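A rough sketch of the concatenation step, assuming the manifest texts have already been fetched; the surrounding glue (reading the collection named by the job's current log attribute, writing the combined manifest back, e.g. via @arv-put --manifest@) is only outlined in comments:
<pre>
def combined_log_manifest(old_manifest_text, new_manifest_text):
    """Append this attempt's log manifest to the previous attempts' manifest.

    Keep manifests are newline-delimited stream lines, so combining two of
    them is plain text concatenation; earlier attempts' log streams stay
    listed instead of being overwritten.
    """
    if not old_manifest_text:
        return new_manifest_text
    if not old_manifest_text.endswith("\n"):
        old_manifest_text += "\n"
    return old_manifest_text + new_manifest_text

# Hypothetical glue, standing in for the actual crunch-job change:
#   old = manifest text of the collection named by the job's "log" attribute
#   new = manifest text produced by this attempt (via `arv-put --manifest`)
#   save combined_log_manifest(old, new) and update the "log" attribute
</pre>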
h2. Original bug report
crunch-job currently has logic to detect node failures and retry the tasks that were running when one happens. However, at least sometimes, this strategy is doomed to fail: for certain kinds of node failure, SLURM automatically revokes the job allocation, meaning there is no way to retry the task within the same allocation.
This just happened to qr1hi-8i9sb-3o4sdx3ekdmck42. The job was writing progress normally, then there was radio silence for five minutes, then the following appeared in the crunch-dispatch logs:
<pre>2014-11-04_11:16:29.02602 qr1hi-8i9sb-3o4sdx3ekdmck42 ! salloc: Job allocation 2164 has been revoked.
2014-11-04_11:16:29.02607 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: error: Node failure on compute18
2014-11-04_11:16:29.02612 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 backing off node compute18 for 60 seconds
2014-11-04_11:16:29.02616 qr1hi-8i9sb-3o4sdx3ekdmck42 ! salloc: error: Node failure on compute18
2014-11-04_11:16:29.02621 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 21038 on compute18.1 exit 0 success=
2014-11-04_11:16:29.02626 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 failure (#1, temporary ) after 1635 seconds
2014-11-04_11:16:29.02631 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 output
2014-11-04_11:16:29.02635 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 Every node has failed -- giving up on this round
2014-11-04_11:16:29.02639 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 wait for last 0 children to finish
2014-11-04_11:16:29.59793 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 status: 9 done, 0 running, 2 todo
2014-11-04_11:16:29.59797 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 start level 1
2014-11-04_11:16:29.59798 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 status: 9 done, 0 running, 2 todo
2014-11-04_11:16:29.59800 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 job_task qr1hi-ot0gb-sjnzw4i7fyf64o5
2014-11-04_11:16:29.59802 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 29199 started on compute18.1
2014-11-04_11:16:29.59804 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: error: SLURM job 2164 has expired.
2014-11-04_11:16:29.59805 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2014-11-04_11:16:29.59807 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 29199 on compute18.1 exit 1 success=false
2014-11-04_11:16:29.59811 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 failure (#2, temporary ) after 0 seconds
2014-11-04_11:16:29.59813 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 output
2014-11-04_11:16:29.59815 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 Every node has failed -- giving up on this round
2014-11-04_11:16:29.59817 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 wait for last 0 children to finish
2014-11-04_11:16:30.29339 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 status: 9 done, 0 running, 2 todo
2014-11-04_11:16:30.29343 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 release job allocation
2014-11-04_11:16:30.29349 qr1hi-8i9sb-3o4sdx3ekdmck42 ! scancel: error: Kill job error on job id 2164: Invalid job id specified
</pre>
Note that crunch-job takes a different branch if the job runs parallel tasks. In that case, the root problem seems to be the same, but crunch-job sees the multiple consecutive failures and writes off the node completely. More crunch-dispatch logs, about 5m30s after the last crunchstat log:
<pre>2014-11-05_22:58:09.32582 qr1hi-8i9sb-v6drlfmnikkkatu ! salloc: Job allocation 2204 has been revoked.
2014-11-05_22:58:09.35447 qr1hi-8i9sb-v6drlfmnikkkatu ! salloc: error: Node failure on compute3
2014-11-05_22:58:09.42573 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42580 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42584 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42588 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42593 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42597 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42601 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42607 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42611 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42615 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42619 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42623 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42628 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42632 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42637 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42642 qr1hi-8i9sb-v6drlfmnikkkatu 12516 backing off node compute3 for 60 seconds
2014-11-05_22:58:09.47930 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 child 24108 on compute3.1 exit 0 success=
2014-11-05_22:58:09.66058 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 failure (#1, temporary ) after 1036 seconds
2014-11-05_22:58:10.14412 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 output
2014-11-05_22:58:10.14423 qr1hi-8i9sb-v6drlfmnikkkatu 12516 Every node has failed -- giving up on this round
2014-11-05_22:58:10.14431 qr1hi-8i9sb-v6drlfmnikkkatu 12516 wait for last 7 children to finish
2014-11-05_22:58:10.14440 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 child 24117 on compute3.2 exit 0 success=
2014-11-05_22:58:10.14447 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 failure (#1, temporary ) after 1036 seconds
2014-11-05_22:58:10.21059 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 output
2014-11-05_22:58:10.28216 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 child 24126 on compute3.3 exit 0 success=
2014-11-05_22:58:10.51532 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:10.64624 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 output
2014-11-05_22:58:10.69064 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 child 24139 on compute3.4 exit 0 success=
2014-11-05_22:58:10.99143 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:10.99241 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 output
2014-11-05_22:58:11.03886 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 child 24156 on compute3.5 exit 0 success=
2014-11-05_22:58:11.22759 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:11.30891 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 output
2014-11-05_22:58:11.35442 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 child 24171 on compute3.6 exit 0 success=
2014-11-05_22:58:11.58957 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:11.69667 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 output
2014-11-05_22:58:12.09560 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 child 24181 on compute3.7 exit 0 success=
2014-11-05_22:58:12.09567 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:12.17547 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 output
2014-11-05_22:58:12.23033 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 child 24190 on compute3.8 exit 0 success=
2014-11-05_22:58:12.46902 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 failure (#1, temporary ) after 1038 seconds
2014-11-05_22:58:12.58698 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 output
2014-11-05_22:58:12.75142 qr1hi-8i9sb-v6drlfmnikkkatu 12516 status: 1 done, 0 running, 12 todo
2014-11-05_22:58:12.75144 qr1hi-8i9sb-v6drlfmnikkkatu 12516 stop because 8 tasks failed and none succeeded
2014-11-05_22:58:13.12727 qr1hi-8i9sb-v6drlfmnikkkatu 12516 release job allocation
2014-11-05_22:58:13.12735 qr1hi-8i9sb-v6drlfmnikkkatu ! scancel: error: Kill job error on job id 2204: Invalid job id specified
</pre>
In cases like this, crunch-job should exit with a tempfail status, signaling to crunch-dispatch that the job can be retried once a new SLURM allocation is acquired.
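For illustration only (crunch-job itself is Perl), a sketch of classifying srun/salloc stderr so that a revoked or expired allocation leads to the retryable exit proposed above; the patterns are taken from the logs in this report and are certainly not exhaustive:
<pre>
import re
import sys

EX_RETRY_UNLOCKED = 93  # assumed value, matching the sketch above

# Messages seen in the logs above when the SLURM allocation is gone.
ALLOCATION_LOST_PATTERNS = [
    re.compile(r"salloc: Job allocation \d+ has been revoked"),
    re.compile(r"srun: error: SLURM job \d+ has expired"),
]

def slurm_allocation_lost(stderr_line):
    """Return True if this stderr line indicates the allocation was revoked."""
    return any(p.search(stderr_line) for p in ALLOCATION_LOST_PATTERNS)

def handle_task_stderr(stderr_line):
    # Instead of retrying tasks inside a dead allocation ("Every node has
    # failed -- giving up on this round"), exit so crunch-dispatch can
    # re-run the whole job under a fresh allocation.
    if slurm_allocation_lost(stderr_line):
        sys.exit(EX_RETRY_UNLOCKED)
</pre>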