Bug #4410

[Crunch] crunch-job should exit tempfail when a SLURM node fails

Added by Brett Smith almost 7 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
Crunch
Target version:
Start date:
11/04/2014
Due date:
% Done:

100%

Estimated time:
(Total: 4.00 h)
Story points:
2.0

Description

Change to implement

(Draft) approach to the "retry" vs. "state machine" problem:
  • crunch-job exits EX_RETRY_UNLOCKED if it gets a slurm-related TEMPFAIL from task 0. This must be different than EX_TEMPFAIL so crunch-dispatch can distinguish "I got the lock, retry at will" from "job already locked by another dispatch/job process".
  • crunch-dispatch retries the job (at most N times) if crunch-job exits EX_RETRY_UNLOCKED.
Caveat:
  • This won't be able to fail over to a different crunch-dispatch process. That would require an "unlock and put back in queue" API plus a new job attribute to track the number of attempts.
Logs:
  • The easy answer is to overwrite attempt 1's log manifest with attempt 2's log manifest, but this would be confusing: the event bus would disagree with the "log" attribute, and in the long term the earlier attempts' logs would be lost.
  • Possible solution: in crunch-job, instead of blindly overwriting the log attribute, read the existing log manifest from the API server, concatenate the new log manifest to it, and save the resulting combined manifest. This will involve updating log_writer_* to use arv-put --manifest.

Original bug report

crunch-job currently has logic to detect node failures, and retry the task that was running when they happen. However, at least sometimes, this strategy is doomed to fail: for certain kinds of node failures, SLURM automatically revokes the job authorization, meaning there's no way to retry the job.

This just happened to qr1hi-8i9sb-3o4sdx3ekdmck42. The job was writing progress normally, then there was radio silence for five minutes, then this happened, from the crunch-dispatch logs:

2014-11-04_11:16:29.02602 qr1hi-8i9sb-3o4sdx3ekdmck42 ! salloc: Job allocation 2164 has been revoked.
2014-11-04_11:16:29.02607 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: error: Node failure on compute18
2014-11-04_11:16:29.02612 qr1hi-8i9sb-3o4sdx3ekdmck42 10454  backing off node compute18 for 60 seconds
2014-11-04_11:16:29.02616 qr1hi-8i9sb-3o4sdx3ekdmck42 ! salloc: error: Node failure on compute18
2014-11-04_11:16:29.02621 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 21038 on compute18.1 exit 0 success=
2014-11-04_11:16:29.02626 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 failure (#1, temporary ) after 1635 seconds
2014-11-04_11:16:29.02631 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 output
2014-11-04_11:16:29.02635 qr1hi-8i9sb-3o4sdx3ekdmck42 10454  Every node has failed -- giving up on this round
2014-11-04_11:16:29.02639 qr1hi-8i9sb-3o4sdx3ekdmck42 10454  wait for last 0 children to finish
2014-11-04_11:16:29.59793 qr1hi-8i9sb-3o4sdx3ekdmck42 10454  status: 9 done, 0 running, 2 todo
2014-11-04_11:16:29.59797 qr1hi-8i9sb-3o4sdx3ekdmck42 10454  start level 1
2014-11-04_11:16:29.59798 qr1hi-8i9sb-3o4sdx3ekdmck42 10454  status: 9 done, 0 running, 2 todo
2014-11-04_11:16:29.59800 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 job_task qr1hi-ot0gb-sjnzw4i7fyf64o5
2014-11-04_11:16:29.59802 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 29199 started on compute18.1
2014-11-04_11:16:29.59804 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: error: SLURM job 2164 has expired.
2014-11-04_11:16:29.59805 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 stderr srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
2014-11-04_11:16:29.59807 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 child 29199 on compute18.1 exit 1 success=false
2014-11-04_11:16:29.59811 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 failure (#2, temporary ) after 0 seconds
2014-11-04_11:16:29.59813 qr1hi-8i9sb-3o4sdx3ekdmck42 10454 9 output
2014-11-04_11:16:29.59815 qr1hi-8i9sb-3o4sdx3ekdmck42 10454  Every node has failed -- giving up on this round
2014-11-04_11:16:29.59817 qr1hi-8i9sb-3o4sdx3ekdmck42 10454  wait for last 0 children to finish
2014-11-04_11:16:30.29339 qr1hi-8i9sb-3o4sdx3ekdmck42 10454  status: 9 done, 0 running, 2 todo
2014-11-04_11:16:30.29343 qr1hi-8i9sb-3o4sdx3ekdmck42 10454  release job allocation
2014-11-04_11:16:30.29349 qr1hi-8i9sb-3o4sdx3ekdmck42 ! scancel: error: Kill job error on job id 2164: Invalid job id specified

Note that crunch-job takes a different branch if the job runs parallel tasks. In that case, the root problem seems to be the same, but crunch-job sees the multiple consecutive failures and writes off the node completely. More crunch-dispatch logs, about 5m30s after the last crunchstat log:

2014-11-05_22:58:09.32582 qr1hi-8i9sb-v6drlfmnikkkatu ! salloc: Job allocation 2204 has been revoked.
2014-11-05_22:58:09.35447 qr1hi-8i9sb-v6drlfmnikkkatu ! salloc: error: Node failure on compute3
2014-11-05_22:58:09.42573 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42580 qr1hi-8i9sb-v6drlfmnikkkatu 12516  backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42584 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42588 qr1hi-8i9sb-v6drlfmnikkkatu 12516  backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42593 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42597 qr1hi-8i9sb-v6drlfmnikkkatu 12516  backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42601 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42607 qr1hi-8i9sb-v6drlfmnikkkatu 12516  backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42611 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42615 qr1hi-8i9sb-v6drlfmnikkkatu 12516  backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42619 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42623 qr1hi-8i9sb-v6drlfmnikkkatu 12516  backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42628 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42632 qr1hi-8i9sb-v6drlfmnikkkatu 12516  backing off node compute3 for 60 seconds
2014-11-05_22:58:09.42637 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 stderr srun: error: Node failure on compute3
2014-11-05_22:58:09.42642 qr1hi-8i9sb-v6drlfmnikkkatu 12516  backing off node compute3 for 60 seconds
2014-11-05_22:58:09.47930 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 child 24108 on compute3.1 exit 0 success=
2014-11-05_22:58:09.66058 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 failure (#1, temporary ) after 1036 seconds
2014-11-05_22:58:10.14412 qr1hi-8i9sb-v6drlfmnikkkatu 12516 1 output
2014-11-05_22:58:10.14423 qr1hi-8i9sb-v6drlfmnikkkatu 12516  Every node has failed -- giving up on this round
2014-11-05_22:58:10.14431 qr1hi-8i9sb-v6drlfmnikkkatu 12516  wait for last 7 children to finish
2014-11-05_22:58:10.14440 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 child 24117 on compute3.2 exit 0 success=
2014-11-05_22:58:10.14447 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 failure (#1, temporary ) after 1036 seconds
2014-11-05_22:58:10.21059 qr1hi-8i9sb-v6drlfmnikkkatu 12516 2 output
2014-11-05_22:58:10.28216 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 child 24126 on compute3.3 exit 0 success=
2014-11-05_22:58:10.51532 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:10.64624 qr1hi-8i9sb-v6drlfmnikkkatu 12516 3 output
2014-11-05_22:58:10.69064 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 child 24139 on compute3.4 exit 0 success=
2014-11-05_22:58:10.99143 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:10.99241 qr1hi-8i9sb-v6drlfmnikkkatu 12516 4 output
2014-11-05_22:58:11.03886 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 child 24156 on compute3.5 exit 0 success=
2014-11-05_22:58:11.22759 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:11.30891 qr1hi-8i9sb-v6drlfmnikkkatu 12516 5 output
2014-11-05_22:58:11.35442 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 child 24171 on compute3.6 exit 0 success=
2014-11-05_22:58:11.58957 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:11.69667 qr1hi-8i9sb-v6drlfmnikkkatu 12516 6 output
2014-11-05_22:58:12.09560 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 child 24181 on compute3.7 exit 0 success=
2014-11-05_22:58:12.09567 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 failure (#1, temporary ) after 1037 seconds
2014-11-05_22:58:12.17547 qr1hi-8i9sb-v6drlfmnikkkatu 12516 7 output
2014-11-05_22:58:12.23033 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 child 24190 on compute3.8 exit 0 success=
2014-11-05_22:58:12.46902 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 failure (#1, temporary ) after 1038 seconds
2014-11-05_22:58:12.58698 qr1hi-8i9sb-v6drlfmnikkkatu 12516 8 output
2014-11-05_22:58:12.75142 qr1hi-8i9sb-v6drlfmnikkkatu 12516  status: 1 done, 0 running, 12 todo
2014-11-05_22:58:12.75144 qr1hi-8i9sb-v6drlfmnikkkatu 12516  stop because 8 tasks failed and none succeeded
2014-11-05_22:58:13.12727 qr1hi-8i9sb-v6drlfmnikkkatu 12516  release job allocation
2014-11-05_22:58:13.12735 qr1hi-8i9sb-v6drlfmnikkkatu ! scancel: error: Kill job error on job id 2204: Invalid job id specified

In cases like this, crunch-job should exit with tempfail status, to signal to crunch-dispatch that the job can be retried once it acquires a new SLURM allocation.


Subtasks

Task #6283: Review 4410-slurm-fails-are-tempfails-wipResolvedBrett Smith


Related issues

Related to Arvados - Story #3795: [Crunch/SDKs] Tasks need more retry supportClosed09/03/2014

Related to Arvados - Story #4599: [Crunch] crunch-job should not retry tasks after its SLURM allocation is revokedClosed

Associated revisions

Revision ad7679cf
Added by Brett Smith over 6 years ago

Merge branch '4410-slurm-fails-are-tempfails-wip'

Closes #4410, #6283.

History

#1 Updated by Bryan Cosca almost 7 years ago

#2 Updated by Ward Vandewege almost 7 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints

#3 Updated by Bryan Cosca almost 7 years ago

Here is a job that crashed because of compute3:

qr1hi-8i9sb-v6drlfmnikkkatu

#4 Updated by Brett Smith almost 7 years ago

  • Priority changed from Normal to High
  • Target version changed from Arvados Future Sprints to Bug Triage

#5 Updated by Brett Smith almost 7 years ago

  • Description updated (diff)

#6 Updated by Brett Smith almost 7 years ago

I spent some time looking at the recent jobs on qr1hi to see if I could piece together a cause. Nothing definitive yet, but some interesting results. Short identifiers here are the first characters of the last section of the job UUIDs in question.

3o4s, b9bs, and ziak all ran on the 4th, all ran on compute18, and all failed. However, several other jobs ran on compute18 throughout the day and succeeded. There is some chance that these didn't all run on the same compute18, but b9bs and ziak definitely did, and they're bookended by successful jobs that also did (qr1hi-8i9sb-krshn315rkb3d0n and qr1hi-8i9sb-jr1lvqj8xskwf41). So whatever was causing the SLURM error, it was not permanent, either on SLURM or the underlying node.

While all this was going on, qr1hi-8i9sb-8nbz2sjg56vizq4 took a while and ran to completion on qr1hi (started_at 2014-11-04T16:20:24Z, finished_at 2014-11-05T05:22:16Z). It took more time and resources than most of the failed jobs, which errored out around the 20-minute mark. (3o4s is the exception, erroring out after 11 hours.)

grepping the Crunch logs on qr1hi shows that this is happening even more than has been reported, and that while there's been a definite uptick recently, it's not wholly new—there are incidents on the 24th and 30th as well. Right now it looks like the tempfail exit suggested here would actually be a useful improvement, as long as crunch-dispatch doesn't immediately reassign the same job to the same node. Retrying and failing every time would be sad.

#7 Updated by Brett Smith almost 7 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith
  • Target version changed from Bug Triage to 2014-11-19 sprint

#8 Updated by Brett Smith almost 7 years ago

The fix for this is not simple.

Node allocation is handled by crunch-dispatch. It has all the logic to figure out which nodes can satisfy a job, actually do the salloc, and cope if that doesn't work, so we really want to leave all these decisions to crunch-dispatch.

However, crunch-dispatch's current tempfail handling is written with the assumption that we're in one of two cases: either crunch-job couldn't start the job at all, or we lost a race with another crunch-dispatch and some other crunch-job is running it. Because of that, it simply leaves the job in its current state. If we couldn't start the job at all, it's still queued, and we'll pick it up again on the next go-round. If we lost a race, then the other crunch-job will see it through to completion.

This behavior is not appropriate to handle cases like in this bug. Here, we started running a job, but now there's been a temporary failure and we need to re-queue it. You might think crunch-job could put the job back in the queued state before it exits, but the Job model prohibits going from the Running state back to Queued, so that's out. And crunch-dispatch has no concept of a queue outside of what the API server presents, so bypassing the API server and doing our own retrying would require adding a whole new source of jobs.

The simplest fix would be to relax the state transition restriction in the Job model, then replace all of crunch-job's current SLURM retry code with "set the job state to queued and exit tempfail." But I don't know if that's unacceptable for other reasons.

#9 Updated by Brett Smith almost 7 years ago

There wasn't a lot of love for my last suggestion. We have all this locking and state tracking code for a reason, to avoid other kinds of job state bugs. It doesn't seem worth it for the brute force solution of retrying entire jobs because of a single SLURM task failure.

One idea suggested on IRC is to reconfigure SLURM to cancel jobs less aggressively in these cases. And there's definitely sense to that; other components of ours (like Node Manager) conservatively try to avoid losing significant compute work, so it would make sense to have SLURM follow the same philosophy.

It looks like the compute nodes are losing contact with the SLURM controller long enough that SLURM's SlurmdTimeout passes, which causes the controller to set the node's state DOWN. This is consistent with both the job logs (where we stop getting diagnostic output) and SLURM's own daemon logs. Our current SlurmdTimeout is five minutes, which is also consistent with the timing in the job logs.

We could increase SlurmdTimeout. However, it's worth bearing in mind that this is the only way that SLURM figures out a node is down right now. When we handle #4380 (also on this sprint), Node Manager will more proactively notify SLURM about intentional node shutdowns, so the normal case wouldn't be much of an issue. However, it does mean nodes would be in an indeterminate state for longer in real cases of complete node failure.

#10 Updated by Tom Clegg almost 7 years ago

  • Target version deleted (2014-11-19 sprint)

#11 Updated by Tom Clegg almost 7 years ago

  • Priority changed from High to Normal

#12 Updated by Tom Clegg almost 7 years ago

  • Target version set to Deferred

#13 Updated by Tom Clegg over 6 years ago

We recently ran a set of jobs on a cloud-backed site which caused so much node churn that the one "compute node failed" event was [a] real, as far as we could tell (it wasn't just slurmctld being sensitive), and [b] sufficiently rare relative to the amount of hardware involved that "run everything again and hope you're not so unlucky this time" would not have been a reasonable response.

One reason node churn was so high was that the jobs used the pattern Crunch "pipelines-and-tasks-are-just-jobs" v2 will use: i.e., instead of using crunch-job's mapreduce features, jobs simply queued other jobs and waited for them to finish. With this approach, crunch-job only ever runs one task (number 0). This represents a special error-handling case where we have an opportunity to retry with a different slurm allocation: "retry the whole job" wastes no more compute time than "retry task 0".

There are two implementation snags to resolve:
  1. Job locking. See note-8 above.
  2. Counting job failures. If we don't record the number of attempts, a job that kills off its node (by exhausting RAM or disk space, for example) will retry indefinitely. If we go the rest of the way and handle all task 0 tempfails by going back to crunch-dispatch and trying from there (not just slurm node failures), any temp-fail could end up retrying forever.

#14 Updated by Tom Clegg over 6 years ago

(Draft) approach to the "retry" vs. "state machine" problem:
  • crunch-job exits EX_RETRY_UNLOCKED if it gets a slurm-related TEMPFAIL from task 0. This must be different than EX_TEMPFAIL so crunch-dispatch can distinguish "I got the lock, retry at will" from "job already locked by another dispatch/job process".
  • crunch-dispatch retries the job (at most N times) if crunch-job exits EX_RETRY_UNLOCKED.
Caveat:
  • This won't be able to fail over to a different crunch-dispatch process. That would require an "unlock and put back in queue" API plus a new job attribute to track the number of attempts.
Logs:
  • The easy answer is to overwrite attempt 1's log manifest with attempt 2's log manifest, but this would be confusing: the event bus would disagree with the "log" attribute, and in the long term the earlier attempts' logs would be lost.
  • Possible solution: in crunch-job, instead of blindly overwriting the log attribute, read the existing log manifest from the API server, concatenate the new log manifest to it, and save the resulting combined manifest. This will involve updating log_writer_* to use arv-put --manifest.

#15 Updated by Brett Smith over 6 years ago

  • Description updated (diff)
  • Assigned To deleted (Brett Smith)
  • Target version changed from Deferred to 2015-07-08 sprint
  • Story points set to 2.0

#16 Updated by Brett Smith over 6 years ago

  • Status changed from In Progress to New

#17 Updated by Brett Smith over 6 years ago

  • Assigned To set to Brett Smith

#18 Updated by Brett Smith over 6 years ago

  • Status changed from New to In Progress

#19 Updated by Brett Smith over 6 years ago

4410-slurm-fails-are-tempfails-wip is up for review. One way to test it is to hack crunch-job to have this block at the top of the THISROUND for loop:

  if (!defined($Job->{log})) {
    Log(undef, "simulating total node failure");
    $working_slot_count = 0;
    last THISROUND;
  }

crunch-dispatch will retry whatever job you submit, and you'll see that message in the logs. You can modify this code to taste to check crunch-dispatch's retry counting, etc.

#20 Updated by Tom Clegg over 6 years ago

At 4db7620, just covering the crunch-job part:

bf207e24d447248b90c25cfdb77a82e85a1fb02c added "arvados.errors.Keep" to the "signs of node failure" regexp. Would this be a good time to revisit that, now that the distinction is even more significant?

Inherited bug, I imagine you were sticking with "preserve existing behavior", but surely this should say "log collection is {pdh}" or "log collection portable_data_hash is {pdh}", not "log manifest is ...":
  • +  Log(undef, "log manifest is " . $log_coll->{portable_data_hash});
    
I think the second line here makes the first line superfluous. Reading the SDK, it seems update_attributes() doesn't update anything locally if the response to the update API call doesn't include the updated attribute, but if we're worried about that IMO we should fix the SDK instead of starting down this road:
  • +  $Job->{'log'} = $log_coll->{portable_data_hash};
    +  $Job->update_attributes('log', $log_coll->{portable_data_hash});
    

Is there a reason we shouldn't call create_output_collection() to rescue partial outputs before exiting 93, in case this is our last/only attempt?

#21 Updated by Brett Smith over 6 years ago

Tom Clegg wrote:

bf207e24d447248b90c25cfdb77a82e85a1fb02c added "arvados.errors.Keep" to the "signs of node failure" regexp. Would this be a good time to revisit that, now that the distinction is even more significant?

I guess my thoughts on that are: The new retry strategy implemented in this branch is "more significant" mainly in that it's higher overhead. On the other hand, in some common cases, this seems more likely achieve the desired result. If the job is raising Keep errors because too many Keep servers are cranky, kicking the job up back to the crunch-dispatch queue to be redeployed will give them more time to settle down and recover than just immediately retrying the same task level loop.

If it's too high overhead for you, that's fair, but then what do you want the new strategy to be? Does the branch need to count Keep failures separately, and retain the "retry the level" behavior for this case? Or do you want to just revert bf207e24?

I implemented the rest of your suggested changes in e359c70 and successfully tested as before. Thanks.

#22 Updated by Tom Clegg over 6 years ago

crunch-dispatch side:

If crunch-dispatch gets SIGTERM while @todo_job_retries is non-empty, it should fail them -- otherwise they'll stay locked but marked "Running". Unless this is already handled somewhere I don't see it, I think this can be done safely as a cleanup step at the end of run() (whereas doing it as soon as SIGTERM is received would allow "exit 93" to win a race with "send SIGINT" and leave the job Running).

Aside: it looks like the !j[:started] condition in the "kill" block was meant to avoid killing jobs once they've started, but (judging by git history) this has never worked because nothing ever sets j[:started]...?

(tbc...)

#23 Updated by Tom Clegg over 6 years ago

Brett Smith wrote:

I guess my thoughts on that are: The new retry strategy implemented in this branch is "more significant" mainly in that it's higher overhead. On the other hand, in some common cases, this seems more likely achieve the desired result. If the job is raising Keep errors because too many Keep servers are cranky, kicking the job up back to the crunch-dispatch queue to be redeployed will give them more time to settle down and recover than just immediately retrying the same task level loop.

Sure, the main distinction is that now we'll retry everything, instead of just needlessly banning nodes and retrying only the failed task(s). AFAICT treating "keep error" as a node failure was a convenient way to make $temporary_fail be true elsewhere in the code, and the side effects of blacklisting the node weren't really considered.

I think the fix is something like this, but in any case if your answer is "no, fixing the existing bug shouldn't hold up this branch" then I'm fine with that.

@@ -952,6 +952,7 @@ for (my $todo_ptr = 0; $todo_ptr <= $#jobstep_todo; $todo_ptr ++)
   $Jobstep->{slotindex} = $childslot;
   delete $Jobstep->{stderr};
   delete $Jobstep->{finishtime};
+  delete $Jobstep->{tempfail};

   $Jobstep->{'arvados_task'}->{started_at} = strftime "%Y-%m-%dT%H:%M:%SZ", gmtime($Jobstep->{starttime});
   $Jobstep->{'arvados_task'}->save;
@@ -1134,7 +1135,7 @@ sub reapchildren
   if (!$task_success)
   {
     my $temporary_fail;
-    $temporary_fail ||= $Jobstep->{node_fail};
+    $temporary_fail ||= $Jobstep->{tempfail};
     $temporary_fail ||= ($exitvalue == TASK_TEMPFAIL);

     ++$thisround_failed;
@@ -1382,8 +1383,11 @@ sub preprocess_stderr
       # whoa.
       $main::please_freeze = 1;
     }
-    elsif ($line =~ /(srun: error: (Node failure on|Unable to create job step|.*: Communication connection failure))|arvados.errors.Keep/) {
-      $jobstep[$job]->{node_fail} = 1;
+    elsif ($line =~ /arvados\.errors\.Keep/) {
+      $jobstep[$job]->{tempfail} = 1;
+    }
+    elsif ($line =~ /srun: error: (Node failure on|Unable to create job step|.*: Communication connection failure)/) {
+      $jobstep[$job]->{tempfail} = 1;
       ban_node_by_slot($jobstep[$job]->{slotindex});
     }
   }

#24 Updated by Tom Clegg over 6 years ago

No difference to behavior, but as an explanation, this comment should be the "else" part of if jobrecord.state == "Running", rather than if !exit_tempfail.
  •     else
          # Don't fail the job if crunch-job didn't even get as far as
          # starting it. [...]
    

The rest @ e359c70 LGTM, thanks.

#25 Updated by Brett Smith over 6 years ago

Tom Clegg wrote:

crunch-dispatch side:

If crunch-dispatch gets SIGTERM while @todo_job_retries is non-empty, it should fail them -- otherwise they'll stay locked but marked "Running". Unless this is already handled somewhere I don't see it, I think this can be done safely as a cleanup step at the end of run() (whereas doing it as soon as SIGTERM is received would allow "exit 93" to win a race with "send SIGINT" and leave the job Running).

Done.

Aside: it looks like the !j[:started] condition in the "kill" block was meant to avoid killing jobs once they've started, but (judging by git history) this has never worked because nothing ever sets j[:started]...?

Agreed. I propose to file and handle this as a separate bug, because I have no idea what behavior we want. (I thought that on the first signal, crunch-dispatch was supposed to wait politely for everything to finish; and start killing its children on the second? But there's no evidence of that in the code.)

No difference to behavior, but as an explanation, this comment should be the "else" part of if jobrecord.state "Running", rather than if !exit_tempfail.

?? There is no "else" branch of if jobrecord.state "Running". I also previously extended the comment to talk about both the 75 and 93 exit codes. I've cleaned up the wording a little bit more to try to make clearer that it applies in multiple situations.

AFAICT treating "keep error" as a node failure was a convenient way to make $temporary_fail be true elsewhere in the code, and the side effects of blacklisting the node weren't really considered.

More generally, I feel like the code was conflating "errors that mean we want to retry" and "errors that suggest the node failed, so crunch-dispatch should retry." Starting from your diff, I split these out. Now instead of potentially giving up on a node because it has started a losing streak (which might be because of, e.g., Keep failures), we only give up if SLURM told us it failed, or if it's had too many temporary failures (as before). I'm not 100% sure I have exactly the right policy here, but I'm sure this is at least an improvement.

Now at faff88f.

#26 Updated by Tom Clegg over 6 years ago

No difference to behavior, but as an explanation, this comment should be the "else" part of if jobrecord.state "Running", rather than if !exit_tempfail.

?? There is no "else" branch of if jobrecord.state "Running". I also previously extended the comment to talk about both the 75 and 93 exit codes. I've cleaned up the wording a little bit more to try to make clearer that it applies in multiple situations.

OK, re-reading I think the new position for the updated comment is correct. I suppose the remaining comment would just be an "else" comment for jobrecord.state == "Running" mentioning that other possibilities include "Queued" ("If crunch-job didn't even get as far as starting the job") in addition to the more obvious "Failed" and "Complete" and "Cancelled". But I think this is sufficiently obvious/uninteresting.

I think this comment could also say "...or we got RETRY_UNLOCKED after reaching RETRY_UNLOCKED_LIMIT" (right?)
  • # Apparently there was an unhandled error.  Fail the job.
    

More generally, I feel like the code was conflating "errors that mean we want to retry" and "errors that suggest the node failed, so crunch-dispatch should retry." Starting from your diff, I split these out. Now instead of potentially giving up on a node because it has started a losing streak (which might be because of, e.g., Keep failures), we only give up if SLURM told us it failed, or if it's had too many temporary failures (as before). I'm not 100% sure I have exactly the right policy here, but I'm sure this is at least an improvement.

Yes this (and the other changes) LGTM too, thanks.

#27 Updated by Brett Smith over 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:ad7679cfe57733940f8461097ee01bfd97997ce6.

Also available in: Atom PDF