Story #3795

[Crunch/SDKs] Tasks need more retry support

Added by Brett Smith over 2 years ago. Updated 20 days ago.

Status: Closed
Priority: Normal
Assignee: -
Category: Crunch
Target version: -
Start date: 09/03/2014
Due date: -
% Done: 0%
Story points: -
Remaining (hours): 0.00 hour
Velocity based estimate: -

Description

crunch-job currently retries tasks under two conditions:

  • The task exits with the specific "temporary failure" exit code 111.
  • crunch-job sees errors from srun that suggest that the problem lies with the node rather than the script.
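The first condition above follows the common "temporary failure" exit-code convention. As a minimal sketch (hypothetical Python, not crunch-job's actual implementation), a supervisor deciding whether to retry might look like this; `should_retry` and `max_attempts` are illustrative names, not real crunch-job parameters:

```python
TEMPFAIL = 111  # conventional "temporary failure" exit code

def should_retry(exit_code, attempts, max_attempts=3):
    """Retry only tempfail exits, and only while attempts remain."""
    return exit_code == TEMPFAIL and attempts < max_attempts

# A tempfail exit with retry budget left is retried; a nonzero
# exit with any other code is treated as a permanent failure.
assert should_retry(TEMPFAIL, 1) is True
assert should_retry(1, 1) is False
assert should_retry(TEMPFAIL, 3) is False
```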

Recently we've seen a few temporary failure conditions that are not covered by these cases:

  • arv-mount fails to contact Keep, causing the Task to fail to read data early on.
  • Docker fails to start the container because of an intermittent bug (#3433).

We need to do more to ensure that crunch-job retries Tasks whenever it is appropriate.

Different parts of this problem could be addressed in different ways. For example, we can't modify Docker, so maybe crunch-job should introspect Docker's stderr for messages that indicate intermittent failure.
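Such introspection could be as simple as matching stderr against a list of known intermittent-failure messages. The sketch below is hypothetical: the regex patterns are placeholders, not a vetted list of real Docker error strings, and `looks_intermittent` is an invented helper name:

```python
import re

# Placeholder patterns for messages that suggest an intermittent
# Docker failure rather than a bug in the user's script.
INTERMITTENT_PATTERNS = [
    re.compile(r'Cannot start container .*temporarily unavailable'),
    re.compile(r'Error response from daemon: .*timeout'),
]

def looks_intermittent(stderr_text):
    """Return True if any stderr line matches a known-transient pattern."""
    return any(p.search(line)
               for line in stderr_text.splitlines()
               for p in INTERMITTENT_PATTERNS)
```

A supervisor could then treat a matching failure the same way it treats an exit code of 111.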

Code that's under our control like arv-mount could be made to conform to crunch-job's expectations. For example, maybe arv-mount should exit 111 if it encounters trouble talking to Keep.
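In that spirit, a minimal sketch of the proposed arv-mount behavior might map transient Keep trouble to exit code 111. Everything here is hypothetical: `KeepConnectionError` and `run` are placeholder names, not real SDK or arv-mount identifiers:

```python
TEMPFAIL = 111  # exit status crunch-job already treats as "retry me"

class KeepConnectionError(Exception):
    """Placeholder for a transient Keep connectivity error."""
    pass

def run(mount_fn):
    """Run mount_fn and map transient Keep trouble to exit code 111."""
    try:
        mount_fn()
    except KeepConnectionError:
        return TEMPFAIL  # tells the supervisor this task is worth retrying
    return 0

def flaky_mount():
    raise KeepConnectionError("cannot reach any Keep server")

assert run(flaky_mount) == TEMPFAIL
assert run(lambda: None) == 0
```

The real tool would call `sys.exit()` with the returned status; returning it makes the sketch testable.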

Putting more retry support in the underlying tools, like the Python SDK (#3147), can also reduce the need for this.
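SDK-level retries typically mean retrying a flaky call a few times with exponential backoff before giving up. The sketch below illustrates the idea only; `with_retries` is an invented helper, and the real Python SDK's retry API may look quite different:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn, retrying transient IOErrors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except IOError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Demonstrate with a call that fails twice, then succeeds.
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError("transient Keep error")
    return "ok"

assert with_retries(flaky) == "ok"
assert calls['n'] == 3
```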


Subtasks

Task #3801: crunch-job should treat intermittent Docker failures as t... (New)

Task #4321: [SDK] PySDK should arrange exit(111) when a temporary fai... (New)


Related issues

Related to Arvados - Bug #3147: [SDKs] Python clients should automatically retry failed A... Resolved 08/22/2014
Related to Arvados - Bug #4410: [Crunch] crunch-job should exit tempfail when a SLURM nod... Resolved 11/04/2014

History

#1 Updated by Ward Vandewege over 2 years ago

  • Target version set to Arvados Future Sprints

#2 Updated by Joshua Randall over 1 year ago

I had 14 of 400 tasks fail today because Keep was overloaded ("Connection time-out" / "Operation too slow") in the middle of a run. The keepstore log was printing many messages along the lines of "too many open files; retrying in…". I have now restarted that keepstore and everything seems OK, except that I have 14 failed tasks that I'd like to retry.

This seems to fall into a general category of unhandled system problems that can cause temporary task failures, so this story might address it in the long run (although I'm not sure by what mechanism the problem would actually be fixed, since fixing it required restarting a backend keepstore).

I guess there should also be a manual way for me to tell Crunch that some tasks should be retried, since as an admin I have corrected the (system) problem that caused them to fail?

#3 Updated by Tom Clegg 22 days ago

  • Status changed from New to Closed

#4 Updated by Tom Clegg 20 days ago

  • Target version deleted (Arvados Future Sprints)
