Story #3795

[Crunch/SDKs] Tasks need more retry support

Added by Brett Smith over 5 years ago. Updated almost 3 years ago.


crunch-job currently retries tasks under a couple of conditions:

  • The task exits with the specific "temporary failure" exit code 111.
  • crunch-job sees errors from srun that suggest that the problem lies with the node rather than the script.
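The retry decision described above can be sketched as follows. This is a minimal, hypothetical Python sketch (crunch-job itself is Perl) of the exit-code-111 convention only; the srun-error heuristics are not modeled, and the function name is illustrative.

```python
import subprocess

TEMPFAIL = 111  # conventional "temporary failure" exit code crunch-job retries

def run_task_with_retries(cmd, max_attempts=3):
    """Re-run cmd while it signals a temporary failure via exit code 111.

    Hypothetical sketch: crunch-job's real logic also inspects srun
    output for node-level errors, which is not modeled here.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode != TEMPFAIL:
            return result.returncode  # success, or a permanent failure
        print(f"attempt {attempt} reported a temporary failure; retrying")
    return TEMPFAIL  # still tempfailing after max_attempts
```

Any other exit code, zero or non-zero, passes through unchanged, so permanent failures are surfaced immediately.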

Recently we've seen a few temporary failure conditions that are not covered by these cases:

  • arv-mount fails to contact Keep, causing the Task to fail to read data early on.
  • Docker fails to start the container because of an intermittent bug (#3433).

We need to do more to ensure that crunch-job retries Tasks whenever it is appropriate.

Different parts of this problem could be addressed in different ways. For example, since we can't modify Docker, crunch-job could inspect Docker's stderr for messages that indicate an intermittent failure.
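Such stderr introspection might look like the sketch below. The error patterns here are hypothetical placeholders; the real list would come from messages observed in intermittent Docker failures such as #3433.

```python
import re

# Hypothetical patterns; the real set would be built from stderr messages
# observed during intermittent Docker failures (e.g. #3433).
TRANSIENT_DOCKER_ERRORS = [
    re.compile(r"Cannot connect to the Docker daemon"),
    re.compile(r"devmapper: .*failed", re.IGNORECASE),
]

def looks_transient(stderr_text):
    """Return True if Docker's stderr suggests an intermittent failure
    worth retrying, rather than a bug in the user's script."""
    return any(p.search(stderr_text) for p in TRANSIENT_DOCKER_ERRORS)
```

crunch-job could then treat a non-zero Docker exit as a temporary failure whenever `looks_transient` matches the captured stderr.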

Code that's under our control, like arv-mount, could be made to conform to crunch-job's expectations. For example, arv-mount could exit 111 if it encounters trouble talking to Keep.
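That convention could be implemented along these lines. The exception class and function names are hypothetical stand-ins, not arv-mount's actual internals:

```python
import sys

TEMPFAIL = 111  # the exit code crunch-job already treats as "retry me"

class KeepTemporaryError(Exception):
    """Hypothetical stand-in for the exception arv-mount would see
    when it cannot reach Keep."""

def exit_code_for(exc):
    """Map an exception to the exit code arv-mount should use."""
    if isinstance(exc, KeepTemporaryError):
        return TEMPFAIL   # temporary: crunch-job will retry the Task
    return 1              # permanent failure: do not retry

def run_mount(mount_fn):
    """Run the (hypothetical) mount entry point, translating Keep
    trouble into exit(111) so crunch-job retries the Task."""
    try:
        mount_fn()
    except Exception as exc:
        print(f"arv-mount: {exc}", file=sys.stderr)
        sys.exit(exit_code_for(exc))
```

The key point is that only failures known to be temporary map to 111; everything else keeps a generic non-zero status so crunch-job does not retry genuine bugs.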

Putting more retry support in the underlying tools, like the Python SDK (#3147), would also reduce the need for retries at the crunch-job level.
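SDK-level retry support might take the shape below. This is a sketch of the kind of mechanism #3147 proposes; the function name, the choice of `IOError` as the "temporary" exception, and the backoff policy are all illustrative, not the SDK's actual API.

```python
import time

def with_retries(fn, attempts=3, backoff=1.0):
    """Call fn, retrying on a transient exception with exponential backoff.

    Illustrative sketch only: a real SDK would retry specific API/Keep
    error classes, not bare IOError, and would cap total wait time.
    """
    for i in range(attempts):
        try:
            return fn()
        except IOError:               # stand-in for temporary API/Keep errors
            if i == attempts - 1:
                raise                 # out of attempts: propagate the failure
            time.sleep(backoff * (2 ** i))  # 1s, 2s, 4s, ... between tries
```

With retries absorbed inside the SDK, many network blips never surface as Task failures in the first place.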


Task #3801: crunch-job should treat intermittent Docker failures as temporary failures (Status: New)

Task #4321: [SDK] PySDK should arrange exit(111) when a temporary failure exception is uncaught (Status: New)

Related issues

Related to Arvados - Bug #3147: [SDKs] Python clients should automatically retry failed API and Keep requests (including timeouts), in order to survive temporary outages like server restarts and network blips. (Resolved 08/22/2014)

Related to Arvados - Bug #4410: [Crunch] crunch-job should exit tempfail when a SLURM node fails (Resolved 11/04/2014)


#1 Updated by Ward Vandewege over 5 years ago

  • Target version set to Arvados Future Sprints

#2 Updated by Joshua Randall about 4 years ago

I had 14/400 tasks fail today because of a problem with Keep being overloaded ("Connection time-out" / "Operation too slow") in the middle of a run. The keepstore log was printing many messages along the lines of "too many open files; retrying in…". I have now restarted that keepstore and everything seems OK, except that I have 14 failed tasks that I'd like to retry.

This seems to fall into a general category of unhandled system problems that can cause temporary task failures, so this story might address it in the long run (although I'm not sure by what mechanism the problem would actually be fixed, since it required restarting a backend keepstore).

I guess there should also be a manual way to tell Crunch that some tasks should be retried, for cases where I, as an admin, have corrected the (system) problem that caused them to fail.

#3 Updated by Tom Clegg almost 3 years ago

  • Status changed from New to Closed

#4 Updated by Tom Clegg almost 3 years ago

  • Target version deleted (Arvados Future Sprints)
