Idea #3795


[Crunch/SDKs] Tasks need more retry support

Added by Brett Smith over 10 years ago. Updated almost 8 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Start date: 09/03/2014
Due date:
Story points: -

Description

crunch-job currently retries tasks under a couple of conditions (see the sketch after this list):

  • The task exits with the specific "temporary failure" exit code 111.
  • crunch-job sees errors from srun that suggest that the problem lies with the node rather than the script.
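
For illustration, here is a rough sketch of that retry decision (Python, purely hypothetical since crunch-job itself is Perl; the srun error markers are made-up placeholders, not the real patterns crunch-job matches):

    # Hypothetical sketch of the existing retry decision, not real crunch-job code.
    TASK_TEMPFAIL = 111  # "temporary failure" exit code

    def should_retry(task_exit_code, srun_stderr):
        """Return True if the failure looks like the node's fault, not the script's."""
        if task_exit_code == TASK_TEMPFAIL:
            return True
        # Placeholder patterns standing in for the srun errors crunch-job looks for.
        node_error_markers = ("srun: error: Communication connection failure",
                              "srun: error: Socket timed out")
        return any(marker in srun_stderr for marker in node_error_markers)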

Recently we've seen a few temporary failure conditions that are not covered by these cases:

  • arv-mount fails to contact Keep, causing the Task to fail to read data early on.
  • Docker fails to start the container because of an intermittent bug (#3433).

We need to do more to ensure that crunch-job retries Tasks whenever it is appropriate.

Different parts of this problem could be addressed in different ways. For example, we can't modify Docker, so maybe crunch-job should introspect Docker's stderr for messages that indicate intermittent failure.
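
As a rough sketch of that idea (Python for illustration; the run_docker helper and the error patterns are hypothetical, not taken from crunch-job or from the #3433 logs):

    import re
    import subprocess
    import sys

    # Hypothetical patterns; a real list would come from the failures seen in #3433.
    INTERMITTENT_DOCKER_ERRORS = [
        r"Error response from daemon: .*resource temporarily unavailable",
        r"Cannot start container .*: device or resource busy",
    ]

    def run_docker(argv):
        """Run a docker command; exit 111 if stderr matches a known intermittent
        failure, so the caller can treat the task as a temporary failure."""
        proc = subprocess.run(["docker"] + argv, capture_output=True, text=True)
        if proc.returncode != 0:
            if any(re.search(pat, proc.stderr) for pat in INTERMITTENT_DOCKER_ERRORS):
                sys.exit(111)
            sys.exit(proc.returncode)
        return proc.stdout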

Code that's under our control, like arv-mount, could be made to conform to crunch-job's expectations. For example, maybe arv-mount should exit 111 if it encounters trouble talking to Keep.
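
A minimal sketch of what that could look like in arv-mount's entry point (assuming arvados.errors.KeepRequestError covers the relevant Keep failures; run_mount is a hypothetical stand-in for the real mount-and-serve logic):

    import sys
    import arvados.errors

    def run_mount():
        """Hypothetical stand-in for arv-mount's real mount-and-serve logic."""
        raise arvados.errors.KeepRequestError("simulated Keep outage")

    def main():
        try:
            run_mount()
        except arvados.errors.KeepRequestError:
            # Trouble talking to Keep is a temporary failure as far as crunch-job
            # is concerned, so signal it with the agreed exit code 111.
            sys.exit(111)

    if __name__ == "__main__":
        main()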

Putting more retry support in the underlying tools, like the Python SDK (#3147), can mitigate some of the need for this too.
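
On the SDK side, the general shape would be retries with backoff around Keep/API calls, roughly like this (hypothetical helper; #3147 tracks the real work, and the exception list here is a guess):

    import time
    import arvados.errors

    def with_retries(fn, attempts=3, initial_delay=1.0):
        """Call fn(), retrying temporary failures with exponential backoff."""
        delay = initial_delay
        for attempt in range(attempts):
            try:
                return fn()
            except (IOError, arvados.errors.KeepRequestError):
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)
                delay *= 2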


Subtasks (2 open, 0 closed)

Task #3801: crunch-job should treat intermittent Docker failures as temporary failures (New, 09/03/2014)
Task #4321: [SDK] PySDK should arrange exit(111) when a temporary failure exception is uncaught (New, 10/27/2014)

Related issues (0 open, 2 closed)

Related to Arvados - Bug #3147: [SDKs] Python clients should automatically retry failed API and Keep requests (including timeouts), in order to survive temporary outages like server restarts and network blips. (Resolved, Brett Smith, 08/22/2014)
Related to Arvados - Bug #4410: [Crunch] crunch-job should exit tempfail when a SLURM node fails (Resolved, Brett Smith, 11/04/2014)
#1

Updated by Ward Vandewege over 10 years ago

  • Target version set to Arvados Future Sprints
#2

Updated by Joshua Randall about 9 years ago

I had 14/400 tasks fail today because of a problem with Keep being overloaded ("Connection time-out" / "Operation too slow") in the middle of a run. The keepstore log was printing a lot of messages along the lines of "too many open files; retrying in…". I have now restarted that keepstore, and everything seems OK except that I have 14 failed tasks that I'd like to retry.

This seems to fall into a general category of unhandled system problems that could cause a temporary job failure, so it seems like this story might be able to address it in the long run (although I'm not sure by what mechanism the problem would actually be fixed, as it required restarting a backend keepstore).

I guess there should also be a manual way for me to tell crunch that some tasks should be retried, because as an admin I have corrected the (system) problem that caused them to fail?
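
For what it's worth, the failed tasks can at least be enumerated from the Python SDK so an admin can see what would need re-running; this is only a hypothetical sketch (the job UUID is a placeholder, and actually re-queuing the tasks would still need support on the crunch side):

    import arvados

    # List the failed tasks for one job so an admin can review them before
    # retrying; the job UUID below is a placeholder.
    api = arvados.api('v1')
    failed = api.job_tasks().list(
        filters=[['job_uuid', '=', 'zzzzz-8i9sb-xxxxxxxxxxxxxxx'],
                 ['success', '=', False]],
        limit=100).execute()
    for task in failed['items']:
        print(task['uuid'], task['sequence'])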

#3

Updated by Tom Clegg almost 8 years ago

  • Status changed from New to Closed
#4

Updated by Tom Clegg almost 8 years ago

  • Target version deleted (Arvados Future Sprints)
