Idea #3795

closed

[Crunch/SDKs] Tasks need more retry support

Added by Brett Smith over 9 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Start date: 09/03/2014
Due date:
Story points: -

Description

crunch-job currently retries tasks under two conditions:

  • The task exits with the specific "temporary failure" exit code 111.
  • crunch-job sees errors from srun that suggest that the problem lies with the node rather than the script.
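The two retry conditions above can be sketched as a single decision function. This is a minimal illustration in Python, not crunch-job's actual code: the function name `should_retry` and the entries in `NODE_FAILURE_PATTERNS` are hypothetical placeholders, since this ticket does not list the srun messages crunch-job looks for.

```python
TEMPFAIL_EXIT_CODE = 111  # the "temporary failure" exit code named in this ticket

# Hypothetical placeholder substrings; the real srun messages that
# crunch-job checks for are not listed in this ticket.
NODE_FAILURE_PATTERNS = ("node failure", "node down")


def should_retry(exit_code, srun_stderr="", patterns=NODE_FAILURE_PATTERNS):
    """Return True if a finished task looks like a temporary failure."""
    # Condition 1: the task itself reported a temporary failure.
    if exit_code == TEMPFAIL_EXIT_CODE:
        return True
    # Condition 2: srun's stderr suggests the node, not the script, is at fault.
    stderr = srun_stderr.lower()
    return any(pattern in stderr for pattern in patterns)
```

Keeping the patterns as a parameter lets the list of node-level errors grow without changing the decision logic.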

Recently we've seen a few temporary failure conditions that are not covered by these cases:

  • arv-mount fails to contact Keep, causing the Task to fail to read data early on.
  • Docker fails to start the container because of an intermittent bug (#3433).

We need to do more to ensure that crunch-job retries Tasks whenever a retry is appropriate.

Different parts of this problem could be addressed in different ways. For example, we can't modify Docker, so maybe crunch-job should inspect Docker's stderr for messages that indicate an intermittent failure.
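One way to picture the stderr check is a small classifier that matches each line of captured output against a list of known-intermittent patterns. This is only a sketch in Python: `is_intermittent_docker_failure` is a hypothetical name, and the pattern in `EXAMPLE_PATTERNS` is a made-up placeholder, not a real Docker message. The actual messages would have to be collected from failures like the one in #3433.

```python
import re


def is_intermittent_docker_failure(stderr_text, patterns):
    """Return True if any stderr line matches one of the given regexes."""
    compiled = [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    return any(
        regex.search(line)
        for line in stderr_text.splitlines()
        for regex in compiled
    )


# Placeholder pattern for illustration only; not a real Docker message.
EXAMPLE_PATTERNS = [r"placeholder intermittent error"]
```

A caller would treat a match as a temporary failure and retry the task; anything that does not match stays a permanent failure, so an incomplete pattern list errs on the side of not retrying.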

Code that's under our control, like arv-mount, could be made to conform to crunch-job's expectations. For example, maybe arv-mount should exit 111 if it encounters trouble talking to Keep.
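A tool could conform to the exit-111 convention with a thin wrapper around its entry point. The sketch below assumes a hypothetical `TemporaryFailure` exception type and a hypothetical `run_with_tempfail` helper; neither is an existing arv-mount or SDK name.

```python
import sys

TEMPFAIL_EXIT_CODE = 111  # the "temporary failure" exit code crunch-job retries on


class TemporaryFailure(Exception):
    """Hypothetical exception for errors that are worth retrying."""


def run_with_tempfail(main):
    """Run main(); exit 111 if it raises TemporaryFailure."""
    try:
        main()
    except TemporaryFailure as error:
        print("temporary failure: {}".format(error), file=sys.stderr)
        sys.exit(TEMPFAIL_EXIT_CODE)
```

Any other exception propagates unchanged, so permanent errors still produce a normal non-zero exit and are not retried.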

Putting more retry support in the underlying tools, like the Python SDK (#3147), can mitigate some of the need for this too.
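Retry support inside a tool might look like the following generic helper. It is a sketch under stated assumptions, not the Python SDK's implementation from #3147: `call_with_retries`, its parameters, and the doubling delay are all illustrative choices.

```python
import time


def call_with_retries(func, retryable=(IOError,), tries=3, delay=1.0,
                      sleep=time.sleep):
    """Call func(), retrying on the given exceptions with a doubling delay."""
    for attempt in range(tries):
        try:
            return func()
        except retryable:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == tries - 1:
                raise
            sleep(delay * (2 ** attempt))
```

If requests are retried at this level, a short outage never surfaces as a task failure, and crunch-job only has to handle the cases where retries are exhausted.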


Subtasks 2 (2 open, 0 closed)

Task #3801: crunch-job should treat intermittent Docker failures as temporary failures (New, 09/03/2014)
Task #4321: [SDK] PySDK should arrange exit(111) when a temporary failure exception is uncaught (New, 10/27/2014)

Related issues

Related to Arvados - Bug #3147: [SDKs] Python clients should automatically retry failed API and Keep requests (including timeouts), in order to survive temporary outages like server restarts and network blips. (Resolved, Brett Smith, 08/22/2014)
Related to Arvados - Bug #4410: [Crunch] crunch-job should exit tempfail when a SLURM node fails (Resolved, Brett Smith, 11/04/2014)