Idea #3795
closed [Crunch/SDKs] Tasks need more retry support
Description
crunch-job currently retries tasks under two conditions:
- The task exits with the specific "temporary failure" exit code 111.
- crunch-job sees errors from srun that suggest that the problem lies with the node rather than the script.
Recently we've seen a few temporary failure conditions that are not covered by these cases:
- arv-mount fails to contact Keep, causing the Task to fail to read data early on.
- Docker fails to start the container because of an intermittent bug (#3433).
We need to do more to ensure that crunch-job retries Tasks whenever a retry is appropriate.
Different parts of this problem could be addressed in different ways. For example, we can't modify Docker, so maybe crunch-job should inspect Docker's stderr for messages that indicate an intermittent failure.
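A rough sketch of what that retry decision might look like, combining the existing exit-111 rule with a stderr check. This is written in Python purely for illustration and is not crunch-job's actual code. The function name `should_retry` and the pattern list are hypothetical, and the pattern itself is a placeholder: this ticket doesn't record the messages Docker prints for the intermittent bug, so real patterns would have to come from observed failures.

```python
import re

# Exit code that crunch-job already treats as "temporary failure".
EXIT_TEMPFAIL = 111

# Placeholder pattern only (assumption): substitute messages actually
# observed in Docker's stderr when the intermittent failure occurs.
TRANSIENT_STDERR_PATTERNS = [
    re.compile(r"placeholder intermittent docker failure", re.IGNORECASE),
]


def should_retry(exit_code, stderr_text, patterns=TRANSIENT_STDERR_PATTERNS):
    """Return True if a failed task looks like a temporary failure."""
    if exit_code == 0:
        # The task succeeded, so there is nothing to retry.
        return False
    if exit_code == EXIT_TEMPFAIL:
        # The task itself reported a temporary failure.
        return True
    # Otherwise retry only if stderr matches a known transient message.
    return any(pattern.search(stderr_text) for pattern in patterns)
```

Keeping the patterns in a single list would make it cheap to add new messages as we find them, without touching the decision logic.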
Code that's under our control, like arv-mount, could be made to conform to crunch-job's expectations. For example, maybe arv-mount should exit 111 if it encounters trouble talking to Keep.
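A minimal sketch of that exit-code convention, assuming the tool can tell Keep communication errors apart from other failures. `KeepConnectionError` and `exit_code_for` are hypothetical names used for illustration; they are not part of arv-mount or the SDK.

```python
import sys

# Exit code that crunch-job already treats as "temporary failure".
EXIT_TEMPFAIL = 111


class KeepConnectionError(Exception):
    """Hypothetical stand-in for whatever error signals trouble talking to Keep."""


def exit_code_for(run):
    """Call run() and map its outcome to a process exit code."""
    try:
        run()
    except KeepConnectionError as error:
        # Report the problem, then signal a temporary failure so that
        # the caller (crunch-job) knows the task is worth retrying.
        print("temporary failure: %s" % error, file=sys.stderr)
        return EXIT_TEMPFAIL
    return 0
```

The tool's entry point would then end with something like `sys.exit(exit_code_for(main))`. Any other exception propagates as usual, so genuine bugs still surface as ordinary failures instead of being retried.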
Putting more retry support in the underlying tools, like the Python SDK (#3147), would also reduce how often crunch-job needs to retry a whole Task.
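For illustration, retry support inside a tool could be as simple as the sketch below: retry an operation a bounded number of times, waiting longer after each failure. This is not the Python SDK's API. `TemporaryError`, `call_with_retries`, and all of the parameters are assumptions made up for this example.

```python
import time


class TemporaryError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, tries=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call operation() up to `tries` times, sleeping between attempts.

    The wait starts at `delay` seconds and is multiplied by `backoff`
    after every failed attempt. The last failure is re-raised.
    """
    if tries < 1:
        raise ValueError("tries must be at least 1")
    for attempt in range(1, tries + 1):
        try:
            return operation()
        except TemporaryError:
            if attempt == tries:
                # Out of attempts: let the caller see the failure.
                raise
            sleep(delay)
            delay *= backoff
```

If the operation still fails after the last attempt, the error reaches the caller, where the tool could fall back to exiting 111 so that crunch-job retries the Task as a whole.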