[Crunch/SDKs] Tasks need more retry support
crunch-job currently retries tasks under a couple of conditions:
- The task exits with the specific "temporary failure" exit code 111.
- crunch-job sees errors from srun that suggest that the problem lies with the node rather than the script.
Recently we've seen a few temporary failure conditions that are not covered by these cases:
- arv-mount fails to contact Keep, causing the Task to fail to read data early on.
- Docker fails to start the container because of an intermittent bug (#3433).
We need to do more to ensure that crunch-job retries Tasks whenever it's appropriate.
Different parts of this problem could be addressed in different ways. For example, we can't modify Docker, so maybe crunch-job should introspect Docker's stderr for messages that indicate intermittent failure.
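As a rough illustration of the stderr idea, here is a minimal sketch of a retry decision that combines the existing exit-code rule with pattern matching on captured stderr. It is written in Python for readability and is not crunch-job's actual code. The function name `should_retry` and the entries in `TRANSIENT_STDERR_PATTERNS` are hypothetical placeholders; the real patterns would have to be collected from observed Docker failures such as the one in #3433.

```python
import re

# Exit code that crunch-job already treats as a temporary failure.
TEMPFAIL_EXIT_CODE = 111

# Hypothetical placeholder patterns. These are NOT real Docker messages;
# they stand in for whatever text is actually seen in intermittent failures.
TRANSIENT_STDERR_PATTERNS = [
    re.compile(r"example-transient-error", re.IGNORECASE),
    re.compile(r"example-intermittent-startup-failure", re.IGNORECASE),
]


def should_retry(exit_code, stderr_text, patterns=TRANSIENT_STDERR_PATTERNS):
    """Return True if a finished task looks like it failed temporarily."""
    if exit_code == 0:
        # The task succeeded, so there is nothing to retry.
        return False
    if exit_code == TEMPFAIL_EXIT_CODE:
        # The task itself reported a temporary failure.
        return True
    # Otherwise, retry only if stderr matches a known transient pattern.
    return any(pattern.search(stderr_text) for pattern in patterns)
```

Keeping the patterns in one list would make it easy to add a new message whenever another intermittent failure is identified, without touching the decision logic.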
Code that's under our control like arv-mount could be made to conform to crunch-job's expectations. For example, maybe arv-mount should exit 111 if it encounters trouble talking to Keep.
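One possible shape for this, sketched under assumptions: wrap the tool's entry point so that errors considered transient become exit code 111, while everything else fails as before. `run_with_tempfail` and `KeepUnreachableError` are hypothetical names invented for this sketch; arv-mount's real entry point and exception types may look quite different.

```python
import sys

# Exit code that crunch-job treats as a temporary failure.
TEMPFAIL_EXIT_CODE = 111


class KeepUnreachableError(Exception):
    """Hypothetical stand-in for the error raised when Keep cannot be reached."""


def run_with_tempfail(main, transient_errors=(KeepUnreachableError,
                                              ConnectionError,
                                              TimeoutError)):
    """Run main() and return its exit code, mapping transient errors to 111.

    Exceptions that are not listed in transient_errors propagate unchanged,
    so permanent failures are still reported the usual way.
    """
    try:
        return main()
    except transient_errors as error:
        print(f"temporary failure: {error}", file=sys.stderr)
        return TEMPFAIL_EXIT_CODE
```

A tool would then end with something like `sys.exit(run_with_tempfail(main))`, where `main` returns the normal exit code.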
Putting more retry support in the underlying tools, like the Python SDK (#3147), can mitigate some of the need for this too.
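For the tool-level approach, a generic retry helper with exponential backoff might look like the sketch below. This is not the Python SDK's API; `retry_call` and all of its parameters are hypothetical, and the choice of which exceptions count as retryable is an assumption.

```python
import time


def retry_call(operation, attempts=3, initial_delay=1.0, backoff=2.0,
               retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up.

    The wait between attempts starts at initial_delay seconds and is
    multiplied by backoff after each failure. The last failure is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: let the caller see the original error.
                raise
            sleep(delay)
            delay *= backoff
```

If a read still fails after the last attempt, the caller could fall back to exiting 111 so that crunch-job retries the whole Task.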
#2 Updated by Joshua Randall almost 2 years ago
I had 14/400 tasks fail today because Keep was overloaded ("Connection time-out" / "Operation too slow") in the middle of a run. The keepstore log was printing a lot of messages along the lines of "too many open files; retrying in…". I have restarted that keepstore and everything seems OK now, except that I have 14 failed tasks that I'd like to retry.
This seems to fall into a general category of unhandled system problems that could cause a temporary job failure, so it seems like this story might be able to address it in the long run (although I'm not sure by what mechanism the problem would actually be fixed, as it required restarting a backend keepstore).
I guess there should also be a manual way I can tell crunch that some tasks should be retried because as an admin I have corrected the (system) problem that caused them to fail?