Feature #19974
Option to re-submit preemptible jobs to reserved nodes when previous attempt was interrupted
Status: open
% Done: 0%
Updated by Brett Smith 4 days ago
For both this and #19975, I feel like we need to make the big decision about whether retry logic should be implemented server-side or client-side. I really would rather pick one and implement it than have a weird mix where some kinds of retries happen in one place and others happen in another. Personally I'm leaning towards client-side, just because it's easier to implement a wider variety of retry strategies there, but I don't feel too strongly about it.
If we decide to go client-side, then the stories look more like:
- API container records have more information about a container's end state
- Crunch records that information in the API server
- arvados-cwl-runner recognizes various CWL extensions to provide different retry strategies. (This is multiple stories, one per strategy, and they're probably not all equally urgent.)
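As an illustration only, one client-side strategy from the stories above (resubmit to a reserved node after a preemption) might look roughly like the sketch below. Everything named here is an assumption for the sake of the example: the `end_state` field, its `"preempted"` value, the `Attempt` record, and the `submit` callback are hypothetical and do not correspond to any existing Arvados API or arvados-cwl-runner interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical end-state values; the real API field and its values
# are not defined anywhere in this ticket.
PREEMPTED = "preempted"
COMPLETE = "complete"
FAILED = "failed"


@dataclass
class Attempt:
    """Hypothetical record of one container attempt."""
    end_state: str
    preemptible: bool


def next_preemptible_flag(prev: Attempt) -> Optional[bool]:
    """Return the preemptible flag for the next attempt, or None to stop.

    Strategy: if a preemptible attempt was interrupted, retry once on a
    reserved (non-preemptible) node. Any other outcome is final.
    """
    if prev.end_state == PREEMPTED and prev.preemptible:
        return False
    return None


def run_with_retry(
    submit: Callable[[bool], Attempt],
    preemptible: bool = True,
    max_attempts: int = 2,
) -> List[Attempt]:
    """Submit a job, re-submitting according to the strategy above.

    `submit` is a hypothetical callback that runs one attempt with the
    given preemptible flag and returns its recorded end state.
    """
    attempts: List[Attempt] = []
    flag = preemptible
    for _ in range(max_attempts):
        attempt = submit(flag)
        attempts.append(attempt)
        next_flag = next_preemptible_flag(attempt)
        if next_flag is None:
            break
        flag = next_flag
    return attempts
```

Keeping the decision in a small function like `next_preemptible_flag` is one way other strategies could be swapped in without touching the submission loop, which is the flexibility argument for doing this client-side.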
Updated by Peter Amstutz 4 days ago
Putting this kind of retry logic in the client is my preference as well, in part because deploying a new arvados-cwl-runner is much lighter weight than deploying a new API server.