Feature #19974
Option to re-submit preemptible jobs to reserved nodes when previous attempt was interrupted
Status: open
% Done: 0%
Related issues
Updated by Brett Smith 11 months ago
For both this and #19975, I feel like we need to grapple with and make the big decision about whether we want retry logic to be implemented server-side or client-side. I really would rather pick one and implement one than have a weird mix where some kinds of retries happen in one place and others happen in another. Personally I'm leaning towards client-side just because it's easier to implement a wider variety of retry strategies there, but I don't feel too strongly about it.
If we decide to go client-side, then the stories look more like:
- API container records have more information about a container's end state
- Crunch records that information in the API server
- arvados-cwl-runner recognizes various CWL extensions to provide different retry strategies. (This is multiple stories, one per strategy, and they're probably not all equally urgent.)
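To make the client-side option concrete, here is a rough Python sketch of the strategy this ticket asks for: if a preemptible attempt was interrupted, resubmit once on a non-preemptible (reserved) node. Everything in it is an assumption for illustration. The `end_reason` values, the `Attempt` record, and the `submit` callback are hypothetical and stand in for whatever end-state information the container record ends up carrying; only the `preemptible` scheduling parameter corresponds to something that exists today. It is not how arvados-cwl-runner is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Attempt:
    """One container attempt as seen by the client (hypothetical record)."""
    preemptible: bool
    end_reason: str  # hypothetical values: "ok", "preempted", "failed"


def next_scheduling(prev: Attempt) -> Optional[Dict[str, bool]]:
    """Return scheduling parameters for a retry, or None to stop retrying."""
    # Retry only when a preemptible attempt was interrupted, and move the
    # retry to a non-preemptible node so it cannot be reclaimed again.
    if prev.preemptible and prev.end_reason == "preempted":
        return {"preemptible": False}
    return None


def run_with_retry(
    submit: Callable[[Dict[str, bool]], str],
    max_attempts: int = 2,
) -> List[Attempt]:
    """Run submit() until it ends for a non-retryable reason.

    submit is a placeholder for "create a container request and wait";
    it takes scheduling parameters and returns the end reason.
    """
    params: Optional[Dict[str, bool]] = {"preemptible": True}
    history: List[Attempt] = []
    for _ in range(max_attempts):
        reason = submit(params)
        attempt = Attempt(params["preemptible"], reason)
        history.append(attempt)
        params = next_scheduling(attempt)
        if params is None:
            break
    return history
```

Other strategies (for example, retrying with more RAM as in #19975) would be additional functions with the same shape as `next_scheduling`, which is part of why the client side is a convenient place for them.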
Updated by Peter Amstutz 11 months ago
Putting this kind of retry logic in the client is my preference as well, among other things because deploying a new arvados-cwl-runner is much lighter weight than deploying a new API server.
Updated by Peter Amstutz 10 months ago
- Related to Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit added