Option to re-submit preemptible jobs to reserved nodes when previous attempt was interrupted
Updated by Brett Smith 2 months ago
For both this and #19975, I feel like we need to make the big decision about whether retry logic should be implemented server-side or client-side. I would really rather pick one and implement it than have a weird mix where some kinds of retries happen in one place and others happen in another. Personally I'm leaning towards client-side, just because it's easier to implement a wider variety of retry strategies there, but I don't feel too strongly about it.
If we decide to go client-side, then the stories look more like:
- API container records have more information about a container's end state
- Crunch records that information in the API server
- arvados-cwl-runner recognizes various CWL extensions to provide different retry strategies. (This is multiple stories, one per strategy, and they're probably not all equally urgent.)
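To make the client-side option concrete, here is a minimal sketch of one such strategy: resubmit to a non-preemptible (reserved) node when the previous attempt was interrupted. Everything in it is an assumption for illustration. `Attempt`, `EndState`, the `"preempted"` reason string, `next_attempt`, and `run_with_retry` are hypothetical names, not existing Arvados or arvados-cwl-runner APIs, and `submit` stands in for whatever the client would use to run a container and read back its end state.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Attempt:
    # Hypothetical: whether this attempt asks for a preemptible (spot) node.
    preemptible: bool


@dataclass(frozen=True)
class EndState:
    # Hypothetical end-state info the API container record would need to expose.
    succeeded: bool
    reason: Optional[str] = None  # e.g. "preempted" (assumed value)


def next_attempt(prev: Attempt, end: EndState) -> Optional[Attempt]:
    """Decide what to submit next, or None to stop retrying."""
    if end.succeeded:
        return None
    # Only this one strategy is sketched: a preempted spot attempt is
    # resubmitted once as non-preemptible. Other failures are not retried.
    if prev.preemptible and end.reason == "preempted":
        return Attempt(preemptible=False)
    return None


def run_with_retry(
    submit: Callable[[Attempt], EndState],
    first: Attempt,
    max_attempts: int = 3,
) -> List[Tuple[Attempt, EndState]]:
    """Run attempts until success, no applicable strategy, or the cap is hit."""
    history: List[Tuple[Attempt, EndState]] = []
    attempt: Optional[Attempt] = first
    while attempt is not None and len(history) < max_attempts:
        end = submit(attempt)
        history.append((attempt, end))
        attempt = next_attempt(attempt, end)
    return history
```

The point of the sketch is that the decision function only needs the end-state information from the first two stories. Each additional strategy would be another branch (or another function) on the client, with no server change.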
Updated by Peter Amstutz 2 months ago
Putting this kind of retry logic in the client is my preference as well. Among other things, deploying a new arvados-cwl-runner is much lighter weight than deploying a new API server.
Updated by Peter Amstutz about 1 month ago
- Related to Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit added