Option to re-submit preemptible jobs to reserved nodes when previous attempt was interrupted
Updated by Brett Smith 11 months ago
For both this and #19975, I feel like we need to grapple with the big decision of whether retry logic should be implemented server-side or client-side. I would really rather pick one and implement it than have a weird mix where some kinds of retries happen in one place and others happen in another. Personally I'm leaning towards client-side, just because it's easier to implement a wider variety of retry strategies there, but I don't feel too strongly about it.
If we decide to go client-side, then the stories look more like:
- API container records have more information about a container's end state
- Crunch records that information in the API server
- arvados-cwl-runner recognizes various CWL extensions that select different retry strategies. (This is multiple stories, one per strategy, and they're probably not all equally urgent.)
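
To make the client-side option a bit more concrete, here is a rough sketch of what one such strategy could look like for this ticket's case: resubmit to a reserved (non-preemptible) node when the previous attempt was interrupted. Everything named here is made up for illustration: `EndReason`, `Attempt`, `NextRequest`, and `next_attempt` are hypothetical, not existing Arvados API fields or arvados-cwl-runner code, and the attempt limit of 3 is an arbitrary assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EndReason(Enum):
    # Hypothetical end-state values the API container record might expose.
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    PREEMPTED = "preempted"


@dataclass
class Attempt:
    # Hypothetical summary of one finished container attempt.
    end_reason: EndReason
    preemptible: bool


@dataclass
class NextRequest:
    # Hypothetical description of the next submission.
    preemptible: bool


def next_attempt(history: List[Attempt], max_attempts: int = 3) -> Optional[NextRequest]:
    """Return the next submission to make, or None to stop retrying."""
    if not history:
        # First attempt: try the cheaper preemptible node.
        return NextRequest(preemptible=True)
    last = history[-1]
    if last.end_reason is EndReason.SUCCEEDED:
        return None
    if len(history) >= max_attempts:
        # Assumed cap on total attempts; give up once it is reached.
        return None
    if last.end_reason is EndReason.PREEMPTED:
        # The previous attempt was interrupted: resubmit on a reserved node.
        return NextRequest(preemptible=False)
    # Ordinary failures are not retried in this sketch.
    return None
```

The point of the sketch is that the decision only needs the recorded end state of each attempt, which is why the first two stories above come before any strategy work. Other strategies would be different functions over the same history.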