Feature #19974

open

Option to re-submit preemptible jobs to reserved nodes when previous attempt was interrupted

Added by Peter Amstutz about 1 year ago. Updated 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Story points: -

Related issues

Related to Arvados - Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit (In Progress, Alex Coleman)

#1

Updated by Brett Smith about 1 year ago

For both this and #19975, I feel like we need to grapple with and make the big decision about whether we want retry logic to be implemented server-side or client-side. I really would rather pick one and implement one than have a weird mix where some kinds of retries happen in one place and others happen in another. Personally I'm leaning towards client-side, just because it's easier to implement a wider variety of retry strategies there, but I don't feel too strongly about it.

If we decide to go client-side, then the stories look more like:

  • API container records have more information about a container's end state
  • Crunch records that information in the API server
  • arvados-cwl-runner recognizes various CWL extensions to provide different retry strategies. (This is multiple stories, one per strategy, and they're probably not all equally urgent.)
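
As a rough illustration of the client-side option, here is a minimal Python sketch of how one retry strategy (resubmit a preempted job to a reserved node) might be driven by a container's recorded end state. Everything named here (`EndState`, `ContainerResult`, `Attempt`, `resubmit_on_reserved`, `run_with_retries`, and the `submit` callable) is hypothetical. None of it is an existing Arvados or arvados-cwl-runner API, and the real shape of the end-state information is still undecided.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class EndState(Enum):
    """Hypothetical end states a container record might expose."""
    COMPLETED = "completed"
    FAILED = "failed"
    PREEMPTED = "preempted"  # e.g. the spot instance was reclaimed


@dataclass
class ContainerResult:
    """Hypothetical summary of how a container ended."""
    end_state: EndState


@dataclass
class Attempt:
    """Hypothetical scheduling request for one submission."""
    preemptible: bool


# A strategy takes (attempt number, last result) and returns the next
# Attempt to submit, or None to stop retrying.
Strategy = Callable[[int, ContainerResult], Optional[Attempt]]


def resubmit_on_reserved(max_attempts: int = 3) -> Strategy:
    """Retry only interrupted jobs, and move them to a reserved node."""
    def next_attempt(attempt_number: int, result: ContainerResult) -> Optional[Attempt]:
        if result.end_state is not EndState.PREEMPTED:
            return None  # success or ordinary failure: not this strategy's job
        if attempt_number >= max_attempts:
            return None  # give up after the configured number of attempts
        return Attempt(preemptible=False)
    return next_attempt


def run_with_retries(
    submit: Callable[[Attempt], ContainerResult],
    strategy: Strategy,
) -> List[Tuple[Attempt, ContainerResult]]:
    """Submit a preemptible attempt first, then follow the strategy."""
    attempt = Attempt(preemptible=True)
    attempt_number = 1
    history: List[Tuple[Attempt, ContainerResult]] = []
    while True:
        result = submit(attempt)
        history.append((attempt, result))
        following = strategy(attempt_number, result)
        if following is None:
            return history
        attempt = following
        attempt_number += 1


if __name__ == "__main__":
    # Stand-in for a real submission: preemptible attempts get interrupted.
    def fake_submit(attempt: Attempt) -> ContainerResult:
        if attempt.preemptible:
            return ContainerResult(EndState.PREEMPTED)
        return ContainerResult(EndState.COMPLETED)

    for attempt, result in run_with_retries(fake_submit, resubmit_on_reserved()):
        print(attempt, result.end_state.value)
```

Keeping each strategy as a small function of the recorded end state is what would let additional strategies be added in the client without changing the server, provided the API record carries enough information to tell preemption apart from other failures.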

#2

Updated by Peter Amstutz about 1 year ago

Putting this kind of retry logic in the client is my preference as well, among other things because deploying a new arvados-cwl-runner is much lighter weight than deploying a new API server.

#3

Updated by Peter Amstutz about 1 year ago

  • Related to Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit added