Feature #19974

Option to re-submit preemptible jobs to reserved nodes when previous attempt was interrupted

Added by Peter Amstutz 4 days ago. Updated 4 days ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Actions #1

Updated by Brett Smith 4 days ago

For both this and #19975, I feel like we need to grapple with and make the big decision about whether we want retry logic to be implemented server-side or client-side. Because I really would rather pick one, and implement one, than have a weird mix where some kinds of retries happen one place and others happen another. Personally I'm leaning towards client-side just because it's easier to implement a wider variety of retry strategies there, but I don't feel too strongly about it.

If we decide to go client-side, then the stories look more like:

  • API container records have more information about a container's end state
  • Crunch records that information in the API server
  • arvados-cwl-runner recognizes various CWL extensions to provide different retry strategies. (This is multiple stories, one per strategy, and they're probably not all equally urgent.)
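
As a rough illustration of what one client-side strategy could look like, here is a minimal sketch of "retry a preempted job without the preemptible flag". Everything in it is hypothetical: `ContainerEndState`, `SchedulingParameters`, `next_attempt`, and the `preempted` field are made-up names for this sketch, not existing API fields or arvados-cwl-runner code, and it assumes that a non-preemptible request is what lands the job on reserved nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContainerEndState:
    """Hypothetical summary of how a container ended, as the API might report it."""
    exit_code: Optional[int]  # None if the container never produced an exit code
    preempted: bool           # assumed flag: the node was reclaimed mid-run


@dataclass
class SchedulingParameters:
    """Hypothetical subset of the scheduling options the client would submit."""
    preemptible: bool


def next_attempt(
    previous: SchedulingParameters,
    end_state: ContainerEndState,
    attempt: int,
    max_attempts: int = 3,
) -> Optional[SchedulingParameters]:
    """Return scheduling parameters for the next attempt, or None to give up.

    `attempt` is the 1-based number of the attempt that just finished.
    """
    if attempt >= max_attempts:
        return None
    # Only retry when a preemptible attempt was interrupted; on the retry,
    # drop the preemptible flag so the job is not interrupted the same way.
    if end_state.preempted and previous.preemptible:
        return SchedulingParameters(preemptible=False)
    # Any other end state (ordinary failure, success) is left to other strategies.
    return None
```

Other strategies would be separate functions with the same shape, which is part of why doing this in the client seems easier to extend.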

Actions #2

Updated by Peter Amstutz 4 days ago

Putting this kind of retry logic in the client is my preference as well, among other things because deploying a new arvados-cwl-runner is much lighter weight than deploying a new API server.
