Bug #19917 (closed): Issues rerunning workflows with UsePreemptible changes from true to false

Added by Sarah Zaranek almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: -
Target version: 2023-02-01 sprint
Story points: 1.0
Release relationship: Auto

Description

If I run a workflow and it gets stuck on a step after a few steps have already run (all using preemptible instances), and I then kill the job, set preemptible to false, and rerun, it doesn't rerun the steps that already completed, but it still tries to use preemptible nodes.

The example was in 2xpu4, so I'm not sure I can share the workflow, but this is what Tom wrote:
"I think I see the bug... the retry-after-cancel logic uses the same scheduling_parameters as the cancelled container, even if the still-active requests that are motivating the retry all say preemptible:false."

Tom: this is not a "container reuse" problem, it is a "container retry" problem; the retry should take the scheduling parameters from the actual outstanding container requests, not from the cancelled one.


Subtasks 1 (0 open, 1 closed)

Task #19945: Review 19917-retry-scheduling-parameters (Resolved, Brett Smith, 01/19/2023)

Related issues

Related to Arvados - Bug #20606: Unstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests (Resolved, Tom Clegg, 06/27/2023)
#2

Updated by Peter Amstutz almost 2 years ago

  • Target version set to Future
#3

Updated by Peter Amstutz almost 2 years ago

  • Assigned To set to Tom Clegg
  • Subject changed from "Issues rerunning workflows with UsePreemptible changes from true to false" to "Issues rerunning workflows with UsePreemptible changes from true to false"
#4

Updated by Peter Amstutz almost 2 years ago

  • Assigned To deleted (Tom Clegg)
#5

Updated by Tom Clegg almost 2 years ago

  • Story points set to 1.0
#6

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from Future to To be scheduled
#7

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from To be scheduled to 2023-02-01 sprint
#8

Updated by Peter Amstutz almost 2 years ago

  • Assigned To set to Brett Smith
  • Description updated (diff)
#9

Updated by Brett Smith almost 2 years ago

Stepping back from the immediate bug and thinking about this more generally:

When we enter the if retryable_requests.any? branch of Container#handle_completed, there could be multiple container requests to retry, potentially with different scheduling parameters. Which should we use for the new container?

The easiest thing to do would be to pick an arbitrary one, use that, and document it.

The super-slick thing to do would be to try to synthesize scheduling parameters that meet all the current constraints. I think that would be:

{
  partitions: empty if any are empty, else the largest set,
  preemptible: true if all are true, else false,
  max_run_time: 0 if any are 0, else the maximum,
}

This is a little surprising, because I think most people's mental model of scheduling parameters is that they should make the running environment more strict, and here we're taking the least strict of all the options. But using this bug report as guidance, I think the least surprising thing to do in this case of "multiple container requests with different scheduling parameters" is to give the container the best resources anyone was willing to specify, and this is that.

… I was going to propose a middle-ground compromise, but I don't see much need; the best solution isn't so complicated that a compromise seems worthwhile.
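
For illustration only, here is a minimal Ruby sketch of the merge rules above. The merged_scheduling_parameters helper and the string-keyed scheduling_parameters hashes are assumptions made for this sketch; it is not the code on the 19917-retry-scheduling-parameters branch.

def merged_scheduling_parameters(requests)
  # Hypothetical illustration of the merge rules proposed above.
  params = requests.map { |cr| cr.scheduling_parameters || {} }

  partition_sets = params.map { |p| p["partitions"] || [] }
  merged_partitions =
    if partition_sets.any?(&:empty?)
      []                               # empty means "run on any partition"
    else
      partition_sets.max_by(&:length)  # otherwise the largest requested set
    end

  run_times = params.map { |p| p["max_run_time"] || 0 }

  {
    "partitions" => merged_partitions,
    "preemptible" => params.all? { |p| p["preemptible"] },        # true only if every request says true
    "max_run_time" => run_times.include?(0) ? 0 : run_times.max,  # 0 means unlimited
  }
end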

#10

Updated by Tom Clegg almost 2 years ago

I think in order to address the "run on non-preemptible because preemptible instances keep failing" situation reliably, we need the "preemptible: true if all are true, else false" option. And it only makes sense to do the analogous thing with the other constraints too, if it's as easy as it sounds.

(Even though we will still have a risk of getting stuck when two CRs have partitions: [A] and partitions: [B,C] and A is the only partition that's actually running anything, "empty or largest set" seems like the best we could possibly do without a ridiculous amount of scope creep.)

#11

Updated by Brett Smith almost 2 years ago

Tom Clegg wrote in #note-10:

(Even though we will still have a risk of getting stuck when two CRs have partitions: [A] and partitions: [B,C] and A is the only partition that's actually running anything, "empty or largest set" seems like the best we could possibly do without a ridiculous amount of scope creep.)

I had the thought that we could take the union of all partition requests. That's probably marginally better at preventing this situation. Again, it feels weird, but on reflection I can't think of a good, concrete reason not to do it.
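
A hypothetical sketch of that variant, reusing the partition_sets from the helper above: replace the "largest set" rule with a set union (again, an illustration rather than necessarily what the branch does).

merged_partitions =
  if partition_sets.any?(&:empty?)
    []                             # empty still means "run on any partition"
  else
    partition_sets.reduce([], :|)  # union of all requested partition sets
  end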

19917-retry-scheduling-parameters @ b896832bf02ded0c9142d758c2866fa4f1ec09e9 - developer-run-tests: #3452

#12

Updated by Tom Clegg almost 2 years ago

  • Status changed from New to In Progress

This LGTM, thanks!

#13

Updated by Brett Smith almost 2 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
#14

Updated by Peter Amstutz almost 2 years ago

  • Release set to 57
#15

Updated by Tom Clegg over 1 year ago

  • Related to Bug #20606: Unstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests added