Bug #19917


Issues rerunning workflows with UsePreemptible changes from true to false

Added by Sarah Zaranek 29 days ago. Updated 14 days ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 01/19/2023
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0
Release relationship: Auto

Description

If I run a workflow and it gets stuck on a step after a few steps have run (all using preemptible instances), then kill the job, set preemptible to false, and rerun, it correctly doesn't rerun the steps that already ran, but the remaining step still tries to use preemptible nodes.

The example was in 2xpu4, so I'm not sure I can share the workflow... but this is what Tom wrote:
"I think I see the bug... the retry-after-cancel logic uses the same scheduling_parameters as the cancelled container, even if the still-active requests that are motivating the retry all say preemptible:false."

Tom: this is not a "container reuse" problem, it is a "container retry" problem; the retry should take its scheduling parameters from the actual outstanding container requests, not from the cancelled one.


Subtasks 1 (0 open, 1 closed)

Task #19945: Review 19917-retry-scheduling-parameters - Resolved - Brett Smith - 01/19/2023

Actions #2

Updated by Peter Amstutz 29 days ago

  • Target version set to To be groomed
Actions #3

Updated by Peter Amstutz 28 days ago

  • Assigned To set to Tom Clegg
  • Subject changed from Issues rerunning workflows with UsePreemptible changes from true to false to Issues rerunning workflows with UsePreemptible changes from true to false
Actions #4

Updated by Peter Amstutz 23 days ago

  • Assigned To deleted (Tom Clegg)
Actions #5

Updated by Tom Clegg 22 days ago

  • Story points set to 1.0
Actions #6

Updated by Peter Amstutz 22 days ago

  • Target version changed from To be groomed to To be scheduled
Actions #7

Updated by Peter Amstutz 22 days ago

  • Target version changed from To be scheduled to 2023-02-01 sprint
Actions #8

Updated by Peter Amstutz 22 days ago

  • Assigned To set to Brett Smith
  • Description updated (diff)
Actions #9

Updated by Brett Smith 21 days ago

Stepping back from the immediate bug and thinking about this more generally:

When we enter the if retryable_requests.any? branch of Container#handle_completed, there could potentially be multiple container requests to retry, with potentially different scheduling parameters. Which should we use for the new container?

The easiest thing to do would be to pick an arbitrary one, use that, and document it.

The super-slick thing to do would be to try to synthesize scheduling parameters that meet all the current constraints. I think that would be:

{
  partitions: empty if any are empty, else the largest set,
  preemptible: true if all are true, else false,
  max_run_time: 0 if any are 0, else the maximum,
}

This is a little surprising, because I think most people's mental model of scheduling parameters is it should make the running environment more strict, and here we're taking the least strict of all the options. But using this bug report as guidance, I think the least surprising thing to do in this case of "multiple container requests with different scheduling parameters" is to give the container the best resources anyone was willing to specify, and this is that.

… I was going to propose a middle-ground compromise, but I don't see much need: the best solution isn't complicated enough to make a compromise worthwhile.
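The merge rules above could be sketched in Ruby roughly as follows. This is an illustrative standalone function, not the actual API server code; the method name and the plain-hash representation of scheduling parameters are assumptions.

```ruby
# Hypothetical sketch: synthesize scheduling parameters that satisfy all
# still-active container requests motivating a retry.
def merge_scheduling_parameters(params_list)
  partitions = params_list.map { |p| p.fetch(:partitions, []) }
  run_times  = params_list.map { |p| p.fetch(:max_run_time, 0) }
  {
    # empty means "unrestricted", so any empty set wins; else take the largest set
    partitions: partitions.any?(&:empty?) ? [] : partitions.max_by(&:length),
    # preemptible only if every request allows it
    preemptible: params_list.all? { |p| p[:preemptible] },
    # 0 means "unlimited", so any 0 wins; else take the longest limit
    max_run_time: run_times.include?(0) ? 0 : run_times.max,
  }
end
```

For example, merging `{preemptible: true, max_run_time: 30}` with `{preemptible: false, max_run_time: 60}` yields `preemptible: false` and `max_run_time: 60`, i.e. the least strict constraint on each axis.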

Actions #10

Updated by Tom Clegg 21 days ago

I think in order to address the "run on non-preemptible because preemptible instances keep failing" situation reliably, we need the "preemptible: true if all are true, else false" option. And it only makes sense to do the analogous things with the other constraints too, if it's as easy as it sounds.

(Even though we will still have a risk of getting stuck when two CRs have partitions: [A] and partitions: [B,C] and A is the only partition that's actually running anything, "empty or largest set" seems like the best we could possibly do without a ridiculous amount of scope creep.)

Actions #11

Updated by Brett Smith 20 days ago

Tom Clegg wrote in #note-10:

(Even though we will still have a risk of getting stuck when two CRs have partitions: [A] and partitions: [B,C] and A is the only partition that's actually running anything, "empty or largest set" seems like the best we could possibly do without a ridiculous amount of scope creep.)

I had the thought that we could take the union of all partition requests. That's probably marginally better at preventing this situation. Again, it feels weird, but on reflection I can't think of a good, concrete reason not to do it.
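The union-of-partitions idea could look roughly like this (again an illustrative sketch, not the merged code): a container retried for requests asking for partitions [A] and [B, C] would be allowed to run on any of A, B, or C.

```ruby
# Hypothetical sketch: union of the partition constraints from all
# still-active container requests.
def merged_partitions(params_list)
  sets = params_list.map { |p| p.fetch(:partitions, []) }
  # empty means "unrestricted", so any empty set makes the result unrestricted
  return [] if sets.any?(&:empty?)
  sets.reduce(:|)  # Array#| is set union, preserving first-appearance order
end
```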

19917-retry-scheduling-parameters @ b896832bf02ded0c9142d758c2866fa4f1ec09e9 - developer-run-tests: #3452

Actions #12

Updated by Tom Clegg 20 days ago

  • Status changed from New to In Progress

This LGTM, thanks!

Actions #13

Updated by Brett Smith 20 days ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #14

Updated by Peter Amstutz 14 days ago

  • Release set to 57