Bug #19917
Closed: Issues rerunning workflows with UsePreemptible changes from true to false
Description
If I run a workflow and it gets stuck on a step after a few steps have already run (all using preemptible instances), then I kill the job, set preemptible to false, and rerun, it doesn't rerun the steps that already ran, but it still tries to use preemptible nodes.
The example was in 2xpu4, so I'm not sure I can share the wf... but this is what Tom wrote:
"I think I see the bug... the retry-after-cancel logic uses the same scheduling_parameters as the cancelled container, even if the still-active requests that are motivating the retry all say preemptible:false."
Tom: this is not a "container reuse" problem, it is a "container retry" problem; the retry should take its scheduling parameters from the actual outstanding container requests, not from the cancelled one.
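To make the diagnosis concrete, here is a minimal sketch of the fix direction: derive the retry's scheduling parameters from the still-active container requests instead of copying them from the cancelled container. This is a hypothetical illustration only, not the Arvados source; the method name and hash shape are assumptions.

    # Hypothetical sketch only -- not the actual Arvados implementation.
    # Assumes each object carries a scheduling_parameters hash like
    # {"partitions" => [...], "preemptible" => true/false, "max_run_time" => seconds}.
    def scheduling_parameters_for_retry(cancelled_container, active_requests)
      if active_requests.empty?
        # Nothing outstanding to consult; keep the old behavior as a fallback.
        cancelled_container["scheduling_parameters"]
      else
        # Consult an outstanding request (or merge them all; see the merge
        # sketch further down) rather than the cancelled container.
        active_requests.first["scheduling_parameters"]
      end
    end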
Updated by Sarah Zaranek almost 2 years ago
If you have access, the wf is here: https://workbench2.2xpu4.arvadosapi.com/processes/2xpu4-xvhdp-ilx0iftqfyzsyuu
Updated by Peter Amstutz almost 2 years ago
- Assigned To set to Tom Clegg
- Subject changed from Issues rerunning workflows with UsePreemptible changes from true to false to Issues rerunning workflows with UsePreemptible changes from true to false
Updated by Peter Amstutz almost 2 years ago
- Target version changed from Future to To be scheduled
Updated by Peter Amstutz almost 2 years ago
- Target version changed from To be scheduled to 2023-02-01 sprint
Updated by Peter Amstutz almost 2 years ago
- Assigned To set to Brett Smith
- Description updated (diff)
Updated by Brett Smith almost 2 years ago
Stepping back from the immediate bug and thinking about this more generally:
When we enter the if retryable_requests.any? branch of Container#handle_completed, there could potentially be multiple container requests to retry, with potentially different scheduling parameters. Which should we use for the new container?
The easiest thing to do would be to pick an arbitrary one, use that, and document it.
The super-slick thing to do would be to try to synthesize scheduling parameters that meet all the current constraints. I think that would be:
- partitions: empty if any are empty, else the largest set
- preemptible: true if all are true, else false
- max_run_time: 0 if any are 0, else the maximum
This is a little surprising, because I think most people's mental model of scheduling parameters is that they should make the running environment more strict, and here we're taking the least strict of all the options. But using this bug report as guidance, I think the least surprising thing to do in this case of "multiple container requests with different scheduling parameters" is to give the container the best resources anyone was willing to specify, and this is that.
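As a rough illustration of that rule, here is a hedged sketch in Ruby; the helper name and hash keys are assumptions for this note, not the code that was eventually committed.

    # Sketch of the "least strict" merge described above; not the actual commit.
    def merge_scheduling_parameters(params_list)
      partitions    = params_list.map { |p| p.fetch("partitions", []) }
      preemptible   = params_list.map { |p| p.fetch("preemptible", false) }
      max_run_times = params_list.map { |p| p.fetch("max_run_time", 0) }
      {
        # An empty partitions list means "any partition", the least strict choice.
        "partitions" => partitions.any?(&:empty?) ? [] : partitions.max_by(&:length),
        # Only stay preemptible if every outstanding request allows it.
        "preemptible" => preemptible.all?,
        # 0 means "no limit"; otherwise allow the longest requested runtime.
        "max_run_time" => max_run_times.include?(0) ? 0 : max_run_times.max,
      }
    end

    merge_scheduling_parameters([
      {"partitions" => ["A"], "preemptible" => true, "max_run_time" => 3600},
      {"partitions" => ["B", "C"], "preemptible" => false, "max_run_time" => 0},
    ])
    # => {"partitions"=>["B", "C"], "preemptible"=>false, "max_run_time"=>0}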
… I was going to propose a middle-ground compromise, but I don't see much need; the best solution isn't complicated enough to make a compromise seem worthwhile.
Updated by Tom Clegg almost 2 years ago
I think in order to address the "run on non-preemptible because preemptible instances keep failing" situation reliably, we need the "preemptible: true if all are true, else false" option. And it only makes sense to do the analogous things with the other constraints too, if it's as easy as it sounds.
(Even though we will still have a risk of getting stuck when two CRs have partitions: [A] and partitions: [B,C] and A is the only partition that's actually running anything, "empty or largest set" seems like the best we could possibly do without a ridiculous amount of scope creep.)
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-10:
(Even though we will still have a risk of getting stuck when two CRs have partitions: [A] and partitions: [B,C] and A is the only partition that's actually running anything, "empty or largest set" seems like the best we could possibly do without a ridiculous amount of scope creep.)
I had the thought that we could take the union of all partition requests. That's probably marginally better at preventing this situation. Again, it feels weird, but on reflection I can't think of a good, concrete reason not to do it.
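A sketch of that variant, under the same assumed hash shape as above (again hypothetical, not the branch below): take the union of all requested partitions unless any request leaves the list empty.

    # Union-of-partitions variant of the merge sketch above (hypothetical).
    def merge_partitions(params_list)
      partitions = params_list.map { |p| p.fetch("partitions", []) }
      # An empty list still wins (it means "run anywhere"); otherwise union them.
      partitions.any?(&:empty?) ? [] : partitions.reduce([], :|)
    end

    merge_partitions([{"partitions" => ["A"]}, {"partitions" => ["B", "C"]}])
    # => ["A", "B", "C"]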
19917-retry-scheduling-parameters @ b896832bf02ded0c9142d758c2866fa4f1ec09e9 - developer-run-tests: #3452
Updated by Brett Smith almost 2 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|6c4c662cc5b22883ca1d4f9df9866a8b891a8e8a.
Updated by Tom Clegg over 1 year ago
- Related to Bug #20606: Unstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests added