Bug #19917

closed

Issues rerunning workflows when UsePreemptible changes from true to false

Added by Sarah Zaranek about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: 1.0
Release relationship: Auto

Description

If I run a workflow and it gets stuck on a step after a few steps have run (all using preemptible instances), and I then kill the job, set preemptible to false, and rerun, it doesn't rerun the steps that already ran, but it still tries to use preemptible nodes.

The example was on 2xpu4, so I'm not sure I can share the workflow, but this is what Tom wrote:
"I think I see the bug... the retry-after-cancel logic uses the same scheduling_parameters as the cancelled container, even if the still-active requests that are motivating the retry all say preemptible:false."

Tom: this is not a "container reuse" problem; it is a "container retry" problem. The retry should take its scheduling parameters from the actual outstanding container requests, not from the cancelled container.
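
To illustrate the behavior Tom describes, here is a minimal sketch of how a retry could derive its scheduling parameters from the outstanding requests instead of copying them from the cancelled container. This is not the actual Arvados code: the `ContainerRequest` class, the `retry_scheduling_parameters` function, the "Committed" state check, and the merge rule (use a preemptible node only if every outstanding request allows it) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ContainerRequest:
    # Hypothetical stand-in for a container request record.
    uuid: str
    state: str  # assumed: "Committed" means the request is still outstanding
    scheduling_parameters: dict = field(default_factory=dict)


def retry_scheduling_parameters(cancelled_params: dict,
                                requests: list) -> Optional[dict]:
    """Choose scheduling parameters for a retry attempt (illustrative only).

    Other keys are carried over from the cancelled container, but
    "preemptible" is recomputed from the requests that still want a result.
    """
    outstanding = [r for r in requests if r.state == "Committed"]
    if not outstanding:
        # Nothing is waiting on this container, so no retry is needed.
        return None

    params = dict(cancelled_params)
    # Assumed merge rule: stay preemptible only if every outstanding
    # request explicitly allows it.
    params["preemptible"] = all(
        r.scheduling_parameters.get("preemptible", False) for r in outstanding
    )
    return params


if __name__ == "__main__":
    # The cancelled container ran with preemptible: true, but the rerun
    # request that motivates the retry says preemptible: false.
    rerun = ContainerRequest("cr-rerun", "Committed", {"preemptible": False})
    print(retry_scheduling_parameters({"preemptible": True}, [rerun]))
```

With this rule, the scenario in the description (cancel, set preemptible to false, rerun) would produce a retry with `preemptible` set to false, because the cancelled container's own setting is never consulted for that key.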


Subtasks 1 (0 open, 1 closed)

Task #19945: Review 19917-retry-scheduling-parameters (Resolved, Brett Smith, 01/19/2023)

Related issues

Related to Arvados - Bug #20606: Unstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests (Resolved, Tom Clegg, 06/27/2023)