Bug #19917

Updated by Peter Amstutz over 1 year ago

If I run a workflow that completes a few steps and then gets stuck on a later step (all using preemptible instances), then kill the job, set preemptible to false, and rerun, it correctly skips the steps that already ran but still tries to use preemptible nodes for the stuck step.

 The example was in 2xpu4, so I'm not sure I can share the workflow, but this is what Tom wrote:
 "I think I see the bug... the retry-after-cancel logic uses the same scheduling_parameters as the cancelled container, even if the still-active requests that are motivating the retry all say preemptible:false." 

 Tom: this is not a "container reuse" problem; it is a "container retry" problem. The retry should take its scheduling parameters from the actual outstanding container requests, not from the cancelled container.
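
 To make the proposed behavior concrete, here is a minimal sketch in Python. It is not the actual Arvados code: the `ContainerRequest` class and the `retry_scheduling_parameters` function are hypothetical names invented for illustration, and the rule that a retry is preemptible only when every outstanding request allows it is an assumption about how conflicting requests might be combined. The point is only that the cancelled container's own parameters are never consulted.

```python
from dataclasses import dataclass, field


@dataclass
class ContainerRequest:
    """Hypothetical stand-in for a container request record."""
    uuid: str
    state: str  # assumed: "Committed" means the request is still outstanding
    scheduling_parameters: dict = field(default_factory=dict)


def retry_scheduling_parameters(requests):
    """Derive scheduling parameters for a retried container.

    Only the still-outstanding requests are considered. The parameters of
    the cancelled container are deliberately not an input here.
    Returns None when nothing is outstanding, meaning no retry is needed.
    """
    active = [r for r in requests if r.state == "Committed"]
    if not active:
        return None
    # Assumption: use preemptible nodes only if every outstanding request
    # allows it; a single preemptible:false request forces non-preemptible.
    preemptible = all(
        r.scheduling_parameters.get("preemptible", False) for r in active
    )
    return {"preemptible": preemptible}


# The first run (now finalized) asked for preemptible nodes; the rerun does not.
requests = [
    ContainerRequest("cr-old", "Final", {"preemptible": True}),
    ContainerRequest("cr-new", "Committed", {"preemptible": False}),
]
retry_params = retry_scheduling_parameters(requests)
```

 In this sketch the retried container would be scheduled with `preemptible` set to false, because the only outstanding request says so, regardless of what the cancelled container used.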
