Bug #20606

Updated by Tom Clegg over 1 year ago

To reproduce: 
 * Wait for a time when there are no preemptible instances available in the size needed by your workflow 
 * Start workflow with preemptible instances enabled 
 * Wait for the workflow to create a lot of child container requests 
 * Cancel the workflow (this might be optional: the workflow might cancel by itself when a child container reaches maximum lock/unlock cycles) 
 * Wait for all child containers to get cancelled 
 * Restart the workflow with preemptible instances disabled 
 * Wait for the workflow to create child container requests 
 * The child container requests reuse the existing (queued, priority 0) containers with preemptible:true, and the preemptible instances still aren't available, so they continue to fail after a few lock/unlock cycles 

 In #19917 we ensured a new container would be scheduled with preemptible:false in this case. However, that doesn't help at all if the container requests aren't going to auto-retry because container_count:1. 

 Ideally, if we have a pair of requests (preemptible:true and preemptible:false) and a queued container with preemptible:true, it is probably better to start a container with preemptible:false and use it for both requests, although I think this will be inconvenient to implement. 

 The easier situation, which we encountered today, has a container that matches the reuse criteria but has preemptible:true _and_ isn't about to run because it has priority 0 (i.e., whatever CR requested it has since been cancelled/failed). In this situation we should have created a new container with preemptible:false instead of using the existing one. 

 Another easy change: when a request has preemptible:false, don't reuse a container with preemptible:true that is in Queued or Locked state, even if it has priority>0, because there's a relatively high likelihood it will fail (especially considering the common pattern of "start non-preemptible workflow because preemptible workflow is not getting anywhere"). I think it's OK if this is wasteful in the (less common?) cases where the preemptible containers are running well. 
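
 The two easier rules above could be sketched roughly as follows. This is only an illustration of the proposed decision logic, not the API server's actual reuse code: the @Container@ class, the @may_reuse@ function, and the @skip_unstarted@ flag are hypothetical names, and the candidate is assumed to already match the usual reuse criteria.

```python
from dataclasses import dataclass


# Hypothetical, simplified model of a container record (assumption:
# only the three fields relevant to this bug are represented).
@dataclass
class Container:
    state: str         # e.g. "Queued", "Locked", "Running"
    priority: int      # 0 means no live request currently wants it
    preemptible: bool  # scheduling_parameters preemptible flag


def may_reuse(request_preemptible: bool, c: Container,
              skip_unstarted: bool = True) -> bool:
    """Decide whether a request may reuse candidate container c.

    Assumes c already satisfies the normal reuse criteria.
    """
    # A preemptible request, or a non-preemptible candidate, is unaffected.
    if request_preemptible or not c.preemptible:
        return True
    # Rule 1: a queued preemptible container with priority 0 isn't about
    # to run, so a non-preemptible request should get a new container.
    if c.state == "Queued" and c.priority == 0:
        return False
    # Rule 2 (optional): also skip preemptible containers that haven't
    # started yet, even with priority > 0, since they are likely to fail.
    if skip_unstarted and c.state in ("Queued", "Locked"):
        return False
    return True
```

 With both rules enabled, a non-preemptible request would only share a preemptible container that has already started; the cost is an occasional duplicate container when the preemptible one would have run fine.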
