Bug #20606
closedUnstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests
Description
- Wait for a time when there are no preemptible instances available in the size needed by your workflow
- Start workflow with preemptible instances enabled
- Wait for the workflow to create a lot of child container requests
- Cancel the workflow (this might be optional: the workflow might cancel by itself when a child container reaches maximum lock/unlock cycles)
- Wait for all child containers to get cancelled
- Restart the workflow with preemptible instances disabled
- Wait for the workflow to create child container requests
- The child container requests reuse the existing (queued, priority 0) containers with preemptible:true, and the preemptible instances still aren't available, so they continue to fail after a few lock/unlock cycles
In #19917 we ensured a new container would be scheduled with preemptible:false in this case. However, that doesn't help at all if the container requests aren't going to auto-retry because container_count:1.
Ideally, if we have a pair of requests (preemptible:true and preemptible:false) and a queued container with preemptible:true, it is probably better to start a container with preemptible:false, and use it for both reqs, although I think this will be inconvenient to implement.
The easier situation, which we encountered today, has a container that matches reuse criteria but has preemptible:true and isn't about to run because it has priority 0 (i.e., whatever CR requested it has since been cancelled/failed). In this situation we should create a new container with preemptible:false instead of using the existing one.
Another easy change: when a request has preemptible:false, don't reuse a container with preemptible:true that is in Queued or Locked state, even if it has priority>0, because there's a relatively high likelihood it will fail (especially considering the common pattern of "start non-preemptible workflow because preemptible workflow is not getting anywhere"). I think it's OK if this is wasteful in the (less common?) case of a race while preemptible containers are running well.
Related issues