Unstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests
- Wait for a time when there are no preemptible instances available in the size needed by your workflow
- Start workflow with preemptible instances enabled
- Wait for the workflow to create a lot of child container requests
- Cancel the workflow (this might be optional: the workflow might cancel by itself when a child container reaches maximum lock/unlock cycles)
- Wait for all child containers to get cancelled
- Restart the workflow with preemptible instances disabled
- Wait for the workflow to create child container requests
- Result: the child container requests reuse the existing (queued, priority 0) containers with preemptible:true. Preemptible instances still aren't available, so the containers keep failing after a few lock/unlock cycles.
In #19917 we ensured a new container would be scheduled with preemptible:false in this case. However, that doesn't help if the container requests won't auto-retry because they have container_count:1.
Ideally, if we have a pair of requests (preemptible:true and preemptible:false) and a queued container with preemptible:true, it would probably be better to start a single container with preemptible:false and use it for both requests, although I think this would be inconvenient to implement.
The easier situation, which we encountered today, has a container that matches the reuse criteria but has preemptible:true and isn't about to run because it has priority 0 (i.e., whichever container request asked for it has since been cancelled or has failed). In this situation we should create a new container with preemptible:false instead of using the existing one.
Another easy change: when a request has preemptible:false, don't reuse a container with preemptible:true that is in Queued or Locked state, even if it has priority>0, because there's a relatively high likelihood it will fail (especially considering the common pattern of "start non-preemptible workflow because preemptible workflow is not getting anywhere"). I think it's OK if this is wasteful in the (less common?) case of a race while preemptible containers are running well.
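A minimal sketch of the two easier rules as a single predicate, in Python for illustration only. This is not the actual reuse code: the `Container` class and `can_reuse` function are hypothetical names, and the sketch assumes the candidate container already matches the usual reuse criteria. Since the second rule applies to Queued/Locked containers regardless of priority, it also covers the priority 0 case.

```python
from dataclasses import dataclass


@dataclass
class Container:
    """Hypothetical, simplified view of a candidate container."""
    state: str         # e.g. "Queued", "Locked", "Running"
    priority: int      # 0 means no live request wants it any more
    preemptible: bool  # value of preemptible in its scheduling parameters


def can_reuse(container: Container, request_preemptible: bool) -> bool:
    """Decide whether a request may reuse a container that already
    matches the normal reuse criteria (assumed to be checked elsewhere)."""
    # A preemptible request, or a non-preemptible container, is
    # unaffected by the rules proposed here.
    if request_preemptible or not container.preemptible:
        return True
    # Non-preemptible request, preemptible container: don't reuse it if
    # it hasn't started yet. This applies whether priority is 0 (the
    # "easier situation") or >0 ("another easy change").
    if container.state in ("Queued", "Locked"):
        return False
    return True


# The case from the report: queued, priority 0, preemptible container
# offered to a preemptible:false request.
print(can_reuse(Container("Queued", 0, True), request_preemptible=False))
```

When `can_reuse` returns false, the expectation is that a new container with preemptible:false gets created instead, which may be wasteful if the preemptible container would have started after all.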
This covers the "easier situation" and "another easy change" (see description).
"Ideally ... although inconvenient to implement" (i.e., in the "another easy change" case, also cancel the preemptible container and reassign the new non-preemptible container to the pending preemptible request) is not covered.