Bug #20606
Status: Closed
Unstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests
Description
- Wait for a time when there are no preemptible instances available in the size needed by your workflow
- Start workflow with preemptible instances enabled
- Wait for the workflow to create a lot of child container requests
- Cancel the workflow (this might be optional: the workflow might cancel by itself when a child container reaches maximum lock/unlock cycles)
- Wait for all child containers to get cancelled
- Restart the workflow with preemptible instances disabled
- Wait for the workflow to create child container requests
- The child container requests reuse the existing (queued, priority 0) containers with preemptible:true, and the preemptible instances still aren't available, so they continue to fail after a few lock/unlock cycles
In #19917 we ensured a new container would be scheduled with preemptible:false in this case. However, that doesn't help at all if the container requests aren't going to auto-retry because they have container_count:1.
Ideally, if we have a pair of requests (preemptible:true and preemptible:false) and a queued container with preemptible:true, it is probably better to start a container with preemptible:false and use it for both requests, although I think this will be inconvenient to implement.
The easier situation, which we encountered today, has a container that matches reuse criteria but has preemptible:true and isn't about to run because it has priority 0 (i.e., whatever CR requested it has since been cancelled/failed). In this situation we should create a new container with preemptible:false instead of using the existing one.
Another easy change: when a request has preemptible:false, don't reuse a container with preemptible:true that is in Queued or Locked state, even if it has priority>0, because there's a relatively high likelihood it will fail -- especially considering the common pattern of "start a non-preemptible workflow because the preemptible workflow is not getting anywhere". I think it's OK if this is wasteful in the (presumably less common) case of a race while preemptible containers are running well.
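The two rules above can be sketched as a single reuse predicate. This is an illustrative Python sketch only, not the actual Arvados implementation (which is in Go); the function name `may_reuse` and the dict fields are assumptions made for this example:

```python
def may_reuse(request_preemptible: bool, container: dict) -> bool:
    """Hypothetical sketch: decide whether an existing container that
    already matches the usual reuse criteria (command, image, mounts,
    etc.) may satisfy a new container request, given the preemptible
    flags involved. Field names are illustrative."""
    if request_preemptible or not container["preemptible"]:
        # preemptible:true requests, and containers that are already
        # non-preemptible, are unaffected by this change.
        return True
    if container["state"] in ("Queued", "Locked"):
        # A preemptible:false request must not reuse a preemptible:true
        # container that hasn't started yet: it may never be scheduled
        # at all (priority 0), or is likely to fail if spot capacity
        # is short.
        return False
    # A preemptible:true container that is already running, or that
    # completed successfully, remains eligible for reuse.
    return True
```

For example, under this rule a preemptible:false request skips a queued preemptible:true container (`may_reuse(False, {"preemptible": True, "state": "Queued"})` is false) but can still reuse one that already completed.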
Related issues
Updated by Tom Clegg over 1 year ago
- Story points set to 1.0
- Target version set to To be scheduled
- Category set to Crunch
Updated by Tom Clegg over 1 year ago
- Related to Bug #19917: Issues rerunning workflows with UsePreemptible changes from true to false added
Updated by Peter Amstutz over 1 year ago
- Target version changed from To be scheduled to Development 2023-07-05 sprint
Updated by Tom Clegg over 1 year ago
- Status changed from New to In Progress
20606-reuse-preemptible @ 9ef184fa59507f3fe6b19f3b2fe77699f30499ee -- developer-run-tests: #3723
Updated by Tom Clegg over 1 year ago
20606-reuse-preemptible @ c8dc89d59755226183a50de2e5d679548bcb984b -- developer-run-tests: #3724
This covers the "easier situation" and "another easy change" (see description).
"Ideally ... although inconvenient to implement" (i.e., in the "another easy change" case, also cancel the preemptible container and reassign the new non-preemptible container to the pending preemptible request) is not covered.
Updated by Tom Clegg over 1 year ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|0630e05d68440e421d622d1f26e956c65f3d9668.
Updated by Brett Smith about 1 year ago
- Related to Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit added