Bug #20606

closed

Unstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests

Added by Tom Clegg 11 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Story points: 1.0
Release relationship: Auto

Description

To reproduce:
  • Wait for a time when there are no preemptible instances available in the size needed by your workflow
  • Start workflow with preemptible instances enabled
  • Wait for the workflow to create a lot of child container requests
  • Cancel the workflow (this might be optional: the workflow might cancel by itself when a child container reaches maximum lock/unlock cycles)
  • Wait for all child containers to get cancelled
  • Restart the workflow with preemptible instances disabled
  • Wait for the workflow to create child container requests
  • The child container requests reuse the existing (queued, priority 0) containers with preemptible:true, and the preemptible instances still aren't available, so they continue to fail after a few lock/unlock cycles

In #19917 we ensured a new container would be scheduled with preemptible:false in this case. However, that doesn't help at all if the container requests aren't going to auto-retry because they have container_count:1.

Ideally, if we have a pair of requests (one preemptible:true, one preemptible:false) and a queued container with preemptible:true, it would probably be better to start a container with preemptible:false and use it for both requests, although I think this will be inconvenient to implement.

The easier situation, which we encountered today, has a container that matches reuse criteria but has preemptible:true and isn't about to run because it has priority 0 (i.e., whatever CR requested it has since been cancelled/failed). In this situation we should create a new container with preemptible:false instead of using the existing one.

Another easy change: when a request has preemptible:false, don't reuse a container with preemptible:true that is in Queued or Locked state, even if it has priority>0, because there's a relatively high likelihood it will fail (especially considering the common pattern of "start non-preemptible workflow because preemptible workflow is not getting anywhere"). I think it's OK if this is wasteful in the (less common?) case of a race while preemptible containers are running well.
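The two easy changes above could be sketched roughly as follows. This is only an illustration of the proposed rules in Python, not the actual reuse code: the `Container` class, its flattened `preemptible` field, and the `may_reuse` function are hypothetical names, and the candidate is assumed to already match the normal reuse criteria. It also assumes that a preemptible:true container that has already started running (or finished) remains reusable, which the description implies but does not state.

```python
from dataclasses import dataclass


@dataclass
class Container:
    """Hypothetical, simplified view of a candidate container."""
    state: str         # e.g. "Queued", "Locked", "Running", "Complete"
    priority: int      # 0 means no live request currently wants it
    preemptible: bool  # assumed flattening of the preemptible scheduling flag


def may_reuse(request_preemptible: bool, c: Container) -> bool:
    """Return True if a request may reuse candidate container c.

    Assumes c already satisfies the ordinary reuse criteria.
    """
    # The proposed rules only restrict preemptible:false requests
    # looking at preemptible:true containers.
    if request_preemptible or not c.preemptible:
        return True

    not_started = c.state in ("Queued", "Locked")

    # First easy change: the container has priority 0, so it is not
    # about to run. Create a new preemptible:false container instead.
    if not_started and c.priority == 0:
        return False

    # Second easy change: even with priority > 0, a container that has
    # not started yet is relatively likely to fail, so skip it too.
    if not_started:
        return False

    # Assumption: containers that are already running or finished
    # are still acceptable for reuse.
    return True
```

With both rules in place the priority check is redundant (the second rule subsumes the first), but it is kept separate here so each change can be read on its own.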


Subtasks 1 (0 open, 1 closed)

Task #20682: Review 20606-reuse-preemptible (Resolved, Tom Clegg, 06/27/2023)

Related issues

Related to Arvados - Bug #19917: Issues rerunning workflows with UsePreemptible changes from true to false (Resolved, Brett Smith, 01/19/2023)
Related to Arvados - Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit (In Progress, Alex Coleman)
