Bug #20606

closed

Unstartable preemptible:true containers should not be reused by non-retryable preemptible:false requests

Added by Tom Clegg over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Story points: 1.0
Release relationship: Auto

Description

To reproduce:
  • Wait for a time when there are no preemptible instances available in the size needed by your workflow
  • Start workflow with preemptible instances enabled
  • Wait for the workflow to create a lot of child container requests
  • Cancel the workflow (this might be optional: the workflow might cancel by itself when a child container reaches maximum lock/unlock cycles)
  • Wait for all child containers to get cancelled
  • Restart the workflow with preemptible instances disabled
  • Wait for the workflow to create child container requests
  • The child container requests reuse the existing (queued, priority 0) containers with preemptible:true, and the preemptible instances still aren't available, so they continue to fail after a few lock/unlock cycles

In #19917 we ensured a new container would be scheduled with preemptible:false in this case. However, that doesn't help at all if the container requests won't auto-retry because they have container_count:1.

Ideally, if we have a pair of requests (preemptible:true and preemptible:false) and a queued container with preemptible:true, it would probably be better to start a container with preemptible:false and use it for both requests, although I think this will be inconvenient to implement.

The easier situation, which we encountered today, has a container that matches reuse criteria but has preemptible:true and isn't about to run because it has priority 0 (i.e., whatever CR requested it has since been cancelled/failed). In this situation we should create a new container with preemptible:false instead of using the existing one.

Another easy change: when a request has preemptible:false, don't reuse a container with preemptible:true that is in Queued or Locked state, even if it has priority>0, because there's a relatively high likelihood it will fail (especially considering the common pattern of "start non-preemptible workflow because preemptible workflow is not getting anywhere"). I think it's OK if this is wasteful in the (less common?) case of a race while preemptible containers are running well.
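
As a rough illustration of the two rules above (the "easier situation" and the "another easy change"), here is a minimal Python sketch of the reuse decision. It is not the actual Arvados implementation: the `Container` class, `reusable_for`, and `pick_container` are hypothetical names made up for this example, and only the state names and the preemptible/priority attributes come from the description.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# State names as used in the description; the constants themselves are hypothetical.
QUEUED, LOCKED, RUNNING = "Queued", "Locked", "Running"


@dataclass
class Container:
    """Hypothetical stand-in for a container record that matches reuse criteria."""
    state: str
    priority: int
    preemptible: bool


def reusable_for(request_preemptible: bool, c: Container) -> bool:
    """Decide whether an otherwise-matching container may serve a request."""
    if request_preemptible or not c.preemptible:
        # Neither rule applies: the rules only concern a preemptible:false
        # request looking at a preemptible:true container.
        return True
    if c.state == QUEUED and c.priority == 0:
        # "Easier situation": the container is not about to run because
        # whatever requested it has since been cancelled or failed.
        return False
    if c.state in (QUEUED, LOCKED):
        # "Another easy change": not started yet, so skip it even when
        # priority > 0. (This check also covers the case above; both are
        # spelled out here only to mirror the two paragraphs.)
        return False
    return True


def pick_container(request_preemptible: bool,
                   candidates: Iterable[Container]) -> Optional[Container]:
    """Return the first reusable candidate, or None to signal that a new
    container (with the request's own preemptible setting) should be created."""
    for c in candidates:
        if reusable_for(request_preemptible, c):
            return c
    return None
```

In this sketch, a preemptible:false request that only finds queued or locked preemptible:true candidates gets `None` back, which corresponds to "create a new container with preemptible:false instead of using the existing one". A preemptible:true container that is already running is still considered reusable, since the description only excludes the Queued and Locked states.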


Subtasks: 1 (0 open, 1 closed)

Task #20682: Review 20606-reuse-preemptible (Resolved, Tom Clegg, 06/27/2023)

Related issues: 2 (0 open, 2 closed)

Related to Arvados - Bug #19917: Issues rerunning workflows with UsePreemptible changes from true to false (Resolved, Brett Smith, 01/19/2023)
Related to Arvados - Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit (Resolved, Peter Amstutz)
Actions #1

Updated by Tom Clegg over 1 year ago

  • Story points set to 1.0
  • Target version set to To be scheduled
  • Category set to Crunch
Actions #2

Updated by Tom Clegg over 1 year ago

  • Related to Bug #19917: Issues rerunning workflows with UsePreemptible changes from true to false added
Actions #3

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz over 1 year ago

  • Target version changed from To be scheduled to Development 2023-07-05 sprint
Actions #5

Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Tom Clegg
Actions #6

Updated by Tom Clegg over 1 year ago

  • Status changed from New to In Progress
Actions #7

Updated by Tom Clegg over 1 year ago

20606-reuse-preemptible @ c8dc89d59755226183a50de2e5d679548bcb984b -- developer-run-tests: #3724

This covers the "easier situation" and "another easy change" (see description).

"Ideally ... although inconvenient to implement" (i.e., in the "another easy change" case, also cancel the preemptible container and reassign the new non-preemptible container to the pending preemptible request) is not covered.

Actions #8

Updated by Lucas Di Pentima over 1 year ago

This LGTM, thanks!

Actions #9

Updated by Tom Clegg over 1 year ago

  • Status changed from In Progress to Resolved
Actions #10

Updated by Peter Amstutz over 1 year ago

  • Release set to 66
Actions #11

Updated by Brett Smith over 1 year ago

  • Related to Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit added