Feature #21460

open

Spot instance reclamation triggers "at capacity" cooloff

Added by Peter Amstutz 3 months ago. Updated 17 days ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Story points: -

Description

When a spot instance is reclaimed, we don't want to treat it as an instance failure and retry immediately.

Instead, we want to wait a little bit (should be configurable, maybe just add capacityErrorTTL to the config file?) before trying to acquire that instance type again.

When a container has indicated that it was cancelled because its instance was reclaimed, the spot instance type it was running on should be marked as "at capacity".

If an attempt to allocate a spot instance fails with a "can't get spot instance" error, we should also set the "at capacity" state.

notes:

This is what the runner does when it receives a preemption notice. The documentation suggests checking for the existence of the preemptionNotice key in the container's runtime_status.

            runner.updateRuntimeStatus(arvadosclient.Dict{
                "warning":          "preemption notice",
                "warningDetail":    text,
                "preemptionNotice": text,
            })
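A minimal sketch of the suggested check, assuming the container's runtime_status has already been decoded into a plain Go map. The helper name `wasPreempted` and the sample values are hypothetical; only the `preemptionNotice` key comes from the snippet above.

```go
package main

import "fmt"

// wasPreempted reports whether a decoded runtime_status map contains
// the "preemptionNotice" key. Hypothetical helper for illustration.
func wasPreempted(runtimeStatus map[string]interface{}) bool {
	_, ok := runtimeStatus["preemptionNotice"]
	return ok
}

func main() {
	// Sample values are made up; real notices carry cloud-specific text.
	reclaimed := map[string]interface{}{
		"warning":          "preemption notice",
		"preemptionNotice": "example notice text",
	}
	normal := map[string]interface{}{
		"warning": "some other warning",
	}
	fmt.Println(wasPreempted(reclaimed))
	fmt.Println(wasPreempted(normal))
}
```

A caller that sees `wasPreempted` return true for a cancelled container could then mark that container's spot instance type as "at capacity" rather than counting an instance failure.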

Related issues

Related to Arvados Epics - Idea #18179: Better spot instance support (In Progress, 03/01/2022 - 06/30/2024)
#1

Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
#2

Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
#3

Updated by Peter Amstutz 3 months ago

  • Subject changed from Put a temporary hold on an instance type when spot instance reclamation is detected to spot instance reclamation is triggers "at capacity" cooloff
#4

Updated by Peter Amstutz 3 months ago

  • Related to Idea #18179: Better spot instance support added
#5

Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
#6

Updated by Peter Amstutz 3 months ago

  • Target version changed from Future to Development 2024-03-27 sprint
#7

Updated by Brett Smith 3 months ago

  • Description updated (diff)
#9

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-04-24 sprint
#10

Updated by Peter Amstutz 18 days ago

  • Target version changed from Development 2024-04-24 sprint to Development 2024-05-08 sprint
#11

Updated by Peter Amstutz 18 days ago

  • Target version changed from Development 2024-05-08 sprint to Development 2024-06-05 sprint
#12

Updated by Peter Amstutz 17 days ago

  • Target version changed from Development 2024-06-05 sprint to Future