Feature #21460

open

Spot instance reclamation triggers "at capacity" cooloff

Added by Peter Amstutz 3 months ago. Updated 29 days ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Story points: -

Description

When a spot instance is reclaimed, we don't want to treat it as an instance failure and retry immediately.

Instead, we want to wait a short time before trying to acquire that instance type again. The wait should be configurable, maybe by just adding capacityErrorTTL to the config file.

When a container has indicated that it was cancelled because its instance was reclaimed, the spot instance type it was running on should be marked as "at capacity".

If an attempt to allocate a spot instance fails with a "can't get spot instance" error, we should also set the "at capacity" state.

notes:

This is what the runner does when it receives a preemption notice. The documentation suggests checking for the existence of the preemptionNotice key in the container's runtime_status.

            runner.updateRuntimeStatus(arvadosclient.Dict{
                "warning":          "preemption notice",
                "warningDetail":    text,
                "preemptionNotice": text,
            })
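
On the consuming side, the check suggested by the documentation might look something like the sketch below. It assumes runtime_status has been decoded from JSON into a plain map; `wasPreempted` is a hypothetical helper name, and the sample JSON is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wasPreempted is a hypothetical helper: it reports whether a container's
// runtime_status contains the preemptionNotice key set by the runner.
// Only the existence of the key is checked, not its value.
func wasPreempted(runtimeStatus map[string]interface{}) bool {
	_, ok := runtimeStatus["preemptionNotice"]
	return ok
}

func main() {
	// Made-up runtime_status payload shaped like the update shown above.
	raw := []byte(`{
		"warning": "preemption notice",
		"warningDetail": "example text",
		"preemptionNotice": "example text"
	}`)
	var rs map[string]interface{}
	if err := json.Unmarshal(raw, &rs); err != nil {
		panic(err)
	}
	fmt.Println(wasPreempted(rs)) // prints "true"
}
```

When this check succeeds for a cancelled container, the instance type it ran on would be the one to mark as "at capacity".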

Related issues

Related to Arvados Epics - Idea #18179: Better spot instance support (In Progress, 03/01/2022 to 06/30/2024)