Feature #21175

open

Do not retry after "unsupported instance type" EC2 errors

Added by Tom Clegg 6 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version:
Story points: -

Description

Currently arvados-dispatch-cloud treats "unsupported instance type" as a transient capacity error, and recovers by trying other subnets and other instance types. This works, but generates unnecessary logging noise and API calls by retrying the same instance type (if still needed) after a hold-off period.

This could be improved by having the ec2 driver set an "instance type T unavailable in subnet S" flag for the life of the arvados-dispatch-cloud process and, when that flag is set, skip the EC2 API call and just try the next subnet or return a capacity error.
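The flag described above could look roughly like the following. This is a minimal sketch, not the actual ec2 driver code: `unsupportedCache`, `tryCreate`, `createFunc`, and `errCapacity` are hypothetical names, and `createFunc` stands in for whatever call the driver really makes to EC2.

```go
package main

import (
	"errors"
	"sync"
)

// unsupportedKey identifies one (instance type, subnet) pair.
type unsupportedKey struct {
	instanceType string
	subnetID     string
}

// unsupportedCache (hypothetical) remembers pairs that EC2 has reported
// as unsupported. Entries never expire, so a flag lasts for the life of
// the process.
type unsupportedCache struct {
	mtx  sync.Mutex
	seen map[unsupportedKey]bool
}

// mark records that instanceType is unsupported in subnetID.
func (c *unsupportedCache) mark(instanceType, subnetID string) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if c.seen == nil {
		c.seen = map[unsupportedKey]bool{}
	}
	c.seen[unsupportedKey{instanceType, subnetID}] = true
}

// isUnsupported reports whether the pair has been flagged.
func (c *unsupportedCache) isUnsupported(instanceType, subnetID string) bool {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	return c.seen[unsupportedKey{instanceType, subnetID}]
}

// createFunc (hypothetical) stands in for the EC2 API call. It returns
// unsupported=true when EC2 reports "unsupported instance type".
type createFunc func(instanceType, subnetID string) (unsupported bool, err error)

// errCapacity (hypothetical) is returned when no subnet could be used.
var errCapacity = errors.New("no capacity in any configured subnet")

// tryCreate tries each subnet in order, skipping the API call for pairs
// already flagged as unsupported. It returns the subnet that succeeded.
func tryCreate(c *unsupportedCache, instanceType string, subnets []string, create createFunc) (string, error) {
	for _, subnet := range subnets {
		if c.isUnsupported(instanceType, subnet) {
			continue // flagged earlier: skip the API call entirely
		}
		unsupported, err := create(instanceType, subnet)
		if unsupported {
			c.mark(instanceType, subnet)
			continue
		}
		if err != nil {
			continue // other errors are treated as transient here
		}
		return subnet, nil
	}
	return "", errCapacity
}
```

With this shape, a second attempt for the same instance type makes no API call for a subnet that was already flagged, which addresses the repeated logging and API traffic.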

If every configured instance type suitable for a given container is unsupported in every subnet, the current version of arvados-dispatch-cloud (a-d-c) will wait futilely for them to appear.

This could be improved by having the ec2 driver and worker pool propagate the "permanently unavailable" state back to the scheduler so it can cancel the container.
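One way the "permanently unavailable" decision might be computed before it is handed to the scheduler is sketched below. The function name `permanentlyUnavailable` and the map-of-pairs representation are assumptions for illustration only; the issue does not specify how the state would be stored or propagated.

```go
package main

// permanentlyUnavailable (hypothetical) reports whether every candidate
// instance type has been flagged as unsupported in every configured
// subnet. The unsupported map is keyed by {instanceType, subnetID}.
//
// It returns false when either list is empty, because in that case
// nothing has been shown to be unsupported.
func permanentlyUnavailable(candidates, subnets []string, unsupported map[[2]string]bool) bool {
	if len(candidates) == 0 || len(subnets) == 0 {
		return false
	}
	for _, instanceType := range candidates {
		for _, subnet := range subnets {
			if !unsupported[[2]string{instanceType, subnet}] {
				// At least one pair might still work, so keep waiting.
				return false
			}
		}
	}
	return true
}
```

A scheduler could call a check like this for a queued container's candidate instance types and cancel the container when it returns true, instead of waiting through further hold-off periods.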


Related issues

Related to Arvados - Feature #20978: Support multiple candidate instance types to assign containers (Resolved, Tom Clegg, 10/31/2023)

#1

Updated by Tom Clegg 6 months ago

  • Related to Feature #20978: Support multiple candidate instance types to assign containers added