Feature #21175
Do not retry after "unsupported instance type" EC2 errors
Status: open
Description
Currently arvados-dispatch-cloud treats "unsupported instance type" as a transient capacity error, and recovers by trying other subnets and other instance types. This works, but generates unnecessary logging noise and API calls by retrying the same instance type (if still needed) after a hold-off period.
This could be improved by having the ec2 driver set an "instance type T unavailable in subnet S" flag for the life of the arvados-dispatch-cloud process and, when that flag is set, skip the EC2 API call and just try the next subnet or return a capacity error.
If all configured instance types suitable for a given container are unsupported in all subnets, the current version of a-d-c will wait futilely for them to become available.
This could be improved by having the ec2 driver and worker pool propagate the "permanently unavailable" state back to the scheduler so it can cancel the container.
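One way the scheduler-side decision could look, as a sketch under stated assumptions: `permanentlyUnavailable`, `decide`, `typeSubnetKey` and the `action` values are hypothetical names, and the `unsupported` map is assumed to be whatever the driver and worker pool report upward. The container is cancelled only when every candidate instance type is flagged in every subnet; otherwise it keeps waiting as it does today.

```go
package main

// typeSubnetKey identifies one (instance type, subnet) combination.
type typeSubnetKey struct{ instanceType, subnet string }

// action is what the scheduler does with a queued container.
type action string

const (
	actionWait   action = "wait"
	actionCancel action = "cancel"
)

// permanentlyUnavailable reports whether every candidate instance type
// has been flagged as unsupported in every configured subnet. With no
// candidates or no subnets there is no evidence either way, so it
// returns false.
func permanentlyUnavailable(unsupported map[typeSubnetKey]bool, types, subnets []string) bool {
	if len(types) == 0 || len(subnets) == 0 {
		return false
	}
	for _, t := range types {
		for _, s := range subnets {
			if !unsupported[typeSubnetKey{t, s}] {
				return false // at least one combination might still work
			}
		}
	}
	return true
}

// decide returns actionCancel only when waiting cannot possibly help.
func decide(unsupported map[typeSubnetKey]bool, types, subnets []string) action {
	if permanentlyUnavailable(unsupported, types, subnets) {
		return actionCancel
	}
	return actionWait
}
```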
Related issues
Updated by Tom Clegg 11 months ago
- Related to Feature #20978: Support multiple candidate instance types to assign containers added