Bug #22017

closed

a-d-c needs to handle different quotas for different instance types

Added by Peter Amstutz 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Story points: -
Release:
Release relationship: Auto

Description

Diagnosed the following problem on a user cluster:

a) User submitted a container that requested a very large GPU node. a-d-c got back the following error:

Aug 2 15:35:08 controller arvados-dispatch-cloud1935906: {"ClusterID":"xxxxx","InstanceType":"g548xlarge","PID":1935906,"error":"VcpuLimitExceeded: You have requested more vCPU capacity than your current vCPU limit of 64 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.\n\tstatus code: 400, request id: 1ba37a53-2ad1-4b41-9810-4b6befdc0336","level":"error","msg":"create failed","time":"2024-08-02T15:35:08.334899663Z"}

This puts a-d-c in the 'AtQuota' state where it won't start any more instances.

The problem is that a number of other containers requesting standard instance types were queued behind this one. a-d-c does not recognize that 'g' instances and the other families ('m', 'c', 'r', etc.) are subject to separate AWS quotas. So instead of being unable to schedule just the GPU container, it was unable to start any additional instances or schedule any of the other containers.

Proposed solution

The simplest solution is for VcpuLimitExceeded to be treated as a "capacity" error (tied to an individual instance type) instead of a "quota" error (which blocks everything).
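A minimal sketch of that distinction, not the actual a-d-c code: the `scope` type, `classify`, and `blocker` are hypothetical names made up for illustration. `VcpuLimitExceeded` comes from the log above; grouping `InstanceLimitExceeded` as a block-everything error is an assumption, and `m5.large` is just an example type.

```go
package main

import "fmt"

// scope says how much of the scheduler a failed create should block.
type scope int

const (
	scopeNone         scope = iota // unrelated error: block nothing
	scopeInstanceType              // "capacity": block only the failing type
	scopeEverything                // "quota": block all new instances
)

// classify maps a cloud error code to a blocking scope.
// The grouping is illustrative, not the real driver's table.
func classify(code string) scope {
	switch code {
	case "VcpuLimitExceeded":
		return scopeInstanceType
	case "InstanceLimitExceeded": // assumed to be account-wide
		return scopeEverything
	default:
		return scopeNone
	}
}

// blocker records which instance types are currently unschedulable.
type blocker struct {
	all   bool
	types map[string]bool
}

// noteFailure updates the blocked state after a failed create call.
func (b *blocker) noteFailure(instanceType, code string) {
	switch classify(code) {
	case scopeEverything:
		b.all = true
	case scopeInstanceType:
		if b.types == nil {
			b.types = map[string]bool{}
		}
		b.types[instanceType] = true
	}
}

// canCreate reports whether a new instance of this type may be requested.
func (b *blocker) canCreate(instanceType string) bool {
	return !b.all && !b.types[instanceType]
}

func main() {
	var b blocker
	b.noteFailure("g548xlarge", "VcpuLimitExceeded")
	fmt.Println(b.canCreate("g548xlarge")) // false
	fmt.Println(b.canCreate("m5.large"))   // true
}
```

With this classification, the failed GPU request blocks only its own instance type, so the standard-instance containers queued behind it can still get nodes.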

This allows for situations where a large container cannot run, but smaller container requests can still be scheduled. The only drawback is that this could lead to a starvation situation, where small containers continue to be scheduled in the gap and prevent sufficient capacity from opening up to run the large container.

Another solution would be to explicitly model instance families (add a "Family" field to InstanceType) and have certain errors affect a group of instance types instead of individual ones. This would address the starvation case: if, for example, an m5.16xlarge is at the head of the queue and can't be scheduled, the whole "m5" family would be blocked, so a-d-c wouldn't launch any smaller "m5" instances either, making it more likely for capacity to eventually free up.
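The family-based variant could look roughly like the sketch below. The `InstanceType` struct here is a cut-down, hypothetical stand-in for the real configuration type, `Family` is the proposed (not yet existing) field, and `familyBlocker` is an illustrative name. The `m5.large` and `g5.xlarge` entries are example values only.

```go
package main

import "fmt"

// InstanceType is a reduced stand-in for the real config struct.
// Family is the field proposed in this ticket.
type InstanceType struct {
	Name   string
	Family string
}

// familyBlocker tracks which instance families have hit a quota.
type familyBlocker struct {
	blocked map[string]bool
}

// noteQuotaFailure blocks every type sharing the failing type's family.
func (fb *familyBlocker) noteQuotaFailure(it InstanceType) {
	if fb.blocked == nil {
		fb.blocked = map[string]bool{}
	}
	fb.blocked[it.Family] = true
}

// canCreate reports whether the type's family is still unblocked.
func (fb *familyBlocker) canCreate(it InstanceType) bool {
	return !fb.blocked[it.Family]
}

func main() {
	big := InstanceType{Name: "m5.16xlarge", Family: "m5"}
	small := InstanceType{Name: "m5.large", Family: "m5"}
	gpu := InstanceType{Name: "g5.xlarge", Family: "g5"}

	var fb familyBlocker
	fb.noteQuotaFailure(big)

	fmt.Println(fb.canCreate(small)) // false: same family as the blocked type
	fmt.Println(fb.canCreate(gpu))   // true: different family, unaffected
}
```

Blocking by family keeps small "m5" requests from filling the gap ahead of the large one, while other families keep scheduling normally.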


Files

AWS EC2 quotas.png (272 KB), Lucas Di Pentima, 09/09/2024 04:28 PM

Subtasks 1 (0 open, 1 closed)

Task #22036: Review 22017-instance-type-quotas, Resolved, Tom Clegg, 09/13/2024