Bug #22017
closed
a-d-c needs to handle different quotas for different instance types
Description
Diagnosed the following problem on a user cluster:
a) User submitted a container that requested a very large GPU node. a-d-c got back the following error:
Aug 2 15:35:08 controller arvados-dispatch-cloud1935906: {"ClusterID":"xxxxx","InstanceType":"g548xlarge","PID":1935906,"error":"VcpuLimitExceeded: You have requested more vCPU capacity than your current vCPU limit of 64 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.\n\tstatus code: 400, request id: 1ba37a53-2ad1-4b41-9810-4b6befdc0336","level":"error","msg":"create failed","time":"2024-08-02T15:35:08.334899663Z"}
This puts a-d-c in the 'AtQuota' state where it won't start any more instances.
The problem is that a number of other containers using standard instance types were queued behind this one. Because a-d-c doesn't recognize that 'g' instances and the others ('m', 'c', 'r', etc.) are subject to separate AWS quotas, it was unable to start any additional instances or schedule any of the other containers, when it should only have been unable to schedule the GPU container.
Proposed solution
The simplest solution is for VcpuLimitExceeded to be treated as a "capacity" error (tied to an individual instance type) instead of a "quota" error (which blocks everything).
This allows for situations where a large container cannot run but smaller container requests can still be scheduled. The only drawback is that it could lead to a starvation situation, where small containers continue to be scheduled in the gap and prevent sufficient capacity from opening up to run the large container.
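As a rough illustration of the "capacity instead of quota" idea, here is a minimal, self-contained Go sketch. It is not the a-d-c code: the names scope, classify, blocks, record and canCreate are hypothetical, the instance type name "m5large" is made up for the example, and the choice to keep InstanceLimitExceeded as a global error is an assumption. A real dispatcher would also need to expire blocks after some time, which is left out here.

```go
package main

import (
	"fmt"
	"strings"
)

// scope says how much of the scheduler a failed create should block.
type scope int

const (
	scopeNone         scope = iota // unrelated error: block nothing
	scopeInstanceType              // "capacity": block only the failing type
	scopeGlobal                    // "quota": block all new instances
)

// classify is a hypothetical helper keyed on the AWS error code that
// prefixes the message. Keeping InstanceLimitExceeded global is an
// assumption made for illustration.
func classify(msg string) scope {
	switch {
	case strings.HasPrefix(msg, "VcpuLimitExceeded:"):
		return scopeInstanceType
	case strings.HasPrefix(msg, "InstanceLimitExceeded:"):
		return scopeGlobal
	}
	return scopeNone
}

// blocks records which instance types may not be created right now.
type blocks struct {
	all    bool
	byType map[string]bool
}

func newBlocks() *blocks { return &blocks{byType: map[string]bool{}} }

// record notes a failed create for the given instance type.
func (b *blocks) record(instanceType, msg string) {
	switch classify(msg) {
	case scopeInstanceType:
		b.byType[instanceType] = true
	case scopeGlobal:
		b.all = true
	}
}

// canCreate reports whether a new instance of this type may be attempted.
func (b *blocks) canCreate(instanceType string) bool {
	return !b.all && !b.byType[instanceType]
}

func main() {
	b := newBlocks()
	b.record("g548xlarge", "VcpuLimitExceeded: You have requested more vCPU capacity than your current vCPU limit of 64 allows")
	fmt.Println(b.canCreate("g548xlarge"), b.canCreate("m5large"))
}
```

With the error from the log above recorded against g548xlarge, this prints "false true": the GPU type is held back while a standard type can still be started.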
Another solution would be to explicitly model instance families (add a "Family" field to InstanceType) and have certain errors affect a group of instance types instead of individual ones. This would address the starvation situation: for example, if an m5.16xlarge is at the head of the queue but can't be scheduled, and the whole "m5" family is blocked, a-d-c wouldn't launch any smaller "m5" instances either, making it more likely that capacity eventually frees up.
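The family-based variant could look roughly like the following Go sketch. Everything here is hypothetical: InstanceType is a cut-down stand-in for the real configuration type, Family is the proposed new field, and familyBlocks with its methods exists only for this example.

```go
package main

import "fmt"

// InstanceType is a cut-down, hypothetical stand-in for the cluster's
// instance type configuration. Family is the proposed new field.
type InstanceType struct {
	Name   string
	Family string
}

// familyBlocks is the set of families that may not start new instances.
type familyBlocks map[string]bool

// blockFamilyOf blocks every instance type sharing it's family.
func (fb familyBlocks) blockFamilyOf(it InstanceType) {
	fb[it.Family] = true
}

// canCreate reports whether a new instance of this type may be attempted.
func (fb familyBlocks) canCreate(it InstanceType) bool {
	return !fb[it.Family]
}

func main() {
	big := InstanceType{Name: "m5.16xlarge", Family: "m5"}
	small := InstanceType{Name: "m5.large", Family: "m5"}
	gpu := InstanceType{Name: "g5.xlarge", Family: "g5"}

	fb := familyBlocks{}
	// The create for the large type failed with a quota-style error,
	// so hold back the whole "m5" family.
	fb.blockFamilyOf(big)

	for _, it := range []InstanceType{big, small, gpu} {
		fmt.Printf("%s: canCreate=%v\n", it.Name, fb.canCreate(it))
	}
}
```

Here the smaller m5.large is held back along with the m5.16xlarge, so it cannot consume capacity the large container is waiting for, while the g5 type is unaffected.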