Bug #22017
closed
a-d-c needs to handle different quotas for different instance types
Description
Diagnosed the following problem on a user cluster:
a) User submitted a container that requested a very large GPU node. a-d-c got back the following error:
Aug 2 15:35:08 controller arvados-dispatch-cloud[1935906]: {"ClusterID":"xxxxx","InstanceType":"g548xlarge","PID":1935906,"error":"VcpuLimitExceeded: You have requested more vCPU capacity than your current vCPU limit of 64 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.\n\tstatus code: 400, request id: 1ba37a53-2ad1-4b41-9810-4b6befdc0336","level":"error","msg":"create failed","time":"2024-08-02T15:35:08.334899663Z"}
This puts a-d-c in the 'AtQuota' state where it won't start any more instances.
The problem is that a bunch of other containers using standard instance types were in the queue behind this one. a-d-c doesn't distinguish that 'g' instances and the other ones ('m', 'c', 'r', etc.) are subject to separate AWS quotas, so instead of being unable to schedule just the GPU container, it was unable to start any additional instances or schedule any of the other containers.
Proposed solution
The simplest solution is for VcpuLimitExceeded to be treated as a "capacity" error (tied to an individual instance type) instead of a "quota" error (which blocks everything).
This allows for situations where a large container cannot run, but smaller container requests can still be scheduled. The only drawback is that this could lead to a starvation situation where the small containers continue to be scheduled in the gap, preventing sufficient capacity from opening up to run the large container.
Another solution would be to explicitly model instance families (add a "Family" field to InstanceType) and have certain errors affect a group of instance types instead of individual ones. For example, if an m5.16xlarge at the head of the queue can't be scheduled, the whole "m5" family would be blocked, so a-d-c wouldn't launch any smaller "m5" instances either, making it more likely that capacity eventually frees up.
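To make the difference between the two error classes concrete, here is a minimal sketch in Go. It is not a-d-c's actual code: the `scheduler` type, `createErrorKind`, and the method names are all hypothetical, and the state is reduced to the bare minimum needed to show why a "quota" error stalls the whole queue while a "capacity" error only stalls one instance type.

```go
package main

// createErrorKind is a hypothetical classification of a failed
// "create instance" call. It is not a type from a-d-c.
type createErrorKind int

const (
	quotaError    createErrorKind = iota // blocks every new instance
	capacityError                        // blocks only the failing instance type
)

// scheduler holds just enough state to show the difference.
type scheduler struct {
	atQuota    bool
	atCapacity map[string]bool // keyed by instance type name
}

func newScheduler() *scheduler {
	return &scheduler{atCapacity: map[string]bool{}}
}

// createFailed records the effect of a failed create call.
func (s *scheduler) createFailed(instanceType string, kind createErrorKind) {
	switch kind {
	case quotaError:
		s.atQuota = true
	case capacityError:
		s.atCapacity[instanceType] = true
	}
}

// mayCreate reports whether the scheduler would attempt to create
// another instance of the given type.
func (s *scheduler) mayCreate(instanceType string) bool {
	return !s.atQuota && !s.atCapacity[instanceType]
}
```

With the failure recorded as `capacityError` for the GPU type, `mayCreate` still returns true for the standard types queued behind it. Recorded as `quotaError`, it returns false for everything, which is the behavior described in this report.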
Files
Updated by Tom Clegg 6 months ago
According to https://aws.amazon.com/ec2/faqs/#EC2_On-Demand_Instance_limits there are exactly 5 (or perhaps exactly 6) vCPU-based instance limits. We could determine the appropriate instance family automatically from ProviderType.
Updated by Peter Amstutz 5 months ago
- Target version changed from Development 2024-08-28 sprint to Development 2024-09-11 sprint
Updated by Tom Clegg 5 months ago
- Status changed from New to In Progress
22017-instance-type-quotas @ 871a5ab2d9091ec9ba9c23fcfda704270ff0bcc2 -- developer-run-tests: #4427
Still needs tests.
Updated by Tom Clegg 5 months ago
22017-instance-type-quotas @ 2989422c0e7b11c3a2d39079bb9e30779bf45bca -- developer-run-tests: #4431
- All agreed upon points are implemented / addressed.
- The dispatcher adds a concept of "instance family" and a CapacityError interface method that lets a driver report "instance family is at capacity".
- Each driver is responsible for reporting the instance family of a given instance type.
- The EC2 driver derives the instance family from ProviderType, following the AWS docs. Other drivers report a single instance family "".
- The EC2 driver reports VcpuLimitExceeded errors as instance family specific.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- N/A
- Code is tested and passing, both automated and manual, what manual testing was done is described
- ✅ test case added
- Documentation has been updated.
- N/A
- Behaves appropriately at the intended scale (describe intended scale).
- ✅ no scaling impact anticipated
- Considered backwards and forwards compatibility issues between client and server.
- ✅ no compatibility issues
- Follows our coding standards and GUI style guidelines.
- ✅
Added a test case comment in 8e1aba0d8bcb6e0c5689c93868d8afba3af73d90, didn't re-run tests.
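The "CapacityError interface method" idea in the checklist above could look roughly like the following. This is a sketch under assumptions, not the code in the branch: the method name `IsInstanceFamilySpecific`, the `vcpuLimitError` type, and the `familyAtCapacity` helper are invented here for illustration.

```go
package main

import "errors"

// CapacityError sketches an error interface a driver could implement.
// The method names are assumptions made for this example.
type CapacityError interface {
	error
	// IsCapacityError reports whether the caller should wait before
	// trying to create more instances.
	IsCapacityError() bool
	// IsInstanceFamilySpecific reports whether the condition applies
	// only to the instance family of the requested type.
	IsInstanceFamilySpecific() bool
}

// vcpuLimitError is a hypothetical wrapper a driver might return when
// the cloud provider reports VcpuLimitExceeded.
type vcpuLimitError struct{ msg string }

func (e vcpuLimitError) Error() string                  { return e.msg }
func (e vcpuLimitError) IsCapacityError() bool          { return true }
func (e vcpuLimitError) IsInstanceFamilySpecific() bool { return true }

// familyAtCapacity reports whether err says "this instance family is
// at capacity", as opposed to some unrelated failure.
func familyAtCapacity(err error) bool {
	var ce CapacityError
	return errors.As(err, &ce) && ce.IsCapacityError() && ce.IsInstanceFamilySpecific()
}
```

Using `errors.As` rather than a direct type assertion means the check still works if a driver wraps the error with extra context.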
Updated by Lucas Di Pentima 5 months ago
- File AWS EC2 quotas.png added
I've taken a closer look at AWS documentation and EC2 quota page on our sandbox account, and have the following suggestions:
- There're some other families that are poorly documented
  - hpc* is different from h* instance types
  - trn* is different from t* instance types
  - dl* is different from d* instance types
  - u* would be included in the catch-all 'standard' family but they're part of the "high memory instance" quota class. It's not clear which other types are part of this class (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html lists r* instances as part of the memory optimized series but they belong to the standard family), but maybe we should still separate them from 'standard'.
  - vt* would also be included in the catch-all 'standard' family but they belong to a separate quota level.
- Looking at the quota level page, I also see that on-demand instances are differentiated from spot instances. Is a-d-c treating capacity errors from both classes differently? If not, I think it's important to do so, because users could DoS themselves by asking for spot instances of high-demand standard instance types, and all standard instance type usage would get blocked for a while.
Updated by Peter Amstutz 5 months ago
My concern is that AWS will continue to introduce new instance types that the ec2 driver doesn't know about, and then we'll have the same problem.
Is there a reason not to go for a simpler solution of using the part before the dot in the instance type names? It could drop trailing digits so you just have the initial symbol, so 'm5.xlarge' takes the 'm5' and then becomes just 'm'.
Then we could describe quota groups in config file:
QuotaGroup:
  standard:
    m: {}
    c: {}
    t: {}
  x:
    x: {}
  g:
    g: {}
We can have the config default match what we know now, but retain the ability to tweak it in the future.
If an instance is listed in the config file that isn't known to QuotaGroups, it would be an error.
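A rough sketch of this proposal in Go follows. It assumes one reading of "drop trailing digits", namely keeping only the leading run of letters before the dot, and it treats a prefix that is missing from the configured groups as an error. The function names and the `defaultQuotaGroups` variable are hypothetical. The map copies only the groups shown in the example config above.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultQuotaGroups copies the example config above:
// quota group name -> alphabetic prefixes that belong to it.
var defaultQuotaGroups = map[string][]string{
	"standard": {"m", "c", "t"},
	"x":        {"x"},
	"g":        {"g"},
}

// alphaPrefix returns the leading run of ASCII letters in the part of
// the provider type before the first dot, e.g. "m5.xlarge" -> "m".
func alphaPrefix(providerType string) string {
	name, _, _ := strings.Cut(providerType, ".")
	for i, r := range name {
		if (r < 'a' || r > 'z') && (r < 'A' || r > 'Z') {
			return name[:i]
		}
	}
	return name
}

// quotaGroupFor looks up the quota group for a provider type and
// returns an error if its prefix is not listed in any group.
func quotaGroupFor(providerType string, groups map[string][]string) (string, error) {
	prefix := alphaPrefix(providerType)
	for group, prefixes := range groups {
		for _, p := range prefixes {
			if p == prefix {
				return group, nil
			}
		}
	}
	return "", fmt.Errorf("instance type %q: prefix %q is not in any quota group", providerType, prefix)
}
```

A config loader could call `quotaGroupFor` once per configured instance type at startup, so an unlisted prefix is reported when the config is loaded rather than when a container is scheduled.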
Updated by Lucas Di Pentima 5 months ago
Here's documentation about instance type name components: https://docs.aws.amazon.com/ec2/latest/instancetypes/instance-type-names.html
Detaching type families' series from quota groups with a default group-to-families mapping sounds like a good idea.
Updated by Tom Clegg 5 months ago
Updated using https://docs.aws.amazon.com/ec2/latest/instancetypes/ec2-instance-quotas.html which does include the families missing from the other doc page (hpc, trn, dl, u, vt).
Updated to track quota separately for spot and demand instances.
22017-instance-type-quotas @ 949658f3cede442bf1da4d67a2c9892a1b61945a -- developer-run-tests: #4435
Updated by Tom Clegg 5 months ago
It would be extra nice if AWS offered this information via API, but they don't seem to.
I think it would be more succinct to write the config map as alphaprefix→family, so the default config would be
InstanceTypeFamilies:
  a: standard
  c: standard
  d: standard
  g: g
  inf: inf
  vt: g
  ...
Updated by Tom Clegg 5 months ago
Alternatively, we could simply treat unrecognized alphabetic prefixes as being their own distinct instance families, instead of defaulting to "standard". It only takes one extra "create instance" attempt/error to set the "new family [that's not actually a distinct family] is also at capacity" flag, so behavior with an outdated list would be nearly indistinguishable from the desired behavior.
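The fallback described here is small enough to sketch directly. The map below contains only the entries listed in the example config above (the elided entries are left out), and `familyFor` is a hypothetical helper name. It takes an already-extracted alphabetic prefix as input.

```go
package main

// instanceTypeFamilies holds the prefix -> family entries shown in the
// example config above. The real default list would be longer.
var instanceTypeFamilies = map[string]string{
	"a":   "standard",
	"c":   "standard",
	"d":   "standard",
	"g":   "g",
	"inf": "inf",
	"vt":  "g",
}

// familyFor returns the configured family for an alphabetic prefix.
// An unrecognized prefix is treated as its own distinct family
// instead of defaulting to "standard".
func familyFor(prefix string, families map[string]string) string {
	if family, ok := families[prefix]; ok {
		return family
	}
	return prefix
}
```

With this fallback, a new prefix that actually shares a quota with an existing family costs one extra failed create call before both are marked at capacity, which is the trade-off described above.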
Updated by Peter Amstutz 5 months ago
- Target version changed from Development 2024-09-11 sprint to Development 2024-09-25 sprint
Updated by Tom Clegg 5 months ago
- rename "instance family" to "instance quota group" because "instance family" means something different in AWS
- allow configuring additional instance quota groups
- treat unrecognized alphabetic prefixes as being their own distinct instance quota groups
- for demand instances, P2, P3, P4, P5 are all in the same quota group
- for spot instances, P2, P3, P4 are all in the same quota group, and P5 is in its own quota group
- Rather than invent a config language to express this, the current implementation puts P5 in its own instance group for both spot and demand instances. This means a few avoidable failed "create" calls in the demand case, but I don't think that's a big deal.
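As an illustration of the P-series rule and the spot/demand split above, here is a minimal sketch. The function names and the "-spot" suffix used to keep the two purchasing models apart are assumptions for this example, not the naming used in the branch.

```go
package main

import "strings"

// pSeriesGroup returns the quota group for a P-series family name
// such as "p3", "p4d" or "p5". P5 gets its own group for both spot
// and demand instances, matching the simplification described above.
func pSeriesGroup(family string) string {
	if strings.HasPrefix(family, "p5") {
		return "p5"
	}
	return "p"
}

// quotaKey combines a quota group with the purchasing model so that
// spot and demand instances are tracked separately. The "-spot"
// suffix is a hypothetical naming choice.
func quotaKey(group string, spot bool) string {
	if spot {
		return group + "-spot"
	}
	return group
}
```

Here a demand-instance failure for a P4 type marks the key "p" at capacity, while P5 demand instances use the key "p5" and are only marked after their own failed create call.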
Updated by Peter Amstutz 5 months ago
Just curious what happens here:
- families A and B are in the same quota group
- the config file doesn't reflect that, so they're considered to be in separate quota groups
- we create an instance of type A, the quota is now full
- we try to create an instance of type B but get back a "quota full" error
- the "B" quota group is now in "at quota" state, which holds off on creating further instances until an instance in quota group "B" has been shut down
Is it now stuck forever? Under what conditions will it try again to create an instance of type B? Do we need a special case where we get a quota error but zero instances have been started?
Updated by Tom Clegg 5 months ago
Peter Amstutz wrote in #note-16:
Is it now stuck forever? Under what conditions will it try again to create an instance of type B?
Our "at capacity for group B" state lasts for one minute. At that point we wake up the scheduler, and (assuming it's still needed) it will try to create a new instance of type B. If the cloud provider returns another VcpuLimitExceeded error, repeat.
This might be a little confusing because although our "quota error" handling does wait for an instance shutdown before trying again, and the AWS language suggests this is a quota error, we are now treating this as a "capacity error" instead, which means we just wait 1 minute and try again.
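The one-minute hold can be sketched as a per-group expiry time. This is an illustration only: `capacityTracker`, its methods, and the `capacityHold` constant are hypothetical names, and waking up the scheduler is left out.

```go
package main

import "time"

// capacityHold is how long a quota group stays "at capacity" after a
// failed create call, per the explanation above.
const capacityHold = time.Minute

// capacityTracker remembers, per quota group, when the "at capacity"
// state expires.
type capacityTracker struct {
	until map[string]time.Time
}

// markAtCapacity records a capacity error for group at time now and
// returns the time at which another create attempt is allowed.
func (t *capacityTracker) markAtCapacity(group string, now time.Time) time.Time {
	if t.until == nil {
		t.until = map[string]time.Time{}
	}
	expiry := now.Add(capacityHold)
	t.until[group] = expiry
	return expiry
}

// atCapacity reports whether group is still being held off at time now.
func (t *capacityTracker) atCapacity(group string, now time.Time) bool {
	return now.Before(t.until[group])
}
```

Once `atCapacity` returns false for group B, the scheduler would simply try another create call. If the provider returns VcpuLimitExceeded again, `markAtCapacity` starts a new one-minute hold, so the group is retried periodically rather than getting stuck.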
Updated by Peter Amstutz 5 months ago
Tom Clegg wrote in #note-17:
Peter Amstutz wrote in #note-16:
Is it now stuck forever? Under what conditions will it try again to create an instance of type B?
Our "at capacity for group B" state lasts for one minute. At that point we wake up the scheduler, and (assuming it's still needed) it will try to create a new instance of type B. If the cloud provider returns another VcpuLimitExceeded error, repeat.
This might be a little confusing because although our "quota error" handling does wait for an instance shutdown before trying again, and the AWS language suggests this is a quota error, we are now treating this as a "capacity error" instead, which means we just wait 1 minute and try again.
Thanks. I had suggested making it a capacity error in the initial writeup, but I had not looked at the code to see that this was exactly what you ended up doing.
Updated by Tom Clegg 5 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|d9a9564d93dad861fa654d87387eecc2e4ff93eb.