Bug #20984

closed

Do not treat InsufficientInstanceCapacity as quota error

Added by Peter Amstutz 7 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Crunch
Story points: 1.0
Release relationship: Auto

Description

InsufficientInstanceCapacity means there isn't an instance available of a given instance type, but we might be able to boot a different instance type, and that instance type might become available if other cloud users shut down instances.

Thus it doesn't make sense to go into the "at quota" condition which freezes instance starts until we shut down one of our own instances. This may be as simple as removing it from the isCodeCapacity map in ec2.go.
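
For illustration, the classification might look something like this (a sketch only; the map names and exact lists of error codes are illustrative, not necessarily what ec2.go contains):

    // Sketch only: map names and code lists are illustrative.
    package ec2sketch

    // Error codes that mean our own account is at quota: stop creating
    // instances until one of ours shuts down.
    var isCodeQuota = map[string]bool{
        "InstanceLimitExceeded":        true,
        "MaxSpotInstanceCountExceeded": true,
        "VcpuLimitExceeded":            true,
    }

    // Error codes that mean the provider is short on capacity for the
    // requested instance type. InsufficientInstanceCapacity belongs here,
    // not in the quota map: another instance type may still be available,
    // and this type may come back without any action on our part.
    var isCodeCapacity = map[string]bool{
        "InsufficientInstanceCapacity": true,
    }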


Subtasks 1 (0 open, 1 closed)

Task #21006: Review 20984-instance-capacity (Resolved, Peter Amstutz, 10/17/2023)

Related issues

Related to Arvados Epics - Idea #20599: Scaling to 1000s of concurrent containers (Resolved, 06/01/2023 to 03/31/2024)
Actions #1

Updated by Peter Amstutz 7 months ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz 7 months ago

  • Status changed from In Progress to New
  • Category set to Crunch
Actions #3

Updated by Peter Amstutz 7 months ago

  • Related to Idea #20599: Scaling to 1000s of concurrent containers added
Actions #4

Updated by Peter Amstutz 7 months ago

  • Story points set to 1.0
  • Target version changed from Future to To be scheduled
  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 7 months ago

  • Release set to 67
Actions #6

Updated by Peter Amstutz 7 months ago

  • Target version changed from To be scheduled to Development 2023-10-11 sprint
Actions #7

Updated by Peter Amstutz 7 months ago

  • Assigned To set to Tom Clegg
Actions #8

Updated by Peter Amstutz 7 months ago

The current behavior has some very bad failure modes. A user launched a pipeline which asked for a large node (m4.10xlarge) and got InsufficientInstanceCapacity after only 3 instances had been created; this caused the dispatcher to completely stop trying to start nodes and lowered the dynamic max instances to 3. As a result the pipeline became starved: the instances already running were waiting on a new worker instance to start, but the dispatcher was waiting for an instance to shut down before it would try starting a new one.

Instead of going completely silent on a quota error, I think we want to either go back to the old behavior (a 1 minute quiet period) or implement exponential backoff (wait for 15 seconds, then 30 seconds, then 60 seconds, then 2 minutes, etc.). An instance shutdown can still be used as a signal to try starting a new instance while we are in the quiet period, but a quiet period of indefinite length is turning out to be bad behavior -- the correct assumption is that we're sharing the cloud resource with other users and new resources could become available at any time without us having to do anything.
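
For illustration, a minimal sketch of the exponential backoff idea (names and structure are hypothetical, not the dispatcher's actual code):

    // Hypothetical sketch of an exponential backoff quiet period.
    package backoffsketch

    import "time"

    const (
        quietPeriodMin = 15 * time.Second
        quietPeriodMax = 10 * time.Minute
    )

    type quietPeriod struct {
        interval time.Duration // current backoff interval
        until    time.Time     // don't try to create instances before this
    }

    // Fail records another quota/capacity error and doubles the quiet
    // period, capped at quietPeriodMax.
    func (q *quietPeriod) Fail(now time.Time) {
        switch {
        case q.interval == 0:
            q.interval = quietPeriodMin
        case q.interval*2 > quietPeriodMax:
            q.interval = quietPeriodMax
        default:
            q.interval *= 2
        }
        q.until = now.Add(q.interval)
    }

    // OK reports whether we may try creating an instance again. An
    // instance shutdown could also clear q.until early, as suggested above.
    func (q *quietPeriod) OK(now time.Time) bool { return !now.Before(q.until) }

    // Success resets the backoff after an instance creation succeeds.
    func (q *quietPeriod) Success() { q.interval = 0; q.until = time.Time{} }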

Actions #9

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2023-10-11 sprint to Development 2023-10-25 sprint
Actions #10

Updated by Tom Clegg 6 months ago

  • Status changed from New to In Progress

20984-instance-capacity @ 528c6dfb58318bca6deefc3f1097122ddd735203 -- developer-run-tests: #3865

wb1 retry developer-run-tests-apps-workbench-integration: #4180

  • All agreed upon points are implemented / addressed.
    • ✅ "Do not treat InsufficientInstanceCapacity as quota error" is addressed. Related quota behavior mentioned in #note-8 is not addressed.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • ✅ New test added. No manual testing.
  • Documentation has been updated.
    • N/A
  • Behaves appropriately at the intended scale (describe intended scale).
    • No scaling issues anticipated
  • Considered backwards and forwards compatibility issues between client and server.
    • N/A, does not affect interaction with other components
  • Follows our coding standards and GUI style guidelines.
The branch adds a CapacityError interface type which the ec2 driver uses to convey whether the error is
  • a provider capacity issue, and
  • specific to the requested instance type.
After receiving such an error, the worker pool sets an "at capacity" flag for 1 minute. While that flag is set, the worker pool
  • does not attempt to create new instances [of the affected type], and
  • returns true when the scheduler calls AtCapacity() [with the affected type].

Before locking a queued container or creating a new instance, the scheduler checks AtCapacity() for the relevant instance type. If true, it skips that container.
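
Roughly, the shapes involved look like this (a sketch for illustration; see the branch for the real definitions and method names):

    // Sketch of the shapes involved; see the branch for the real code.
    package capacitysketch

    import "time"

    // CapacityError is what the ec2 driver returns when instance creation
    // fails for capacity (rather than quota) reasons.
    type CapacityError interface {
        error
        // IsCapacityError reports whether the provider was out of capacity.
        IsCapacityError() bool
        // IsInstanceTypeSpecific reports whether the shortage applies only
        // to the requested instance type.
        IsInstanceTypeSpecific() bool
    }

    // Worker pool side: remember capacity errors for one minute.
    type pool struct {
        atCapacityUntil map[string]time.Time // instance type -> expiry ("" = all types)
    }

    func (wp *pool) handleCreateError(instanceType string, err error) {
        if ce, ok := err.(CapacityError); ok && ce.IsCapacityError() {
            if wp.atCapacityUntil == nil {
                wp.atCapacityUntil = map[string]time.Time{}
            }
            key := instanceType
            if !ce.IsInstanceTypeSpecific() {
                key = "" // shortage applies to every instance type
            }
            wp.atCapacityUntil[key] = time.Now().Add(time.Minute)
        }
    }

    // AtCapacity is what the scheduler calls before locking a queued
    // container or creating an instance of the given type.
    func (wp *pool) AtCapacity(instanceType string) bool {
        for _, key := range []string{"", instanceType} {
            if t, ok := wp.atCapacityUntil[key]; ok && time.Now().Before(t) {
                return true
            }
        }
        return false
    }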

The differences between InsufficientInstanceCapacity and quota error handling are
  • When we're at quota and we shut down an existing instance of any type, we stop assuming we're still at quota -- but this does not affect our cached AtCapacity state. (We don't even bother resetting the cached AtCapacity state after shutting down an instance of the affected type -- if we need it, we wouldn't have shut it down ourselves, and if it shut down without us asking, then it's probably not returning to the pool of available instances right away.)
  • When we see InsufficientInstanceCapacity we continue to run lower-priority containers that use different instance types (i.e., containers that need unavailable instance types don't starve out containers that need available instance types). But when we hit a quota error, we don't try to run lower-priority containers, because running instance type A for a lower-priority container can actually prevent us from creating instance type B for a higher-priority container (i.e., many lower-priority containers that each consume small amounts of quota don't starve out higher-priority containers that need a large amount of quota all at once).
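
In scheduler terms, the difference described in these two points looks roughly like this (a pseudocode-level sketch, not the actual scheduler code):

    // Sketch of the scheduling difference; not the actual scheduler code.
    package schedsketch

    type container struct {
        instanceType string // instance type chosen for this container
    }

    type workerPool interface {
        AtQuota() bool
        AtCapacity(instanceType string) bool
        Create(instanceType string) bool
    }

    // runQueue walks the container queue in priority order.
    func runQueue(sorted []container, wp workerPool) {
        for _, ctr := range sorted {
            if wp.AtQuota() {
                // A quota error stops everything: creating a small instance
                // for a lower-priority container could consume quota that a
                // higher-priority container needs for a larger instance.
                break
            }
            if wp.AtCapacity(ctr.instanceType) {
                // A capacity error only affects this instance type: skip
                // this container, but keep going so containers that can use
                // other (available) instance types are not starved.
                continue
            }
            wp.Create(ctr.instanceType)
        }
    }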
Actions #11

Updated by Peter Amstutz 6 months ago

Tom Clegg wrote in #note-10:

When we see InsufficientInstanceCapacity we continue to run lower-priority containers that use different instance types (i.e., containers that need unavailable instance types don't starve out containers that need available instance types). But when we hit a quota error, we don't try to run lower-priority containers, because running instance type A for a lower-priority container can actually prevent us from creating instance type B for a higher-priority container (i.e., many lower-priority containers that each consume small amounts of quota don't starve out higher-priority containers that need a large amount of quota all at once).

Since we're making an effort to do this, perhaps we should support the concept of instance families.

For example, let's say we want an m4.10xlarge, but we can't get it. But we can get an m4.2xlarge. This may create the situation you're describing here, where allocating smaller instances for lower-priority containers in a given family interferes with the ability to get the larger instance type needed by the higher-priority container. If we have instance type families, then instead of specifically being at capacity for m4.10xlarge, we would be at capacity across all m4 instance types -- but we could still start a c4 or a t2 instance.

I propose adding a "Family" field to "InstanceType". If "Family" is non-empty then we use that for capacity logic, otherwise we fall back on "ProviderType".
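
A sketch of what that could look like (hypothetical; InstanceType has no Family field today):

    // Hypothetical sketch of the proposal.
    package familysketch

    type InstanceType struct {
        Name         string
        ProviderType string // e.g. "m4.10xlarge"
        Family       string // e.g. "m4"; optional
    }

    // capacityKey is the key used to track "at capacity" state: the whole
    // family if one is configured, otherwise the specific provider type.
    func capacityKey(it InstanceType) string {
        if it.Family != "" {
            return it.Family
        }
        return it.ProviderType
    }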

Actions #12

Updated by Tom Clegg 6 months ago

https://docs.aws.amazon.com/general/latest/gr/ec2-service.html suggests that, at least for AWS, we could figure out programmatically which instance types a quota error applies to based on ProviderType, without adding a config (there is one quota for standard (A, C, D, H, I, M, R, T, Z) instances, a separate quota for F instances, ...). So, apparently m4 and c2 share one quota, while x1 and x1e share a separate quota.
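
For example, something along these lines could derive the quota class from ProviderType without any new config (a sketch that only covers the grouping described above; real quota classes have more cases):

    // Sketch: derive an AWS on-demand quota class from the provider type name.
    package quotasketch

    import "strings"

    // Families whose first letter shares the "standard" on-demand quota.
    const standardFamilies = "acdhimrtz"

    func quotaClass(providerType string) string {
        family, _, _ := strings.Cut(providerType, ".") // "m4.10xlarge" -> "m4"
        if family == "" {
            return ""
        }
        first := strings.ToLower(family[:1])
        if strings.Contains(standardFamilies, first) {
            return "standard" // A, C, D, H, I, M, R, T, Z share one quota
        }
        return first // e.g. "f", "g", "p", "x" each have their own quota
    }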

Tracking separate quotas would certainly help anyone who uses non-standard instance types.

Meanwhile... should I merge the InsufficientInstanceCapacity fix?

Actions #13

Updated by Tom Clegg 6 months ago

I'm not sure I interpreted #note-11 correctly. If you really mean capacity errors, I don't think creating m4.2xlarge instances ever prevents the provider from finding more supply of m4.10xlarge instances -- does it?

The premise of this bugfix is that, given a higher-priority m4.10xlarge container and a lower-priority m4.2xlarge container,
  • if we fail to create an m4.10xlarge instance because InstanceLimitExceeded (quota), then we don't want to create an m4.2xlarge instance (existing behavior)
  • if we fail to create an m4.10xlarge instance because InsufficientInstanceCapacity, then we do want to create an m4.2xlarge instance (changed behavior)
Actions #14

Updated by Peter Amstutz 6 months ago

From our discussion, I don't think my interpretation of machine families is necessarily right or desirable, so I retract the suggestion.

However, one additional consideration -- if we are hitting capacity errors, then we probably want to adjust the dynamic maxConcurrency (which affects maxSupervisors).

Actions #15

Updated by Tom Clegg 6 months ago

Peter Amstutz wrote in #note-14:

However, one additional consideration -- if we are hitting capacity errors, then we probably want to adjust the dynamic maxConcurrency (which affects maxSupervisors).

Can you elaborate? I'm not sure what situation you have in mind here.

Actions #16

Updated by Peter Amstutz 6 months ago

Tom Clegg wrote in #note-15:

Peter Amstutz wrote in #note-14:

However, one additional consideration -- if we are hitting capacity errors, then we probably want to adjust the dynamic maxConcurrency (which affects maxSupervisors).

Can you elaborate? I'm not sure what situation you have in mind here.

  1. maxSupervisors is a fraction of maxConcurrency
  2. if we over-estimate actual capacity, then we can end up starting too many supervisors, which can starve out workers
  3. if we never reduce maxConcurrency in response to capacity errors, then we'll continue to over-estimate maxSupervisors

So I feel like if we're in a situation where every instance type we want is getting a capacity error, it would make sense to reset maxConcurrency/maxSupervisors and then probe upward 10% at a time.
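
A rough sketch of that idea (hypothetical names; not the dispatcher's actual code):

    // Hypothetical sketch of the proposed response to cluster-wide
    // capacity errors.
    package concurrencysketch

    type scheduler struct {
        maxConcurrency     int
        supervisorFraction float64 // e.g. 0.5
        running            int     // containers currently running
    }

    // maxSupervisors is derived from maxConcurrency, so an over-estimated
    // maxConcurrency also over-estimates how many supervisors we may start.
    func (sch *scheduler) maxSupervisors() int {
        return int(float64(sch.maxConcurrency) * sch.supervisorFraction)
    }

    // onAllTypesAtCapacity resets the concurrency estimate to what is
    // actually running; probing then grows it again ~10% at a time.
    func (sch *scheduler) onAllTypesAtCapacity() {
        sch.maxConcurrency = sch.running
    }

    // probeUpward allows roughly 10% more than what is running now.
    func (sch *scheduler) probeUpward() {
        sch.maxConcurrency = sch.running + sch.running/10 + 1
    }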

Actions #17

Updated by Tom Clegg 6 months ago

OK, I think I see what you're saying. In a case where supervisors and their children use the same instance type, we could fill up all of the available capacity with supervisor containers, and then effectively deadlock until the provider increases their capacity / other cloud users free up some capacity.

First, I don't think this branch makes our handling of this situation any worse. As before, with SupervisorFraction=0.5, maxConcurrency might get up to 100 when many different instance types are running, so we might start 50 supervisors without anticipating that the 50 t1.rare instances the supervisors need are the only ones that exist and that the supervisors' children also all need to run on t1.rare. I mention this because, if this bugfix branch indeed isn't creating any new problems, I think we should merge it while we consider other related improvements.

In principle I think we could address that scenario (somewhat) by tracking the last known capacity limit for each instance type, and enforcing SupervisorFraction as a fraction of the last-known-capacity figure, in addition to enforcing it as a fraction of maxConcurrency. However, I'm not sure we'll get great returns from that, because
  • I suspect supervisors typically don't use the same instance type as their children
  • I suspect supervisors typically don't run on rare instance types
  • In general we have no idea which instance type(s) a supervisor will need for its children, so we're unlikely to win by guessing

I think we'd be better off thinking of this as a "shouldn't be so fixated on matching each container to exactly one instance type" problem rather than a "shouldn't risk using an instance we might need for something else" problem. We could work toward scheduling multiple containers per instance, so that we can run whichever 2-, 4-, 8-, or 16-supervisor-sized instance types happen to be available, and typically get much higher throughput.

Actions #18

Updated by Peter Amstutz 6 months ago

I think I found one misunderstanding. For some reason I thought it was adjusting maxConcurrency as maxConcurrency + 10%, but it is actually based on len(running) + 10%.

So I think this is fine. To minimize weird starvation conditions, we should also recommend setting InitialQuotaEstimate to something like 50% of MaxInstances.

When I changed it from 8 to 0, I think that was because 8 was a really slow start (e.g. it only adds 1 instance at a time until it gets to 20 running instances). So an InitialQuotaEstimate of 32 with a note that you should change it together with MaxInstances would probably be a good idea.

I am also looking forward to #20978, which I think will help a lot by making it more flexible in dealing with capacity issues.

20984-instance-capacity LGTM.

Actions #19

Updated by Tom Clegg 6 months ago

  • Status changed from In Progress to Resolved