Feature #20978
closedSupport multiple candidate instance types to assign containers
Added by Peter Amstutz about 1 year ago. Updated 12 months ago.
Description
Arvados InstanceTypes list will include all instance types across all availablity zones that we might possibly want to use.
Containers have a set of candidate instance types instead of a single one, this is based on finding the cheapest instance type and any instance types that are within 1.5x (configurable?) cost of the cheapest type. This list of candidates will be sorted in ascending order by cost.
Launching a container will first check if there is an idle instance of any of the candidate instance types.
If none are available, it will next try to start a new instance. It will start with the cheapest and try to launch it in each configured subnet. If it is not available in any subnet, then try the next candidate instance type and so on.
If there is an error that states an instance type is unknown / never available (which is distinct from a "not available due to capacity" error) we should cache the result. Alternately, we can use the AWS API to find which instance types are available in a given subnet/availability zone and use that. Or we can skip this optimization entirely.
Example command line call to get instance types in a given availability zone:
aws ec2 describe-instance-type-offerings --filters "Name=location,Values=us-east-1e" --location-type availability-zone --region us-east-1
Which presumably corresponds to this:
https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTypeOfferings.html
Related issues
Updated by Peter Amstutz about 1 year ago
- Related to Idea #20599: Scaling to 1000s of concurrent containers added
Updated by Peter Amstutz about 1 year ago
- Related to Idea #18179: Better spot instance support added
Updated by Peter Amstutz about 1 year ago
- Target version changed from Future to To be scheduled
- Description updated (diff)
Updated by Peter Amstutz about 1 year ago
- Target version changed from To be scheduled to Development 2023-10-25 sprint
Updated by Peter Amstutz about 1 year ago
- Subject changed from Support multiple cloud availability zones with different available machine types to Support multiple candidate instance types to assign containers
Updated by Peter Amstutz about 1 year ago
- Related to Feature #20983: Tool to automatically populate Arvados InstanceTypes added
Updated by Peter Amstutz about 1 year ago
- Target version changed from Development 2023-10-25 sprint to Development 2023-11-08 sprint
Updated by Tom Clegg about 1 year ago
20978-instance-types @ 1028c0630dac2a2bff363da1390bbf942e7fe7ae -- developer-run-tests: #3883
- All agreed upon points are implemented / addressed.
- ✅ configurable via MaximumPriceFactor (default 1.5)
- ✅ Launching a container will first check if there is an idle instance of any of the candidate instance types
- ✅ start with the cheapest and try to launch it in each configured subnet. If it is not available in any subnet, then try the next candidate instance type and so on
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- ☐ Cache non-capacity errors like "invalid instance type" (returning this as a capacity error would cause dispatcher to DTRT, mostly -- except that there's no need to retry a minute later)
- ☐ Use AWS API to determine valid instance types in each zone (perhaps this is less compelling if we handle "invalid instance type" well)
- Code is tested and passing, both automated and manual, what manual testing was done is described
- ✅ Unit tests updated
- ✅ Simulation test
lib/dispatchcloud/dispatcher_test.go
now has one instance type (6) that always returns a capacity error, so some containers rely on this feature to complete (see example logs below)
- Documentation has been updated.
- ✅ MaximumPriceFactor added with comments
- ☐ This probably deserves an upgrading note, since the new default behavior is to try spending more money instead of waiting for the ideal instance type to be available
- Behaves appropriately at the intended scale (describe intended scale).
- I don't think this will affect the scale a-d-c can support
- With a sufficiently heterogeneous workload in terms of runtime constraints, this will tend to keep larger instances alive rather than swap them out for smaller ones. If the container lifetimes are short, this can be a net win by reducing boot/shutdown overhead
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- ✅
If MaximumPriceFactor is 1.0 and there are multiple suitable instance types with identical prices, a-d-c will try all of those instance types, whereas before it would only try the one with more RAM.
If MaximumPriceFactor < 1.0 then a-d-c behaves as if it was 1.0.
Some selected lines from lib/dispatchcloud test, in which providertype6 always returns a capacity error and MaximumPriceFactor is 1.5:
time="2023-10-31T17:04:02.212487889-04:00" level=info msg="creating new instance" ContainerUUID=zzzzz-dz642-000000000000078 InstanceType="{type6 providertype6 6 6442450944 0 0 0 0.738 false { 0}}" test: returning capacity error for instance type providertype6 time="2023-10-31T17:04:02.293143335-04:00" level=info msg="creating new instance" ContainerUUID=zzzzz-dz642-000000000000078 InstanceType="{type8 providertype8 8 8589934592 0 0 0 0.984 false { 0}}" time="2023-10-31T17:04:02.490912133-04:00" level=info msg="[test] starting crunch-run stub" ContainerUUID=zzzzz-dz642-000000000000078 Instance="inst10,providertype8" PID=5
Updated by Peter Amstutz about 1 year ago
20978-instance-types @ 1028c0630dac2a2bff363da1390bbf942e7fe7ae
In node_size.go:181
the VCPUs check is duplicated.
If we aren't going to call DescribeInstanceTypeOfferings then we need to handle the "Unsupported" error and/or look for the words "Availability Zone" in the error message:
$ aws --profile playground ec2 run-instances --image-id ami-01cbd6fb77f29928f --count 1 --instance-type t3.micro --subnet-id subnet-01e73d84dd51c3423 --region us-east-1
An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (t3.micro) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f.
The way it is currently set up, we may want "Unsupported" to be considered an instance type specific capacity error.
I'm thinking of a situation where the first subnet returns "InsufficientInstanceCapacity" and the second subnet returns "Unsupported", currently we return the last error, which would be "Unsupported".
Alternately, if every create attempt returns an error in ec2InstanceSet.Create, and at least one of them is "InsufficientInstanceCapacity", that would take precedence. My comment about treating "Unsupported" as subnet specific still applies.
Although, it's also possible we could get "Unsupported" for every error (because there's a typo in the configuration). In that case, instead of being stuck, we still want to fall back to trying a different candidate instance type -- which we get with the "AtCapacity" behavior.
run_queue.go:283
if dontstart[unallocType] { // We already tried & failed to start // a higher-priority container on the // same instance type. Don't let this // one sneak in ahead of it. } else if sch.pool.KillContainer(ctr.UUID, "about to start") { logger.Info("not restarting yet: crunch-run process from previous attempt has not exited") } else if sch.pool.StartContainer(unallocType, ctr) { logger.Trace("StartContainer => true")
We don't check unallocOK, so could unallocType
be nil
here?
Updated by Tom Clegg about 1 year ago
Peter Amstutz wrote in #note-18:
In
node_size.go:181
the VCPUs check is duplicated.
Oops, that was supposed to be a fallback to Scratch. Fixed.
If we aren't going to call DescribeInstanceTypeOfferings then we need to handle the "Unsupported" error and/or look for the words "Availability Zone" in the error message:
$ aws --profile playground ec2 run-instances --image-id ami-01cbd6fb77f29928f --count 1 --instance-type t3.micro --subnet-id subnet-01e73d84dd51c3423 --region us-east-1
An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (t3.micro) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f.
I updated the ec2 driver to accept code "Unsupported" with a message containing "requested instance type" as an instance-type-specific capacity error. (That should be at least as reliable an indicator as "Availability Zone", right?)
This way "Unsupported" is essentially treated as "always insufficient capacity", which is pretty close.
Do you think we should do "cache that error / never retry that instance type in same subnet" now as well?
The way it is currently set up, we may want "Unsupported" to be considered an instance type specific capacity error.
I'm thinking of a situation where the first subnet returns "InsufficientInstanceCapacity" and the second subnet returns "Unsupported", currently we return the last error, which would be "Unsupported".
I've added "Unsupported" as a subnet-specific error code (regardless of details in message) so we'll try the next subnet even if the message doesn't say anything about instance types.
Alternately, if every create attempt returns an error in ec2InstanceSet.Create, and at least one of them is "InsufficientInstanceCapacity", that would take precedence. My comment about treating "Unsupported" as subnet specific still applies.
The ec2 Create func now returns a capacity error if any of the subnets return a capacity error. This way, if we have one subnet returning "InsufficientFreeAddressesInSubnet" and another returning "InsufficientInstanceCapacity", the scheduler will try other instance types.
We don't check unallocOK, so could
unallocType
benil
here?
It can't be the zero value -- all cases with unallocOK==false already did continue or break above -- but yeah, this code block has gotten a little hard to follow. I've rearranged it a bit to make it (hopefully) clearer.
20978-instance-types @ a73295e64f58fe027b995e0cca3d103d4e2289ff -- developer-run-tests: #3885
- All agreed upon points are implemented / addressed.
- ✅ configurable via MaximumPriceFactor (default 1.5)
- ✅ Launching a container will first check if there is an idle instance of any of the candidate instance types
- ✅ start with the cheapest and try to launch it in each configured subnet. If it is not available in any subnet, then try the next candidate instance type and so on
- ✅ return "unsupported instance type" as a capacity error to cause dispatcher to DTRT, mostly -- except that there's no need to retry a minute later
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- ☐ Cache permanent capacity errors (unsupported instance type) until restart
- ☐ Use AWS API to determine valid instance types in each zone (perhaps this is less compelling if we handle "invalid instance type" well)
- Code is tested and passing, both automated and manual, what manual testing was done is described
- ✅ Unit tests updated
- ✅ Simulation test
lib/dispatchcloud/dispatcher_test.go
now has one instance type (6) that always returns a capacity error, so some containers rely on this feature to complete (see example logs below)
- Documentation has been updated.
- ✅ MaximumPriceFactor added with comments
- ✅ This probably deserves an upgrading note, since the new default behavior is to try spending more money instead of waiting for the ideal instance type to be available
- Behaves appropriately at the intended scale (describe intended scale).
- I don't think this will affect the scale a-d-c can support
- With a sufficiently heterogeneous workload in terms of runtime constraints, this will tend to keep larger instances alive rather than swap them out for smaller ones. If the container lifetimes are short, this can be a net win by reducing boot/shutdown overhead
- Considered backwards and forwards compatibility issues between client and server.
- N/A
- Follows our coding standards and GUI style guidelines.
- ✅
With main merged (to avoid conflicts with other upgrade notes) and upgrade note added:
20978-instance-types @ 0e9f41a15027a66c69ef351846b213b2e2075642 -- developer-run-tests: #3886
With missing exported config entry fixed:
20978-instance-types @ 889be1993b1c270b361f3502f33ead1dae95c9a2 -- developer-run-tests: #3887
Updated by Peter Amstutz 12 months ago
This LGTM.
Although, as a stretch goal, I do think "Cache permanent capacity errors (unsupported instance type) until restart" could help avoid logging a lot of spurious failures.
Updated by Tom Clegg 12 months ago
- Related to Feature #21175: Do not retry after "unsupported instance type" EC2 errors added
Updated by Tom Clegg 12 months ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|ac312e0acae4fd114114081c9f4791d05e640831.
Updated by Brett Smith 11 months ago
- Related to Feature #16316: a-c-r handles resource range requests (especially CPU) and adjusts requests based on what is in InstanceTypes list added