Project

General

Profile

Actions

Feature #20978

closed

Support multiple candidate instance types to assign containers

Added by Peter Amstutz 5 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Start date:
10/31/2023
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0
Release relationship:
Auto

Description

Arvados InstanceTypes list will include all instance types across all availablity zones that we might possibly want to use.

Containers have a set of candidate instance types instead of a single one, this is based on finding the cheapest instance type and any instance types that are within 1.5x (configurable?) cost of the cheapest type. This list of candidates will be sorted in ascending order by cost.

Launching a container will first check if there is an idle instance of any of the candidate instance types.

If none are available, it will next try to start a new instance. It will start with the cheapest and try to launch it in each configured subnet. If it is not available in any subnet, then try the next candidate instance type and so on.

If there is an error that states an instance type is unknown / never available (which is distinct from a "not available due to capacity" error) we should cache the result. Alternately, we can use the AWS API to find which instance types are available in a given subnet/availability zone and use that. Or we can skip this optimization entirely.

Example command line call to get instance types in a given availability zone:

aws ec2 describe-instance-type-offerings --filters "Name=location,Values=us-east-1e" --location-type availability-zone --region us-east-1

Which presumably corresponds to this:

https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTypeOfferings.html


Subtasks 1 (1 open0 closed)

Task #21066: Review 20978-instance-typesIn ProgressTom Clegg10/31/2023

Actions

Related issues

Related to Arvados Epics - Story #20599: Scaling to 1000s of concurrent containersResolved06/01/202303/31/2024

Actions
Related to Arvados Epics - Story #18179: Better spot instance supportIn Progress03/01/202203/31/2024

Actions
Related to Arvados - Feature #20983: Tool to automatically populate Arvados InstanceTypesNewBrett Smith

Actions
Related to Arvados - Feature #21175: Do not retry after "unsupported instance type" EC2 errorsNew

Actions
Related to Arvados - Feature #16316: a-c-r handles resource range requests (especially CPU) and adjusts requests based on what is in InstanceTypes listIn ProgressAlex Coleman

Actions
Actions #1

Updated by Peter Amstutz 5 months ago

  • Related to Story #20599: Scaling to 1000s of concurrent containers added
Actions #2

Updated by Peter Amstutz 5 months ago

  • Related to Story #18179: Better spot instance support added
Actions #3

Updated by Peter Amstutz 5 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 5 months ago

  • Target version changed from To be groomed to To be scheduled
  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 5 months ago

  • Description updated (diff)
Actions #6

Updated by Peter Amstutz 5 months ago

  • Target version changed from To be scheduled to Development 2023-10-25 sprint
Actions #7

Updated by Peter Amstutz 5 months ago

  • Subject changed from Support multiple cloud availability zones with different available machine types to Support multiple candidate instance types to assign containers
Actions #8

Updated by Peter Amstutz 5 months ago

  • Description updated (diff)
Actions #9

Updated by Peter Amstutz 5 months ago

  • Story points set to 5.0
Actions #10

Updated by Peter Amstutz 5 months ago

  • Story points changed from 5.0 to 3.0
Actions #11

Updated by Peter Amstutz 5 months ago

  • Related to Feature #20983: Tool to automatically populate Arvados InstanceTypes added
Actions #12

Updated by Peter Amstutz 5 months ago

  • Assigned To set to Peter Amstutz
Actions #13

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2023-10-25 sprint to Development 2023-11-08 sprint
Actions #14

Updated by Peter Amstutz 4 months ago

  • Assigned To deleted (Peter Amstutz)
Actions #15

Updated by Peter Amstutz 4 months ago

  • Assigned To set to Tom Clegg
Actions #16

Updated by Tom Clegg 4 months ago

20978-instance-types @ 1028c0630dac2a2bff363da1390bbf942e7fe7ae -- developer-run-tests: #3883

  • All agreed upon points are implemented / addressed.
    • ✅ configurable via MaximumPriceFactor (default 1.5)
    • ✅ Launching a container will first check if there is an idle instance of any of the candidate instance types
    • ✅ start with the cheapest and try to launch it in each configured subnet. If it is not available in any subnet, then try the next candidate instance type and so on
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • ☐ Cache non-capacity errors like "invalid instance type" (returning this as a capacity error would cause dispatcher to DTRT, mostly -- except that there's no need to retry a minute later)
    • ☐ Use AWS API to determine valid instance types in each zone (perhaps this is less compelling if we handle "invalid instance type" well)
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • ✅ Unit tests updated
    • ✅ Simulation test lib/dispatchcloud/dispatcher_test.go now has one instance type (6) that always returns a capacity error, so some containers rely on this feature to complete (see example logs below)
  • Documentation has been updated.
    • ✅ MaximumPriceFactor added with comments
    • ☐ This probably deserves an upgrading note, since the new default behavior is to try spending more money instead of waiting for the ideal instance type to be available
  • Behaves appropriately at the intended scale (describe intended scale).
    • I don't think this will affect the scale a-d-c can support
    • With a sufficiently heterogeneous workload in terms of runtime constraints, this will tend to keep larger instances alive rather than swap them out for smaller ones. If the container lifetimes are short, this can be a net win by reducing boot/shutdown overhead
  • Considered backwards and forwards compatibility issues between client and server.
    • N/A
  • Follows our coding standards and GUI style guidelines.

If MaximumPriceFactor is 1.0 and there are multiple suitable instance types with identical prices, a-d-c will try all of those instance types, whereas before it would only try the one with more RAM.

If MaximumPriceFactor < 1.0 then a-d-c behaves as if it was 1.0.

Some selected lines from lib/dispatchcloud test, in which providertype6 always returns a capacity error and MaximumPriceFactor is 1.5:

time="2023-10-31T17:04:02.212487889-04:00" level=info msg="creating new instance" ContainerUUID=zzzzz-dz642-000000000000078 InstanceType="{type6 providertype6 6 6442450944 0 0 0 0.738 false {  0}}" 
test: returning capacity error for instance type providertype6
time="2023-10-31T17:04:02.293143335-04:00" level=info msg="creating new instance" ContainerUUID=zzzzz-dz642-000000000000078 InstanceType="{type8 providertype8 8 8589934592 0 0 0 0.984 false {  0}}" 
time="2023-10-31T17:04:02.490912133-04:00" level=info msg="[test] starting crunch-run stub" ContainerUUID=zzzzz-dz642-000000000000078 Instance="inst10,providertype8" PID=5
Actions #17

Updated by Tom Clegg 4 months ago

  • Status changed from New to In Progress
Actions #18

Updated by Peter Amstutz 4 months ago

20978-instance-types @ 1028c0630dac2a2bff363da1390bbf942e7fe7ae

In node_size.go:181 the VCPUs check is duplicated.

If we aren't going to call DescribeInstanceTypeOfferings then we need to handle the "Unsupported" error and/or look for the words "Availability Zone" in the error message:

$ aws --profile playground ec2 run-instances --image-id ami-01cbd6fb77f29928f --count 1 --instance-type t3.micro --subnet-id subnet-01e73d84dd51c3423 --region us-east-1

An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (t3.micro) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f.

The way it is currently set up, we may want "Unsupported" to be considered an instance type specific capacity error.

I'm thinking of a situation where the first subnet returns "InsufficientInstanceCapacity" and the second subnet returns "Unsupported", currently we return the last error, which would be "Unsupported".

Alternately, if every create attempt returns an error in ec2InstanceSet.Create, and at least one of them is "InsufficientInstanceCapacity", that would take precedence. My comment about treating "Unsupported" as subnet specific still applies.

Although, it's also possible we could get "Unsupported" for every error (because there's a typo in the configuration). In that case, instead of being stuck, we still want to fall back to trying a different candidate instance type -- which we get with the "AtCapacity" behavior.

run_queue.go:283

            if dontstart[unallocType] {
                // We already tried & failed to start
                // a higher-priority container on the
                // same instance type. Don't let this
                // one sneak in ahead of it.
            } else if sch.pool.KillContainer(ctr.UUID, "about to start") {
                logger.Info("not restarting yet: crunch-run process from previous attempt has not exited")
            } else if sch.pool.StartContainer(unallocType, ctr) {
                logger.Trace("StartContainer => true")

We don't check unallocOK, so could unallocType be nil here?

Actions #19

Updated by Tom Clegg 4 months ago

Peter Amstutz wrote in #note-18:

In node_size.go:181 the VCPUs check is duplicated.

Oops, that was supposed to be a fallback to Scratch. Fixed.

If we aren't going to call DescribeInstanceTypeOfferings then we need to handle the "Unsupported" error and/or look for the words "Availability Zone" in the error message:

$ aws --profile playground ec2 run-instances --image-id ami-01cbd6fb77f29928f --count 1 --instance-type t3.micro --subnet-id subnet-01e73d84dd51c3423 --region us-east-1

An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (t3.micro) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f.

I updated the ec2 driver to accept code "Unsupported" with a message containing "requested instance type" as an instance-type-specific capacity error. (That should be at least as reliable an indicator as "Availability Zone", right?)

This way "Unsupported" is essentially treated as "always insufficient capacity", which is pretty close.

Do you think we should do "cache that error / never retry that instance type in same subnet" now as well?

The way it is currently set up, we may want "Unsupported" to be considered an instance type specific capacity error.

I'm thinking of a situation where the first subnet returns "InsufficientInstanceCapacity" and the second subnet returns "Unsupported", currently we return the last error, which would be "Unsupported".

I've added "Unsupported" as a subnet-specific error code (regardless of details in message) so we'll try the next subnet even if the message doesn't say anything about instance types.

Alternately, if every create attempt returns an error in ec2InstanceSet.Create, and at least one of them is "InsufficientInstanceCapacity", that would take precedence. My comment about treating "Unsupported" as subnet specific still applies.

The ec2 Create func now returns a capacity error if any of the subnets return a capacity error. This way, if we have one subnet returning "InsufficientFreeAddressesInSubnet" and another returning "InsufficientInstanceCapacity", the scheduler will try other instance types.

We don't check unallocOK, so could unallocType be nil here?

It can't be the zero value -- all cases with unallocOK==false already did continue or break above -- but yeah, this code block has gotten a little hard to follow. I've rearranged it a bit to make it (hopefully) clearer.

20978-instance-types @ a73295e64f58fe027b995e0cca3d103d4e2289ff -- developer-run-tests: #3885

  • All agreed upon points are implemented / addressed.
    • ✅ configurable via MaximumPriceFactor (default 1.5)
    • ✅ Launching a container will first check if there is an idle instance of any of the candidate instance types
    • ✅ start with the cheapest and try to launch it in each configured subnet. If it is not available in any subnet, then try the next candidate instance type and so on
    • ✅ return "unsupported instance type" as a capacity error to cause dispatcher to DTRT, mostly -- except that there's no need to retry a minute later
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • ☐ Cache permanent capacity errors (unsupported instance type) until restart
    • ☐ Use AWS API to determine valid instance types in each zone (perhaps this is less compelling if we handle "invalid instance type" well)
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • ✅ Unit tests updated
    • ✅ Simulation test lib/dispatchcloud/dispatcher_test.go now has one instance type (6) that always returns a capacity error, so some containers rely on this feature to complete (see example logs below)
  • Documentation has been updated.
    • ✅ MaximumPriceFactor added with comments
    • ✅ This probably deserves an upgrading note, since the new default behavior is to try spending more money instead of waiting for the ideal instance type to be available
  • Behaves appropriately at the intended scale (describe intended scale).
    • I don't think this will affect the scale a-d-c can support
    • With a sufficiently heterogeneous workload in terms of runtime constraints, this will tend to keep larger instances alive rather than swap them out for smaller ones. If the container lifetimes are short, this can be a net win by reducing boot/shutdown overhead
  • Considered backwards and forwards compatibility issues between client and server.
    • N/A
  • Follows our coding standards and GUI style guidelines.

With main merged (to avoid conflicts with other upgrade notes) and upgrade note added:

20978-instance-types @ 0e9f41a15027a66c69ef351846b213b2e2075642 -- developer-run-tests: #3886

With missing exported config entry fixed:

20978-instance-types @ 889be1993b1c270b361f3502f33ead1dae95c9a2 -- developer-run-tests: #3887

Actions #20

Updated by Peter Amstutz 4 months ago

This LGTM.

Although, as a stretch goal, I do think "Cache permanent capacity errors (unsupported instance type) until restart" could help avoid logging a lot of spurious failures.

Actions #21

Updated by Peter Amstutz 4 months ago

  • Release set to 67
Actions #22

Updated by Tom Clegg 4 months ago

  • Related to Feature #21175: Do not retry after "unsupported instance type" EC2 errors added
Actions #23

Updated by Tom Clegg 4 months ago

Added #21175 for caching permanent capacity errors, and cancelling containers that will never run due to unsupported node types / bad config.

Actions #24

Updated by Tom Clegg 4 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #25

Updated by Brett Smith 3 months ago

  • Related to Feature #16316: a-c-r handles resource range requests (especially CPU) and adjusts requests based on what is in InstanceTypes list added
Actions

Also available in: Atom PDF