Feature #20755

closed

Support multiple subnets in arvados-dispatch-cloud

Added by Peter Amstutz 10 months ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release relationship: Auto

Description

Customer has two "burst" subnets in two different availability zones. We would like to be able to use both subnets. The options to implement this seem to be either:

  • We run two instances of arvados-dispatch-cloud, each one configured for a different subnet
  • A single instance of arvados-dispatch-cloud (or the AWS driver specifically) can be configured with multiple subnets, and round-robin balances launching instances on each subnet

Subtasks 1 (0 open, 1 closed)

Task #20814: Review 20755-ec2-multiple-subnets | Resolved | Lucas Di Pentima | 08/04/2023
Actions #1

Updated by Peter Amstutz 10 months ago

  • Description updated
Actions #2

Updated by Tom Clegg 9 months ago

I think it would work to use a single a-d-c process, but just have it create new instances in subnet#1 until it gets an error that looks like "subnet full", at which point it retries in subnet#2, then subnet#3, etc. If one of the subnets succeeds, that subnet is the one a-d-c starts with on the next create attempt.

For the sake of compatibility we would want to accept both of these spellings in the config:

    DriverParameters:
      SubnetID: abc123

or

    DriverParameters:
      SubnetID: [abc123, def456]
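
One way to accept both spellings is to normalize the parsed value to a list before anything else uses it. The sketch below is illustrative Python, not the a-d-c code; the function name normalize_subnet_ids is hypothetical, and treating a missing or empty value as "no subnet specified" is an assumption.

```python
from typing import List, Union


def normalize_subnet_ids(value: Union[None, str, List[str]]) -> List[str]:
    """Return the configured SubnetID value as a list of subnet IDs.

    Accepts a single string ("abc123") or a list (["abc123", "def456"]).
    Assumption: None or "" means no subnet was specified.
    """
    if value is None or value == "":
        return []
    if isinstance(value, str):
        # Old spelling: a single subnet ID.
        return [value]
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        # New spelling: a list of subnet IDs, kept in configured order.
        return list(value)
    raise TypeError("SubnetID must be a string or a list of strings")
```

With this shape, the rest of the driver only ever sees a list, so the single-subnet case needs no special handling.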
Actions #3

Updated by Peter Amstutz 9 months ago

  • Target version changed from Future to Development 2023-08-16
Actions #4

Updated by Peter Amstutz 9 months ago

  • Subject changed from Support multiple subnets to Support multiple subnets in arvados-dispatch-cloud
Actions #5

Updated by Peter Amstutz 9 months ago

  • Assigned To set to Tom Clegg
Actions #6

Updated by Tom Clegg 9 months ago

  • Status changed from New to In Progress
Actions #7

Updated by Tom Clegg 9 months ago

From discussion:
  • Sticking to one subnet as long as it works (see #note-2) is probably better than round-robin (see description) if the subnets are likely to be in different availability zones. With round-robin, a connectivity problem in any one subnet would more or less guarantee that all multi-container workflows fail out.
  • We want prometheus metrics for (at least) the number of instances running in each subnet.
  • It should be possible/easy to detect when one of the subnets is not usable (as opposed to not even being attempted because a different subnet is working fine).
  • Since different subnets can be in different availability zones, "instance type not available" and "insufficient spot instance capacity" errors -- although they don't mention the word subnet -- are potentially solvable by trying a different subnet, so they should be handled as "try a different subnet". (This might cause surprising behavior -- heterogeneous workflows combined with spot instance availability patterns could cause a-d-c to spread instances around on multiple subnets instead of sticking to one working subnet -- but this seems like an acceptable risk for a first implementation.)
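
The "stick to one subnet until it fails" strategy from the discussion can be sketched roughly as follows. This is illustrative Python under stated assumptions, not the a-d-c implementation: StickySubnetPicker, SubnetUnavailable, and the launch callback are hypothetical names, and SubnetUnavailable stands in for whatever set of errors is treated as "try a different subnet".

```python
from typing import Callable, Optional, Sequence


class SubnetUnavailable(Exception):
    """Hypothetical error meaning 'this subnet cannot take the instance'."""


class StickySubnetPicker:
    """Keep using the last subnet that worked; move on only after a failure."""

    def __init__(self, subnet_ids: Sequence[str]) -> None:
        if not subnet_ids:
            raise ValueError("at least one subnet ID is required")
        self.subnet_ids = list(subnet_ids)
        self.current = 0  # index of the subnet to try first

    def create(self, launch: Callable[[str], str]) -> str:
        """Try each subnet at most once, starting with the current one."""
        last_error: Optional[Exception] = None
        count = len(self.subnet_ids)
        for offset in range(count):
            idx = (self.current + offset) % count
            try:
                instance_id = launch(self.subnet_ids[idx])
            except SubnetUnavailable as err:
                last_error = err
                continue
            # Success: the next create attempt starts with this subnet.
            self.current = idx
            return instance_id
        # Every subnet failed; report the most recent error.
        assert last_error is not None
        raise last_error
```

Unlike round-robin, a healthy first subnet means the others are never attempted, which is why the discussion also asks for a way to tell an unusable subnet apart from an untried one.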
Actions #8

Updated by Tom Clegg 9 months ago

New metrics:
  • number of instances running in each subnet (subnet_id is empty if not reported by EC2, or in this case the test stub)
    # HELP arvados_dispatchcloud_ec2_instances Number of instances running
    # TYPE arvados_dispatchcloud_ec2_instances gauge
    arvados_dispatchcloud_ec2_instances{subnet_id=""} 2
    
  • number of successful/unsuccessful attempts to start instances in each subnet (subnet_id is empty if not specified in config)
    # HELP arvados_dispatchcloud_ec2_instance_starts_total Number of attempts to start a new instance
    # TYPE arvados_dispatchcloud_ec2_instance_starts_total counter
    arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-full",success="0"} 1
    arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-full",success="1"} 0
    arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-good",success="0"} 0
    arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-good",success="1"} 2
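
To show how the per-subnet start counter produces the sample above, here is a minimal stdlib-only sketch in Python. It is not how a-d-c exports metrics (a real service would normally use a Prometheus client library); the StartCounter class and its methods are hypothetical, and only the metric name, labels, and HELP/TYPE text are taken from the output shown above.

```python
from collections import Counter
from typing import Iterable

METRIC = "arvados_dispatchcloud_ec2_instance_starts_total"


class StartCounter:
    """Count instance start attempts per (subnet_id, success) pair."""

    def __init__(self, subnet_ids: Iterable[str]) -> None:
        self.counts: Counter = Counter()
        # Pre-create every series so zero-valued ones are exported too.
        for subnet_id in subnet_ids:
            for success in ("0", "1"):
                self.counts[(subnet_id, success)] = 0

    def record(self, subnet_id: str, ok: bool) -> None:
        self.counts[(subnet_id, "1" if ok else "0")] += 1

    def render(self) -> str:
        """Render the counter in Prometheus text exposition format."""
        lines = [
            f"# HELP {METRIC} Number of attempts to start a new instance",
            f"# TYPE {METRIC} counter",
        ]
        for (subnet_id, success), n in sorted(self.counts.items()):
            lines.append(
                f'{METRIC}{{subnet_id="{subnet_id}",success="{success}"}} {n}'
            )
        return "\n".join(lines) + "\n"
```

Recording one failed start in subnet-full and two successful starts in subnet-good reproduces the four series listed above.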
    

In addition to any error whose code contains Subnet (e.g., InsufficientFreeAddressesInSubnet), we handle InsufficientInstanceCapacity and InsufficientVolumeCapacity by trying a different subnet.
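
As a rough illustration of that rule (Python sketch, not the actual driver code; the function name should_try_next_subnet is hypothetical, and case-sensitive substring matching is an assumption):

```python
# Error codes that do not mention a subnet but may still be solved by
# trying a subnet in a different availability zone.
CAPACITY_ERROR_CODES = frozenset({
    "InsufficientInstanceCapacity",
    "InsufficientVolumeCapacity",
})


def should_try_next_subnet(error_code: str) -> bool:
    """Return True if a failed start should be retried in another subnet."""
    return "Subnet" in error_code or error_code in CAPACITY_ERROR_CODES
```

For example, should_try_next_subnet("InsufficientFreeAddressesInSubnet") returns True, while an unrelated code such as "AuthFailure" returns False.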

I'm on the fence about whether we should rename SubnetID to SubnetIDs in the config now that it accepts multiple values. We could accept both, and/or log a warning when the old name is used, error out when both are used, etc. Is it really worth the effort? I'm leaning no.

Unrelated fix: Added InstanceLimitExceeded and InsufficientAddressCapacity to our list of error codes that should be handled as quota errors. See https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html

20755-ec2-multiple-subnets @ 6732a23110c9d1c0b4a188d98a81fad4da6705c0 -- developer-run-tests: #3758

Actions #9

Updated by Lucas Di Pentima 9 months ago

The changes LGTM, but I wonder if the cloudtest command should test both subnets for completeness' sake, or at least have a flag to enable that kind of test.

Re: SubnetID vs SubnetIDs, I agree that the amount of effort needed seems too high for such a small reward.

Actions #10

Updated by Tom Clegg 9 months ago

Very good point.

20755-ec2-multiple-subnets @ ad5eb020d76da3ef5927b3d8c364390d42493ddd -- developer-run-tests: #3765

arvados-server cloudtest now runs the instance lifecycle test once for each configured subnet ID, if there's more than one.

Tried it on tordo by changing config to say

          SubnetID: [subnet-00redactedactualid00, subnet-bogus]

(I left a binary in tordo:/root/arvados-server-ad5eb020d76da3ef5927b3d8c364390d42493ddd-dev in case you want to try it. ./arvados-server-2d339d2e6d739da6c5e934257bd8891d6748484c-dev cloudtest -config /etc/arvados/test-config-20755.yml is handy for this sort of thing.)

Actions #11

Updated by Tom Clegg 9 months ago

Fixed testing bug (accept empty DriverParameters, which only happens in tests). Merged main to get the fix for the unreliable dispatchcloud test.

20755-ec2-multiple-subnets @ e805640e7904dc282f00be82a2edb53d496a87bb -- developer-run-tests: #3766

Actions #12

Updated by Lucas Di Pentima 9 months ago

This LGTM, thanks!

Actions #13

Updated by Tom Clegg 9 months ago

  • Status changed from In Progress to Resolved
Actions #14

Updated by Tom Clegg 9 months ago

  • Release set to 66