Feature #20755
Support multiple subnets in arvados-dispatch-cloud (closed)
Added by Peter Amstutz over 1 year ago. Updated over 1 year ago.
Description
Customer has two "burst" subnets in two different availability zones. We would like to be able to use both subnets. The options to implement this seem to be either:
- We run two instances of arvados-dispatch-cloud, each one configured for a different subnet
- A single instance of arvados-dispatch-cloud (or the AWS driver specifically) can be configured with multiple subnets, and launches new instances on them in round-robin order
Updated by Tom Clegg over 1 year ago
I think it would work to use a single a-d-c process, but just have it create new instances in subnet#1 until it gets an error that looks like "subnet full", at which point it retries in subnet#2, then subnet#3, etc. If one of the subnets succeeds, that subnet is the one a-d-c starts with on the next create attempt.
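A minimal Go sketch of that retry order, not the actual a-d-c code. The names `subnetPicker`, `errSubnetFull`, and the `launch` callback are hypothetical, and wrapping around to earlier subnets after a later one has become the starting point is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errSubnetFull stands in for whatever "subnet full" error the cloud returns
// (hypothetical; the real driver would inspect the provider's error code).
var errSubnetFull = errors.New("subnet full")

// subnetPicker remembers which subnet worked last and starts there next time.
type subnetPicker struct {
	subnets []string
	current int // index of the subnet to try first
}

// create tries each configured subnet at most once, beginning with the one
// that succeeded most recently. launch is a hypothetical callback that starts
// one instance in the given subnet.
func (p *subnetPicker) create(launch func(subnet string) error) (string, error) {
	if len(p.subnets) == 0 {
		return "", errors.New("no subnets configured")
	}
	var lastErr error
	for i := 0; i < len(p.subnets); i++ {
		idx := (p.current + i) % len(p.subnets)
		err := launch(p.subnets[idx])
		if err == nil {
			p.current = idx // stick with this subnet on the next attempt
			return p.subnets[idx], nil
		}
		if !errors.Is(err, errSubnetFull) {
			return "", err // unrelated failure: other subnets would not help
		}
		lastErr = err
	}
	return "", fmt.Errorf("all subnets failed: %w", lastErr)
}

func main() {
	p := &subnetPicker{subnets: []string{"subnet-1", "subnet-2", "subnet-3"}}
	launch := func(s string) error {
		if s == "subnet-1" {
			return errSubnetFull
		}
		return nil
	}
	for i := 0; i < 2; i++ {
		used, err := p.create(launch)
		fmt.Println(used, err)
	}
}
```

In this sketch the second call goes straight to the subnet that worked on the first call, so a full subnet#1 costs only one failed attempt.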
For the sake of compatibility we would want to accept both of these spellings in the config:
    DriverParameters:
      SubnetID: abc123

    DriverParameters:
      SubnetID: [abc123, def456]
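One way to accept both spellings is a custom decoder on the field type. The sketch below is illustrative only: the type name `subnetIDs`, the `driverParameters` struct, and the helper `parseDriverParameters` are hypothetical, and it assumes the driver parameters reach the driver as JSON. Treating an empty string as "no subnet" is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// subnetIDs accepts either a single JSON string or a JSON array of strings.
type subnetIDs []string

// UnmarshalJSON implements json.Unmarshaler.
func (s *subnetIDs) UnmarshalJSON(data []byte) error {
	if len(data) > 0 && data[0] == '[' {
		// New spelling: a list of subnet IDs.
		var many []string
		if err := json.Unmarshal(data, &many); err != nil {
			return err
		}
		*s = many
		return nil
	}
	// Old spelling: a single subnet ID.
	var one string
	if err := json.Unmarshal(data, &one); err != nil {
		return err
	}
	if one == "" {
		*s = nil // assumption: empty string (or null) means "no subnet"
		return nil
	}
	*s = subnetIDs{one}
	return nil
}

// driverParameters is a cut-down, hypothetical stand-in for the real struct.
type driverParameters struct {
	SubnetID subnetIDs
}

// parseDriverParameters decodes the JSON form of DriverParameters.
func parseDriverParameters(data []byte) (driverParameters, error) {
	var dp driverParameters
	err := json.Unmarshal(data, &dp)
	return dp, err
}

func main() {
	for _, in := range []string{
		`{"SubnetID": "abc123"}`,
		`{"SubnetID": ["abc123", "def456"]}`,
	} {
		dp, err := parseDriverParameters([]byte(in))
		fmt.Println(len(dp.SubnetID), dp.SubnetID, err)
	}
}
```

Either way, the rest of the driver only ever sees a slice, so the single-subnet case needs no special handling downstream.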
Updated by Peter Amstutz over 1 year ago
- Target version changed from Future to Development 2023-08-16
Updated by Peter Amstutz over 1 year ago
- Subject changed from Support multiple subnets to Support multiple subnets in arvados-dispatch-cloud
Updated by Tom Clegg over 1 year ago
- Sticking to one subnet as long as it works (see #note-2) is probably better than round-robin (see description) if the subnets are likely to be in different availability zones, in that round-robin would more or less guarantee that a connectivity problem in any subnet would cause all multi-container workflows to fail out.
- We want prometheus metrics for (at least) the number of instances running in each subnet.
- It should be possible/easy to detect when one of the subnets is not usable (as opposed to not even being attempted because a different subnet is working fine).
- Since different subnets can be in different availability zones, "instance type not available" and "insufficient spot instance capacity" errors -- although they don't mention the word subnet -- are potentially solvable by trying a different subnet, so they should be handled as "try a different subnet". (This might cause surprising behavior -- heterogeneous workflows combined with spot instance availability patterns could cause a-d-c to spread instances around on multiple subnets instead of sticking to one working subnet -- but this seems like an acceptable risk for a first implementation.)
Updated by Tom Clegg over 1 year ago
- number of instances running in each subnet (subnet_id is empty if not reported by EC2, or in this case the test stub)
    # HELP arvados_dispatchcloud_ec2_instances Number of instances running
    # TYPE arvados_dispatchcloud_ec2_instances gauge
    arvados_dispatchcloud_ec2_instances{subnet_id=""} 2
- number of successful/unsuccessful attempts to start instances in each subnet (subnet_id is empty if not specified in config)
    # HELP arvados_dispatchcloud_ec2_instance_starts_total Number of attempts to start a new instance
    # TYPE arvados_dispatchcloud_ec2_instance_starts_total counter
    arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-full",success="0"} 1
    arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-full",success="1"} 0
    arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-good",success="0"} 0
    arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-good",success="1"} 2
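To illustrate how the start counter is labeled, here is a standard-library-only Go sketch that keeps one count per (subnet_id, success) pair and renders it in the text format shown above. It is a stand-in for illustration, not the real code, which would presumably use a Prometheus client library; the type `startCounter` and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const (
	metricName = "arvados_dispatchcloud_ec2_instance_starts_total"
	metricHelp = "Number of attempts to start a new instance"
)

// startKey is one label combination: a subnet and an outcome.
type startKey struct {
	subnetID string
	success  bool
}

// startCounter counts instance start attempts per subnet and outcome.
type startCounter struct {
	counts map[startKey]int
}

// newStartCounter pre-creates zero-valued series for every configured subnet,
// so that subnets with no attempts still show up in the output.
func newStartCounter(subnetIDs []string) *startCounter {
	c := &startCounter{counts: map[startKey]int{}}
	for _, id := range subnetIDs {
		c.counts[startKey{id, false}] = 0
		c.counts[startKey{id, true}] = 0
	}
	return c
}

// observe records one start attempt.
func (c *startCounter) observe(subnetID string, success bool) {
	c.counts[startKey{subnetID, success}]++
}

// get returns the current count for one label combination.
func (c *startCounter) get(subnetID string, success bool) int {
	return c.counts[startKey{subnetID, success}]
}

// render writes all series in a stable order (by subnet, failures first).
func (c *startCounter) render() string {
	keys := make([]startKey, 0, len(c.counts))
	for k := range c.counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].subnetID != keys[j].subnetID {
			return keys[i].subnetID < keys[j].subnetID
		}
		return !keys[i].success && keys[j].success
	})
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", metricName, metricHelp)
	fmt.Fprintf(&b, "# TYPE %s counter\n", metricName)
	for _, k := range keys {
		success := "0"
		if k.success {
			success = "1"
		}
		fmt.Fprintf(&b, "%s{subnet_id=%q,success=%q} %d\n", metricName, k.subnetID, success, c.counts[k])
	}
	return b.String()
}

func main() {
	c := newStartCounter([]string{"subnet-full", "subnet-good"})
	c.observe("subnet-full", false)
	c.observe("subnet-good", true)
	c.observe("subnet-good", true)
	fmt.Print(c.render())
}
```

Pre-creating the zero-valued series is what lets a dashboard distinguish "subnet never attempted" from "subnet attempted and failed".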
In addition to any error whose code contains Subnet (e.g., InsufficientFreeAddressesInSubnet), we handle InsufficientInstanceCapacity and InsufficientVolumeCapacity by trying a different subnet.
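That rule can be sketched as a small predicate on the EC2 error code. The function name `shouldTryNextSubnet` is hypothetical, and the sketch assumes the error code is already available as a plain string; how the code is extracted from the SDK error is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldTryNextSubnet reports whether an EC2 error code is one that a
// different subnet (possibly in another availability zone) might avoid.
func shouldTryNextSubnet(code string) bool {
	switch code {
	case "InsufficientInstanceCapacity", "InsufficientVolumeCapacity":
		// Capacity errors do not mention subnets, but capacity is per zone.
		return true
	}
	// Any code that mentions Subnet, e.g. InsufficientFreeAddressesInSubnet.
	return strings.Contains(code, "Subnet")
}

func main() {
	for _, code := range []string{
		"InsufficientFreeAddressesInSubnet",
		"InsufficientInstanceCapacity",
		"InsufficientVolumeCapacity",
		"AuthFailure",
	} {
		fmt.Println(code, shouldTryNextSubnet(code))
	}
}
```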
I'm on the fence about whether we should rename SubnetID to SubnetIDs in the config now that it accepts multiple values. We could accept both, and/or log a warning when the old name is used, error out when both are used, etc. Is it really worth the effort? I'm leaning no.
Unrelated fix: Added InstanceLimitExceeded and InsufficientAddressCapacity to our list of error codes that should be handled as quota errors. See https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html
20755-ec2-multiple-subnets @ 6732a23110c9d1c0b4a188d98a81fad4da6705c0 -- developer-run-tests: #3758
Updated by Lucas Di Pentima over 1 year ago
The changes LGTM, but I wonder if the cloudtest command should test both subnets for completeness' sake, or at least have a flag to enable that kind of test.
Re: SubnetID vs SubnetIDs, I agree that the effort needed seems too high for such a small reward.
Updated by Tom Clegg over 1 year ago
Very good point.
20755-ec2-multiple-subnets @ ad5eb020d76da3ef5927b3d8c364390d42493ddd -- developer-run-tests: #3765
arvados-server cloudtest now runs the instance lifecycle test once for each configured subnet ID, if there's more than one.
Tried it on tordo by changing config to say
    SubnetID: [subnet-00redactedactualid00, subnet-bogus]
(I left a binary in tordo:/root/arvados-server-ad5eb020d76da3ef5927b3d8c364390d42493ddd-dev in case you want to try it. ./arvados-server-2d339d2e6d739da6c5e934257bd8891d6748484c-dev cloudtest -config /etc/arvados/test-config-20755.yml is handy for this sort of thing.)
Updated by Tom Clegg over 1 year ago
Fixed testing bug (accept empty DriverParameters, which only happens in tests). Merged main to get the fix for the unreliable dispatchcloud test.
20755-ec2-multiple-subnets @ e805640e7904dc282f00be82a2edb53d496a87bb -- developer-run-tests: #3766
Updated by Tom Clegg over 1 year ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|8da1cf8c69337be51a7ace6923d8c0c0bc7d36e1.