Feature #20755
Closed
Support multiple subnets in arvados-dispatch-cloud
100%
Description
Customer has two "burst" subnets in two different availability zones. We would like to be able to use both subnets. The options to implement this seem to be either:
- We run two instances of arvados-dispatch-cloud, each one configured for a different subnet
- A single instance of arvados-dispatch-cloud (or the AWS driver specifically) can be configured with multiple subnets, and round-robin balances launching instances on each subnet
Updated by Tom Clegg 4 months ago
I think it would work to use a single a-d-c process, but just have it create new instances in subnet#1 until it gets an error that looks like "subnet full", at which point it retries in subnet#2, then subnet#3, etc. If one of the subnets succeeds, that subnet is the one a-d-c starts with on the next create attempt.
For the sake of compatibility we would want to accept both of these spellings in the config:
DriverParameters:
  SubnetID: abc123

DriverParameters:
  SubnetID: [abc123, def456]
Updated by Peter Amstutz 4 months ago
- Target version changed from To be groomed to Development 2023-08-16
Updated by Peter Amstutz 4 months ago
- Subject changed from Support multiple subnets to Support multiple subnets in arvados-dispatch-cloud
Updated by Tom Clegg 4 months ago
- Sticking to one subnet as long as it works (see #note-2) is probably better than round-robin (see description) if the subnets are likely to be in different availability zones, in that round-robin would more or less guarantee that a connectivity problem in any subnet would cause all multi-container workflows to fail out.
- We want prometheus metrics for (at least) the number of instances running in each subnet.
- It should be possible/easy to detect when one of the subnets is not usable (as opposed to not even being attempted because a different subnet is working fine).
- Since different subnets can be in different availability zones, "instance type not available" and "insufficient spot instance capacity" errors -- although they don't mention the word subnet -- are potentially solvable by trying a different subnet, so they should be handled as "try a different subnet". (This might cause surprising behavior -- heterogeneous workflows combined with spot instance availability patterns could cause a-d-c to spread instances around on multiple subnets instead of sticking to one working subnet -- but this seems like an acceptable risk for a first implementation.)
Updated by Tom Clegg 4 months ago
- number of instances running in each subnet (subnet_id is empty if not reported by EC2, or in this case the test stub)
# HELP arvados_dispatchcloud_ec2_instances Number of instances running
# TYPE arvados_dispatchcloud_ec2_instances gauge
arvados_dispatchcloud_ec2_instances{subnet_id=""} 2
- number of successful/unsuccessful attempts to start instances in each subnet (subnet_id is empty if not specified in config)
# HELP arvados_dispatchcloud_ec2_instance_starts_total Number of attempts to start a new instance
# TYPE arvados_dispatchcloud_ec2_instance_starts_total counter
arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-full",success="0"} 1
arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-full",success="1"} 0
arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-good",success="0"} 0
arvados_dispatchcloud_ec2_instance_starts_total{subnet_id="subnet-good",success="1"} 2
In addition to any error whose code contains Subnet (e.g., InsufficientFreeAddressesInSubnet), we handle InsufficientInstanceCapacity and InsufficientVolumeCapacity by trying a different subnet.
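The rule can be sketched as a small predicate on the EC2 error code. `isSubnetRetryable` is a hypothetical name for illustration, and the sketch assumes the error code has already been extracted from the AWS error as a plain string.

```go
package main

import (
	"fmt"
	"strings"
)

// isSubnetRetryable (hypothetical) reports whether an EC2 error code
// should be handled by trying a different subnet.
func isSubnetRetryable(code string) bool {
	switch code {
	case "InsufficientInstanceCapacity", "InsufficientVolumeCapacity":
		// Capacity errors don't mention subnets, but a subnet in another
		// availability zone may still have capacity.
		return true
	}
	// Any code that mentions Subnet, e.g. InsufficientFreeAddressesInSubnet.
	return strings.Contains(code, "Subnet")
}

func main() {
	codes := []string{
		"InsufficientFreeAddressesInSubnet",
		"InsufficientInstanceCapacity",
		"InsufficientVolumeCapacity",
		"InstanceLimitExceeded",
	}
	for _, code := range codes {
		fmt.Println(code, isSubnetRetryable(code))
	}
}
```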
I'm on the fence about whether we should rename SubnetID to SubnetIDs in the config now that it accepts multiple values. We could accept both, and/or log a warning when the old name is used, error out when both are used, etc. Is it really worth the effort? I'm leaning no.
Unrelated fix: Added InstanceLimitExceeded and InsufficientAddressCapacity to our list of error codes that should be handled as quota errors. See https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html
20755-ec2-multiple-subnets @ 6732a23110c9d1c0b4a188d98a81fad4da6705c0 -- developer-run-tests: #3758
Updated by Lucas Di Pentima 4 months ago
The changes LGTM, but I wonder if the cloudtest command should test both subnets for completeness' sake, or at least have a flag to enable that kind of test.
Re: SubnetID vs SubnetIDs, I agree that the amount of effort needed seems to be too high for that small reward.
Updated by Tom Clegg 4 months ago
Very good point.
20755-ec2-multiple-subnets @ ad5eb020d76da3ef5927b3d8c364390d42493ddd -- developer-run-tests: #3765
arvados-server cloudtest now runs the instance lifecycle test once for each configured subnet ID, if there's more than one.
Tried it on tordo by changing config to say
SubnetID: [subnet-00redactedactualid00, subnet-bogus]
(I left a binary in tordo:/root/arvados-server-ad5eb020d76da3ef5927b3d8c364390d42493ddd-dev in case you want to try it. ./arvados-server-2d339d2e6d739da6c5e934257bd8891d6748484c-dev cloudtest -config /etc/arvados/test-config-20755.yml is handy for this sort of thing.)
Updated by Tom Clegg 4 months ago
Fixed testing bug (accept empty DriverParameters, which only happens in tests). Merged main to get the fix for the unreliable dispatchcloud test.
20755-ec2-multiple-subnets @ e805640e7904dc282f00be82a2edb53d496a87bb -- developer-run-tests: #3766
Updated by Tom Clegg 4 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|8da1cf8c69337be51a7ace6923d8c0c0bc7d36e1.