Bug #21603
closed
Not recognizing subnet error returned as InvalidParameterValue
Description
Mar 18 03:53:48 ip-172-25-144-184 arvados-dispatch-cloud[283002]: {"ClusterID":"xxxxx","InstanceType":"r52xlarge.preemptible","PID":283002,"error":"InvalidParameterValue: Not enough free addresses in subnet subnet-0f83ca79\n\tstatus code: 400, request id: 6cbcffe1-5b77-4dee-8fbf-c20f67892c95","level":"error","msg":"create failed","time":"2024-03-18T03:53:48.927972989Z"}
This is a subnet-specific error (it should switch to the other subnet) but the current function won't recognize it as such:
func isErrorSubnetSpecific(err error) bool {
	aerr, ok := err.(awserr.Error)
	if !ok {
		return false
	}
	code := aerr.Code()
	return strings.Contains(code, "Subnet") ||
		code == "InsufficientInstanceCapacity" ||
		code == "InsufficientVolumeCapacity" ||
		code == "Unsupported"
}
Because the error was unrecognized, the fallback behavior seems to be to rate limit itself by setting a maximum number of concurrent containers.
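One possible way to recognize this case is to also inspect the error message when the code is a generic InvalidParameter* code. The following is only a sketch under stated assumptions, not necessarily the change that was merged for this ticket: codedError and stubError are hypothetical stand-ins for awserr.Error so the example builds without the AWS SDK, and the "message mentions subnet" heuristic is an assumption based on the log line above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// codedError is a hypothetical local stand-in for the Code() and Message()
// methods of awserr.Error, so this sketch compiles without the AWS SDK.
type codedError interface {
	error
	Code() string
	Message() string
}

// stubError is a hypothetical test double, not an SDK type.
type stubError struct{ code, message string }

func (e stubError) Error() string   { return e.code + ": " + e.message }
func (e stubError) Code() string    { return e.code }
func (e stubError) Message() string { return e.message }

// isErrorSubnetSpecific reports whether err looks like a failure that is
// tied to one subnet, so that trying another subnet might succeed.
func isErrorSubnetSpecific(err error) bool {
	var cerr codedError
	if !errors.As(err, &cerr) {
		return false
	}
	code := cerr.Code()
	return strings.Contains(code, "Subnet") ||
		code == "InsufficientInstanceCapacity" ||
		code == "InsufficientVolumeCapacity" ||
		code == "Unsupported" ||
		// Assumption: a generic InvalidParameter* code whose message
		// mentions a subnet is treated as subnet-specific, to cover
		// the "Not enough free addresses in subnet" case logged above.
		(strings.HasPrefix(code, "InvalidParameter") &&
			strings.Contains(strings.ToLower(cerr.Message()), "subnet"))
}

func main() {
	examples := []error{
		// Code and message taken from the log line in this ticket.
		stubError{code: "InvalidParameterValue", message: "Not enough free addresses in subnet subnet-0f83ca79"},
		// Made-up message, only to show a case that should not match.
		stubError{code: "InvalidParameterValue", message: "some unrelated parameter problem"},
	}
	for _, err := range examples {
		fmt.Printf("%q subnet-specific: %v\n", err.Error(), isErrorSubnetSpecific(err))
	}
}
```

Matching on message text is fragile, since the wording can change without notice, so a check like this is best treated as a fallback alongside the explicit error codes rather than a replacement for them.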
Updated by Peter Amstutz 10 months ago
- Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Updated by Tom Clegg 10 months ago
See test case for links to AWS docs and some more variations on this.
21603-ec2-subnet-error @ b541c9d898d3dde983de2e0ea80a40e17d4c9b9f -- developer-run-tests: #4094
Updated by Peter Amstutz 10 months ago
Tom Clegg wrote in #note-6:
See test case for links to AWS docs and some more variations on this.
21603-ec2-subnet-error @ b541c9d898d3dde983de2e0ea80a40e17d4c9b9f -- developer-run-tests: #4094
LGTM
Updated by Peter Amstutz 10 months ago
Mar 20 13:11:09 ip-172-25-144-184 arvados-dispatch-cloud[408747]: {"ClusterID":"xxxxx","PID":408747,"SubnetID":"subnet-0f83ca79","error":"InsufficientFreeAddressesInSubnet: There are not enough free addresses in subnet 'subnet-0f83ca79' to satisfy the requested number of instances.\n\tstatus code: 400, request id: 74c87ca8-f432-4e2e-a81c-b73a25284a5f","level":"warning","msg":"RunInstances failed, trying next subnet","time":"2024-03-20T13:11:09.419626602Z"}
Now it's returning the expected error... This fix is still useful in case they screw it up again in the future.
Updated by Tom Clegg 10 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|9d1ff3299a57d0e820bf7975f0f3e6080b22f0a5.