Bug #21603

closed

Not recognizing subnet error returned as InvalidParameterValue

Added by Peter Amstutz about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Story points: -
Release relationship: Auto

Description

Mar 18 03:53:48 ip-172-25-144-184 arvados-dispatch-cloud[283002]: {"ClusterID":"xxxxx","InstanceType":"r52xlarge.preemptible","PID":283002,"error":"InvalidParameterValue: Not enough free addresses in subnet subnet-0f83ca79\n\tstatus code: 400, request id: 6cbcffe1-5b77-4dee-8fbf-c20f67892c95","level":"error","msg":"create failed","time":"2024-03-18T03:53:48.927972989Z"}

This is a subnet-specific error (it should switch to the other subnet) but the current function won't recognize it as such:

func isErrorSubnetSpecific(err error) bool {
    aerr, ok := err.(awserr.Error)
    if !ok {
        return false
    }
    code := aerr.Code()
    return strings.Contains(code, "Subnet") ||
        code == "InsufficientInstanceCapacity" ||
        code == "InsufficientVolumeCapacity" ||
        code == "Unsupported" 
}

Because the error was not recognized as subnet-specific, the fallback behavior seems to be to rate limit itself by setting maximum concurrent containers.
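One possible way to catch this case, sketched here as an illustration and not taken from the actual fix branch: keep the existing code checks and additionally treat InvalidParameterValue as subnet-specific when the error message mentions a subnet. The codedError interface and stubError type below are hypothetical stand-ins for the Code() and Message() methods of awserr.Error so the sketch compiles without the AWS SDK, and the message-matching rule is an assumption based only on the log line above.

```go
package main

import (
	"errors"
	"strings"
)

// codedError is a hypothetical stand-in for the Code/Message part of
// awserr.Error, so this sketch does not depend on the AWS SDK.
type codedError interface {
	error
	Code() string
	Message() string
}

// stubError is a hypothetical error type used only to exercise the sketch.
type stubError struct{ code, message string }

func (e stubError) Error() string   { return e.code + ": " + e.message }
func (e stubError) Code() string    { return e.code }
func (e stubError) Message() string { return e.message }

// isErrorSubnetSpecific reports whether err looks like a problem with the
// chosen subnet, meaning a retry on a different subnet might succeed.
func isErrorSubnetSpecific(err error) bool {
	var cerr codedError
	if !errors.As(err, &cerr) {
		return false
	}
	code := cerr.Code()
	return strings.Contains(code, "Subnet") ||
		code == "InsufficientInstanceCapacity" ||
		code == "InsufficientVolumeCapacity" ||
		code == "Unsupported" ||
		// Assumption: an InvalidParameterValue error whose message
		// mentions a subnet (as in the log above) is subnet-specific.
		(code == "InvalidParameterValue" &&
			strings.Contains(strings.ToLower(cerr.Message()), "subnet"))
}
```

With this sketch, an error with code InvalidParameterValue and the message "Not enough free addresses in subnet subnet-0f83ca79" is classified as subnet-specific, while an InvalidParameterValue error whose message does not mention a subnet is not.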


Subtasks 1 (0 open, 1 closed)

Task #21608: Review 21603-ec2-subnet-error | Resolved | Peter Amstutz | 03/20/2024
Actions #1

Updated by Peter Amstutz about 2 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz about 2 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Actions #4

Updated by Peter Amstutz about 2 months ago

  • Description updated (diff)
Actions #5

Updated by Tom Clegg about 2 months ago

  • Target version changed from Development 2024-04-10 sprint to Development 2024-03-27 sprint
  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress
Actions #6

Updated by Tom Clegg about 2 months ago

See test case for links to AWS docs and some more variations on this.

21603-ec2-subnet-error @ b541c9d898d3dde983de2e0ea80a40e17d4c9b9f -- developer-run-tests: #4094

Actions #7

Updated by Peter Amstutz about 2 months ago

  • Release set to 69
Actions #8

Updated by Peter Amstutz about 2 months ago

Tom Clegg wrote in #note-6:

See test case for links to AWS docs and some more variations on this.

21603-ec2-subnet-error @ b541c9d898d3dde983de2e0ea80a40e17d4c9b9f -- developer-run-tests: #4094

LGTM

Actions #9

Updated by Peter Amstutz about 2 months ago

Mar 20 13:11:09 ip-172-25-144-184 arvados-dispatch-cloud[408747]: {"ClusterID":"xxxxx","PID":408747,"SubnetID":"subnet-0f83ca79","error":"InsufficientFreeAddressesInSubnet: There are not enough free addresses in subnet 'subnet-0f83ca79' to satisfy the requested number of instances.\n\tstatus code: 400, request id: 74c87ca8-f432-4e2e-a81c-b73a25284a5f","level":"warning","msg":"RunInstances failed, trying next subnet","time":"2024-03-20T13:11:09.419626602Z"}

Now it's returning the expected error... This fix is still useful in case they screw it up again in the future.

Actions #10

Updated by Tom Clegg about 2 months ago

  • Status changed from In Progress to Resolved