Bug #17776

[a-d-c] [ec2] when InsufficientInstanceCapacity is returned, we should throttle node creation.

Added by Ward Vandewege 7 days ago. Updated about 2 hours ago.

Status: In Progress
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 06/10/2021
Due date:
% Done: 0%
Estimated time: (Total: 0.00 h)
Story points: -

Subtasks

Task #17784: review 17776-more-throttling (In Progress, Tom Clegg)


Related issues

Related to Arvados - Bug #17777: [a-d-c] [ec2] MaxSpotInstanceCountExceeded should throttle creation attempts for preemptible instances (New)

Related to Arvados - Bug #17783: [a-d-c] [ec2] VcpuLimitExceeded should throttle node creation attempts (New)

History

#1 Updated by Ward Vandewege 7 days ago

  • Target version changed from To Be Groomed to 2021-06-23 sprint
  • Assigned To set to Ward Vandewege
  • Status changed from New to In Progress

A very basic approach is at 66d3cb88d07eed627903b6db0b1cffb7491d4e34 on branch 17776-more-throttling.

#2 Updated by Ward Vandewege 7 days ago

  • Related to Bug #17777: [a-d-c] [ec2] MaxSpotInstanceCountExceeded should throttle creation attempts for preemptible instances added

#3 Updated by Ward Vandewege 7 days ago

  • Related to Bug #17783: [a-d-c] [ec2] VcpuLimitExceeded should throttle node creation attempts added

#4 Updated by Tom Clegg 5 days ago

For detecting the error:
  • I don't think we want to export IsErrorCapacity.
  • The extra isCodeCapacity func seems needlessly verbose, all for the sake of saving a few bytes on an unchanging map. It could just be var isCodeCapacity = map[string]bool{"InsufficientInstanceCapacity": true, ...}
For reporting it back to dispatcher:
  • These errors seem more like quota errors than API request limit errors. We have a different interface for quota errors (IsQuotaError() bool); the Azure driver has an example. If the driver reports them that way, the dispatcher can shut down idle nodes in an effort to free up capacity.
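
To illustrate the two suggestions above, here is a minimal sketch, not the code on the branch. Only isCodeCapacity, the InsufficientInstanceCapacity code and the IsQuotaError() bool method come from this discussion. The quotaError type, createInstance, atQuota and the "SomeOtherCode" value are hypothetical names made up for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// isCodeCapacity is the plain map suggested above. Only
// InsufficientInstanceCapacity is taken from this ticket; the real
// list may contain more codes.
var isCodeCapacity = map[string]bool{
	"InsufficientInstanceCapacity": true,
}

// quotaError is a hypothetical wrapper that marks an error as a
// quota/capacity condition by implementing IsQuotaError() bool.
type quotaError struct{ error }

func (quotaError) IsQuotaError() bool { return true }

// createInstance is a hypothetical stand-in for a driver call that
// fails with the given error code.
func createInstance(code string) error {
	err := errors.New("ec2: " + code)
	if isCodeCapacity[code] {
		return quotaError{err}
	}
	return err
}

// atQuota shows how a caller such as a dispatcher could detect the
// condition without knowing the concrete error type.
func atQuota(err error) bool {
	var qe interface{ IsQuotaError() bool }
	return errors.As(err, &qe) && qe.IsQuotaError()
}

func main() {
	for _, code := range []string{"InsufficientInstanceCapacity", "SomeOtherCode"} {
		if atQuota(createInstance(code)) {
			fmt.Println(code, "-> quota error: hold off creating, consider shutting down idle nodes")
		} else {
			fmt.Println(code, "-> other error")
		}
	}
}
```

Keeping the marker as a method on the error means the caller only depends on a one-method interface, so each driver can decide for itself which codes count as quota problems.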

#5 Updated by Ward Vandewege about 2 hours ago

Tom Clegg wrote:

> For detecting the error:
>   • I don't think we want to export IsErrorCapacity.
>   • The extra isCodeCapacity func seems needlessly verbose, all for the sake of saving a few bytes on an unchanging map. It could just be var isCodeCapacity = map[string]bool{"InsufficientInstanceCapacity": true, ...}

Yes, all fixed, thanks.

> For reporting it back to dispatcher:
>   • These errors seem more like quota errors than API request limit errors. We have a different interface for quota errors (IsQuotaError() bool); the Azure driver has an example. If the driver reports them that way, the dispatcher can shut down idle nodes in an effort to free up capacity.

Thanks! I've updated the branch accordingly. I've also added a basic test for wrapError in the ec2 driver. See 6bb5a84a53e5810e96e56e41cc751d4ebc054580 on branch 17776-more-throttling.

Tests in https://ci.arvados.org/view/Developer/job/developer-run-tests/2527/
