Bug #20667 (closed): atQuota should dynamically lower maxSupervisors

Added by Peter Amstutz 11 months ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Crunch
Story points: -
Release relationship: Auto

Description

If the initial value of maxInstances is large and the dispatcher becomes cloud quota-limited (as opposed to getting 503 errors from controller), it does not recompute maxSupervisors based on the actual number of instances that can be run. This can result in inefficient usage or starvation, because maxSupervisors does not reflect the actual limitations of the cluster.

It would also be nice to have a metric indicating when we are unable to start nodes because the cloud has returned a capacity error.

My test case was running out of quota at 250 nodes (InsufficientFreeAddressesInSubnet), and this wasn't apparent from the dashboard; I had to dig into the syslog, and even then it was hard to find.
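
A minimal sketch of the idea, with assumed names and shapes rather than the actual dispatcher code: when the cloud reports a capacity error, treat the number of instances we actually managed to start as the effective capacity, derive the supervisor cap from that instead of MaxInstances, and publish a gauge so the condition shows up on the dashboard instead of only in syslog.

```go
package dispatchsketch

import "github.com/prometheus/client_golang/prometheus"

// atQuotaGauge is 1 while the most recent attempt to create an instance
// failed with a cloud quota/capacity error (e.g.
// InsufficientFreeAddressesInSubnet), and 0 otherwise.
var atQuotaGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "dispatchcloud_at_quota", // assumed metric name
	Help: "1 if instance creation is currently blocked by a cloud quota/capacity error",
})

// effectiveMaxSupervisors recomputes the supervisor cap. maxInstances is the
// configured MaxInstances, runningInstances is how many instances the cloud
// actually let us start, and supervisorFraction is MaxSupervisors expressed
// as a fraction (e.g. 0.4).
func effectiveMaxSupervisors(atQuota bool, maxInstances, runningInstances int, supervisorFraction float64) int {
	capacity := maxInstances
	if atQuota {
		atQuotaGauge.Set(1)
		if runningInstances < capacity {
			// The cloud quota, not MaxInstances, is the real limit right now.
			capacity = runningInstances
		}
	} else {
		atQuotaGauge.Set(0)
	}
	return int(float64(capacity) * supervisorFraction)
}
```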


Subtasks 1 (0 open, 1 closed)

Task #20675: Review 20667-maxsuper-atquota (Resolved, Peter Amstutz, 06/26/2023)
Actions #1

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
  • Subject changed from Quota error should dynamically lower maxConcurrency and maxSupervisors to atQuota should dynamically lower maxSupervisors
Actions #3

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 11 months ago

  • Assigned To set to Tom Clegg
Actions #5

Updated by Tom Clegg 10 months ago

20667-maxsuper-atquota @ c624929279e70d58017eab08ed286dda88bcd215 -- developer-run-tests: #3720

This takes AtQuota() into account when applying the max-supervisor limit, and adds an at_quota metric.

Also, when hitting quota, the scheduler reduces its self-imposed concurrency limit to the current level, so after that point it only gets raised 10% at a time. Essentially, once it sees its first quota error, its self-imposed limit will approximately track the cloud quota, even if the cloud quota goes up (e.g., when the quota is based on CPU count, the effective instance limit is variable).
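
A rough sketch of that adjustment, under assumed names rather than the actual scheduler code: on a quota error the self-imposed limit snaps down to the current concurrency, and afterwards it only grows by roughly 10% per step while the scheduler is running at the limit without further quota errors.

```go
package dispatchsketch

// concurrencyLimiter models the scheduler's self-imposed concurrency limit.
type concurrencyLimiter struct {
	limit int // 0 means no self-imposed limit yet
}

// onQuotaError clamps the limit to the number of containers currently
// running, so from the first quota error onward the limit approximately
// tracks the real cloud quota.
func (cl *concurrencyLimiter) onQuotaError(currentRunning int) {
	cl.limit = currentRunning
}

// maybeRaise is called when we are running at the limit without seeing
// quota errors; it raises the limit by about 10% (at least 1) so the
// scheduler can probe whether the cloud quota has gone up.
func (cl *concurrencyLimiter) maybeRaise(currentRunning int) {
	if cl.limit == 0 || currentRunning < cl.limit {
		return
	}
	step := cl.limit / 10
	if step < 1 {
		step = 1
	}
	cl.limit += step
}
```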

However, this still doesn't address the following fairly obvious/likely scenario:
  1. Set MaxInstances=1000, MaxSupervisors=0.4
  2. Queue has 400 supervisors when a-d-c starts up
  3. Scheduler starts 250 supervisors successfully
  4. Cloud reports quota error
  5. Scheduler reduces max supervisors target to 100, but it's too late
  6. Deadlock

I think we could address this by having a configurable initial value for the self-imposed concurrency limit (right now it's unlimited). A low initial value (say 8) would give a-d-c a "soft start" effect where it waits for the first few supervisors to start some child containers before pushing up the max_concurrent_containers limit and admitting more supervisors.

InitialQuotaEstimate or InitialConcurrencyLimit or ...?
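
For illustration, the soft-start seed might look like the sketch below. The struct shape, field placement, and helper function are assumptions; only the InitialQuotaEstimate name and the current "unlimited by default" behavior come from this discussion.

```go
package dispatchsketch

// cloudVMsConfig is an assumed, simplified config shape.
type cloudVMsConfig struct {
	MaxInstances         int
	InitialQuotaEstimate int // soft-start value for the self-imposed limit; 0 = disabled
}

// initialConcurrencyLimit picks the starting self-imposed limit: the
// configured soft-start value if set (say 8), otherwise MaxInstances,
// which preserves the old effectively-unlimited behavior.
func initialConcurrencyLimit(cfg cloudVMsConfig) int {
	if cfg.InitialQuotaEstimate > 0 {
		return cfg.InitialQuotaEstimate
	}
	return cfg.MaxInstances
}
```

With a low seed, the scheduler would only admit more supervisors as the first few prove able to start child containers and the limit is ratcheted up.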

Actions #6

Updated by Tom Clegg 10 months ago

20667-maxsuper-atquota @ 5372ee7878d6880081dd4b5481e1820fc7cd1975 -- developer-run-tests: #3722

With InitialQuotaEstimate config.

Actions #7

Updated by Tom Clegg 10 months ago

  • Status changed from New to In Progress
Actions #8

Updated by Tom Clegg 10 months ago

  • Target version changed from Development 2023-07-05 sprint to Development 2023-07-19 sprint
Actions #9

Updated by Peter Amstutz 10 months ago

Tom Clegg wrote in #note-6:

20667-maxsuper-atquota @ 5372ee7878d6880081dd4b5481e1820fc7cd1975 -- developer-run-tests: #3722

With InitialQuotaEstimate config.

This LGTM.

Sorry it took me forever to get to it.

Actions #10

Updated by Tom Clegg 10 months ago

  • Status changed from In Progress to Resolved
Actions #11

Updated by Peter Amstutz 9 months ago

  • Release set to 66