Bug #20667
atQuota should dynamically lower maxSupervisors
Description
If the initial value of maxInstances is large and the dispatcher becomes limited by cloud quota (as opposed to getting 503 errors from the controller), it does not recompute maxSupervisors based on the number of instances that can actually be run. This can result in inefficient usage or starvation because maxSupervisors does not reflect the actual limits of the cluster.
It would also be nice to have a metric that represents when we are not able to start nodes because we have gotten a capacity error from the cloud.
My test case ran out of quota at 250 nodes (InsufficientFreeAddressesInSubnet), and this wasn't apparent from the dashboard. I had to dig into the syslog, and even then it was hard to find.
Updated by Peter Amstutz almost 2 years ago
- Description updated (diff)
- Subject changed from Quota error should dynamically lower maxConcurrency and maxSupervisors to atQuota should dynamically lower maxSupervisors
Updated by Tom Clegg almost 2 years ago
20667-maxsuper-atquota @ c624929279e70d58017eab08ed286dda88bcd215 -- developer-run-tests: #3720
This takes AtQuota() into account when applying the max-supervisor limit, and adds an at_quota metric.
Also, when hitting quota, the scheduler reduces its self-imposed concurrency limit to the current number of running containers, so the limit only gets raised 10% at a time after that point. Essentially, once it sees its first quota error, its self-imposed limit will approximately track the cloud quota, even if the cloud quota rises (e.g., the quota is based on #CPUs so the effective instance limit is variable).
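To make the behavior above concrete, here is a minimal Python sketch of a self-imposed limit that is pinned on a quota error and then raised 10% at a time. This is not the a-d-c code: the class and method names are hypothetical, and the rule for when the limit gets raised is an assumption.

```python
class ConcurrencyLimit:
    """Hypothetical model of the scheduler's self-imposed concurrency limit."""

    def __init__(self, initial=None):
        # None means "unlimited", which is the starting state described above.
        self.limit = initial

    def on_quota_error(self, running):
        # Pin the limit to the number of containers running right now.
        self.limit = running

    def raise_limit(self):
        # Assumption: called when the scheduler has been running at its
        # limit without seeing a new quota error. Raises by 10%, at least 1.
        if self.limit is not None:
            self.limit += max(1, self.limit // 10)


lim = ConcurrencyLimit()
lim.on_quota_error(running=250)   # first quota error at 250 instances
lim.raise_limit()                 # later, probe upward by 10%
print(lim.limit)
```

If the cloud quota has not actually gone up, the next quota error pins the limit again, which is how the limit ends up tracking the quota.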
However, this still doesn't address the following fairly obvious/likely scenario:
- Set MaxInstances=1000, MaxSupervisors=0.4
- Queue has 400 supervisors when a-d-c starts up
- Scheduler starts 250 supervisors successfully
- Cloud reports quota error
- Scheduler reduces max supervisors target to 100, but it's too late
- Deadlock
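The numbers in that scenario can be worked through as follows. This is only an arithmetic sketch: `supervisor_cap` is a hypothetical helper, and it assumes a fractional MaxSupervisors is multiplied by the current instance limit and truncated. The 250-instance quota is borrowed from the test case in the description.

```python
def supervisor_cap(instance_limit, fraction):
    # Assumption: the supervisor cap is the fraction of the current
    # instance limit, truncated to an integer.
    return int(instance_limit * fraction)


MAX_INSTANCES = 1000
MAX_SUPERVISORS = 0.4
CLOUD_QUOTA = 250          # where the cloud starts returning quota errors
QUEUED_SUPERVISORS = 400

# At startup the cap is computed from MaxInstances, so all 400 are eligible.
cap_at_startup = supervisor_cap(MAX_INSTANCES, MAX_SUPERVISORS)

# The cloud only lets 250 of them start before the quota error.
started = min(QUEUED_SUPERVISORS, cap_at_startup, CLOUD_QUOTA)

# After the quota error the cap is recomputed from the 250 running instances.
cap_after_quota = supervisor_cap(started, MAX_SUPERVISORS)

# Every instance under quota is already a supervisor: no room for children.
free_for_children = CLOUD_QUOTA - started

print(cap_at_startup, started, cap_after_quota, free_for_children)
```

The new cap of 100 is correct but arrives after 250 supervisors are already running, and none of them can start a child container.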
I think we could address this by having a configurable initial value for the self-imposed concurrency limit (right now it's unlimited). A low initial value (say 8) would give a-d-c a "soft start" effect where it waits for the first few supervisors to start some child containers before pushing up the max_concurrent_containers limit and admitting more supervisors.
InitialQuotaEstimate or InitialConcurrencyLimit or ...?
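A rough simulation of the "soft start" idea, as a sketch under stated assumptions rather than the actual scheduler logic: `soft_start` is a hypothetical function, the limit is assumed to grow by 10% (at least 1) per step, the supervisor cap is assumed to be the truncated fraction of the current limit with a floor of 1, and a quota error is modeled by clamping the limit at the cloud quota.

```python
def soft_start(initial_limit, fraction, cloud_quota):
    """Return (limit, supervisor_cap) pairs as the limit ramps up to quota."""
    limit = initial_limit
    history = []
    while True:
        # Assumption: cap is a truncated fraction of the limit, never below 1.
        cap = max(1, int(limit * fraction))
        history.append((limit, cap))
        if limit >= cloud_quota:
            break
        # Raise by 10% (at least 1); a quota error pins the limit at quota.
        limit = min(cloud_quota, limit + max(1, limit // 10))
    return history


history = soft_start(initial_limit=8, fraction=0.4, cloud_quota=250)
print(history[0], history[-1], len(history))
```

With an initial value of 8, the supervisor cap starts at 3 and grows with the limit, so under these assumptions supervisors never take more than 40% of the instances that the cloud will actually provide.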
Updated by Tom Clegg almost 2 years ago
20667-maxsuper-atquota @ 5372ee7878d6880081dd4b5481e1820fc7cd1975 -- developer-run-tests: #3722
With InitialQuotaEstimate config.
Updated by Tom Clegg over 1 year ago
- Target version changed from Development 2023-07-05 sprint to Development 2023-07-19 sprint
Updated by Peter Amstutz over 1 year ago
Tom Clegg wrote in #note-6:
20667-maxsuper-atquota @ 5372ee7878d6880081dd4b5481e1820fc7cd1975 -- developer-run-tests: #3722
With InitialQuotaEstimate config.
This LGTM.
Sorry it took forever for me to get to.
Updated by Tom Clegg over 1 year ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|f9f0960543c846af8054832c22371c9bc6734615.