atQuota should dynamically lower maxSupervisors
If the initial value of maxInstances is large and the dispatcher becomes cloud-quota limited (as opposed to getting 503 errors from the controller), it does not recompute maxSupervisors based on the actual number of instances that can be run. This can result in inefficient usage or starvation, because maxSupervisors does not reflect the actual capacity of the cluster.
It would also be nice to have a metric that indicates when we are unable to start nodes because we have gotten a capacity error from the cloud.
My test case ran out of quota at 250 nodes (InsufficientFreeAddressesInSubnet), and this wasn't apparent from the dashboard; I had to dig into the syslog, and even then it was hard to find.
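For visibility, this could be as simple as a counter exported alongside the dispatcher's other Prometheus metrics. A hedged sketch using the Prometheus Go client follows; the metric name, namespace, and registration point are all placeholders, not what a-d-c actually exposes:

```go
package dispatchcloud

import "github.com/prometheus/client_golang/prometheus"

// instanceQuotaErrors counts instance-create calls rejected by a cloud
// quota or capacity error, so the condition shows up on a dashboard
// instead of only in syslog. All names here are illustrative.
var instanceQuotaErrors = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "arvados",
	Subsystem: "dispatchcloud",
	Name:      "instance_create_quota_errors_total",
	Help:      "Instance create calls rejected by a cloud quota or capacity error.",
})

func init() {
	prometheus.MustRegister(instanceQuotaErrors)
}
```

The counter would be incremented wherever the create-instance error is classified as a quota/capacity error.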
This takes AtQuota() into account when applying the max-supervisor limit, and adds a metric counting instance-create attempts that fail with a cloud capacity/quota error.
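Roughly, the limit computation amounts to the following sketch (all identifiers are illustrative stand-ins for the scheduler's real inputs, not the actual code):

```go
// maxSupervisors returns the supervisor cap, sized from the cloud's
// effective capacity once a quota error has been seen, rather than
// from the configured maximum. Illustrative names only.
func maxSupervisors(atQuota bool, curInstances, maxInstances int, fraction float64) int {
	limit := maxInstances
	if atQuota && curInstances < limit {
		// The cloud won't start more instances than we have now, so
		// compute the cap from the current instance count.
		limit = curInstances
	}
	return int(float64(limit) * fraction)
}
```

With the scenario below (MaxInstances=1000, fraction 0.4, quota error at 250 instances), this yields a cap of 100 rather than 400.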
Also, when hitting quota, the scheduler reduces its self-imposed concurrency limit to the current level, so it only gets raised 10% at a time after that point. Essentially, once it sees its first quota error, its self-imposed limit will approximately track the cloud quota, even if the cloud quota rises (e.g., if the quota is based on #CPUs, the effective instance limit is variable); see the sketch after the scenario below. However, this still doesn't address the following fairly obvious/likely scenario:
- Set MaxInstances=1000, MaxSupervisors=0.4
- Queue has 400 supervisors when a-d-c starts up
- Scheduler starts 250 supervisors successfully
- Cloud reports quota error
- Scheduler reduces its max-supervisors target to 100 (0.4 × 250), but it's too late: 250 supervisors are already running
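For reference, the tracking behavior described above comes down to something like this sketch (field names are illustrative, not the actual scheduler struct):

```go
// scheduler's self-imposed limit; 0 means unlimited (no quota error
// seen yet).
type scheduler struct {
	maxConcurrency int
}

// updateMaxConcurrency clamps the limit to the current instance count
// on a quota error; otherwise it probes upward by ~10% per pass so the
// limit can follow a rising cloud quota.
func (sch *scheduler) updateMaxConcurrency(atQuota bool, curInstances int) {
	switch {
	case atQuota:
		sch.maxConcurrency = curInstances
	case sch.maxConcurrency > 0:
		step := sch.maxConcurrency / 10
		if step < 1 {
			step = 1 // let small limits grow too
		}
		sch.maxConcurrency += step
	}
}
```

In the scenario above, the clamp (here to 250) only arrives after 250 supervisors are already running.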
I think we could address this by having a configurable initial value for the self-imposed concurrency limit (right now it's unlimited). A low initial value (say 8) would give a-d-c a "soft start" effect, where it waits for the first few supervisors to start some child containers before pushing up the max_concurrent_containers limit and admitting more supervisors.
InitialConcurrencyLimit or ...?
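Whatever it ends up being called, and building on the sketch above, the soft start might look like this (InitialConcurrencyLimit is a placeholder pending that naming question):

```go
// newScheduler seeds the self-imposed limit from a hypothetical config
// knob instead of leaving it unlimited, so the ~10% growth only admits
// more supervisors after the first few have started child containers.
func newScheduler(initialConcurrencyLimit int) *scheduler {
	sch := &scheduler{}
	if initialConcurrencyLimit > 0 {
		sch.maxConcurrency = initialConcurrencyLimit // e.g. 8
	}
	return sch
}
```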