Bug #20667

Updated by Peter Amstutz 11 months ago

If the initial value of maxInstances is large and the dispatcher becomes cloud quota limited (as opposed to getting 503 errors from the controller), it does not recompute maxSupervisors based on the number of instances that can actually run. This can result in inefficient usage or starvation, because maxSupervisors does not reflect the actual limits of the cluster.
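A minimal sketch of the recomputation being asked for, assuming maxSupervisors is derived as a fraction of the instance limit. The names `effective_max_supervisors`, `supervisor_fraction`, and `quota_limit` are hypothetical and are not taken from the dispatcher's code; the sketch only illustrates capping the calculation at the observed limit.

```python
import math
from typing import Optional


def effective_max_supervisors(
    max_instances: int,
    supervisor_fraction: float,
    quota_limit: Optional[int] = None,
) -> int:
    """Recompute the supervisor cap from the instance count that can actually run.

    quota_limit is the instance count at which the cloud last refused to
    start a node, or None if no capacity error has been seen.
    All names here are hypothetical, not the dispatcher's own.
    """
    limit = max_instances
    if quota_limit is not None:
        # Use the smaller of the configured limit and the observed cloud limit.
        limit = min(limit, quota_limit)
    return math.floor(limit * supervisor_fraction)
```

Under these assumptions, with maxInstances at 1000, a supervisor fraction of 0.5, and quota exhausted at 250 nodes, the cap would drop from 500 to 125 instead of staying at a value the cluster cannot reach.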

It would also be nice to have a metric that shows when we are unable to start nodes because the cloud has returned a capacity error.

My test case ran out of quota at 250 nodes (InsufficientFreeAddressesInSubnet), and this wasn't apparent from the dashboard. I had to dig into the syslog, and even then it was hard to find.
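One way the suggested capacity metric could look, as a rough sketch only. The class `CapacityErrorMetric`, its fields, and the set `CAPACITY_ERROR_CODES` are hypothetical; the only error code listed is the one seen in the test case above, and a real implementation would need the full set of capacity and quota errors for the cloud in use.

```python
from collections import Counter

# Hypothetical set of cloud error codes treated as capacity/quota errors.
# Only the code observed in this report is listed.
CAPACITY_ERROR_CODES = {"InsufficientFreeAddressesInSubnet"}


class CapacityErrorMetric:
    """Tracks whether instance creation is currently blocked by cloud capacity."""

    def __init__(self):
        self.counts = Counter()   # capacity errors seen, keyed by error code
        self.at_capacity = False  # value a dashboard gauge could expose

    def record_create_result(self, error_code=None):
        """Record one instance-create attempt; error_code is None on success."""
        if error_code in CAPACITY_ERROR_CODES:
            self.counts[error_code] += 1
            self.at_capacity = True
        elif error_code is None:
            # A successful create means capacity is available again.
            self.at_capacity = False
        # Other errors leave the capacity state unchanged.
```

Exposing `at_capacity` as a gauge and `counts` as a counter would make the quota limit visible on the dashboard without a search through the syslog.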
