Bug #20950

Updated by Peter Amstutz 8 months ago

I ran a workflow that initially started a bunch of c4.large nodes to run workflows. MaxInstances starts at 2000, and we know that this subnet is sized for about 2048 addresses.

At some point, it hits quota with "not able to get any more c4.large nodes".

At this point, max concurrent containers drops from 2000 to 396. The number of running containers is 339, but it's also booting about 54 nodes.

Because the max concurrent containers limit came down, a bunch of the booting c4.large nodes end up being aborted.
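For context, here is a rough sketch of my mental model of that cap adjustment. Everything here (names, structure) is made up for illustration and is not the actual dispatch-cloud code; the idea is just that a quota error pulls the effective cap down to roughly the instances we already have (339 running + ~54 booting ≈ 393, close to the 396 I observed), leaving no headroom for the pending boots.

```go
// Illustrative-only sketch of the quota back-off as I understand it.
// None of these names come from the real dispatcher code.
package main

import "fmt"

type pool struct {
	maxInstances   int // configured MaxInstances (2000 in this case)
	maxConcurrency int // effective cap on concurrent containers/instances
	running        int // instances with containers running
	booting        int // instances still coming up
}

// onQuotaError models the cloud provider rejecting a create-instance
// call with a quota error: the effective cap is pulled down to roughly
// what we already have.
func (p *pool) onQuotaError() {
	have := p.running + p.booting
	if have < p.maxConcurrency {
		p.maxConcurrency = have
	}
}

// capacity returns how many more instances the scheduler is willing to
// ask for right now.
func (p *pool) capacity() int {
	n := p.maxConcurrency - (p.running + p.booting)
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	p := &pool{maxInstances: 2000, maxConcurrency: 2000, running: 339, booting: 54}
	p.onQuotaError()
	fmt.Println("cap after quota error:", p.maxConcurrency) // 393, close to the 396 I saw
	fmt.Println("room for more instances:", p.capacity())   // 0 -- consistent with the booting c4.large nodes being given up on
}
```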

After a minute, the AtQuota state clears, and it starts to boot m4.4xlarge nodes (because it has decided to stop trying to boot supervisor nodes and to boot worker nodes instead; so far so good).

Over the course of the next ten minutes it boots about 137 m4.4xlarge nodes, and the max concurrent containers adjusts upward to 484. Then it hits AtQuota again, this time not being able to get any more m4.4xlarge nodes.
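Again as a made-up sketch (same caveat: these are not real names, and the exact adjustment rule is a guess on my part), my read of this phase is that once AtQuota clears, each successful boot ratchets the cap slightly above what we currently have, until the next quota error pins it again. The provider limit below is a hypothetical number picked only to reproduce the same shape as what I saw.

```go
// Illustrative-only sketch of the cap creeping back up after AtQuota
// clears and then getting pinned by a second quota error. The numbers
// here are hypothetical, chosen to mimic what I observed.
package main

import "fmt"

const providerLimit = 476 // pretend cloud-side instance quota

type pool struct {
	maxConcurrency int // effective cap
	instances      int // running + booting
}

// tryAddInstance asks the (fake) cloud for one more instance. On
// success the cap ratchets a little past what we now have; on a quota
// error it is clamped back down to what we have.
func (p *pool) tryAddInstance() bool {
	if p.instances >= providerLimit {
		p.maxConcurrency = p.instances // quota error: pin the cap here
		return false
	}
	p.instances++
	if p.instances >= p.maxConcurrency {
		p.maxConcurrency = p.instances + 1
	}
	return true
}

func main() {
	// Start from roughly where the first quota error left things:
	// cap ~396, ~339 running nodes (the booting c4.large were dropped).
	p := &pool{maxConcurrency: 396, instances: 339}
	booted := 0
	for p.tryAddInstance() {
		booted++
	}
	fmt.Printf("booted %d more instances, cap now %d\n", booted, p.maxConcurrency)
	// With these made-up numbers this prints "booted 137 more instances,
	// cap now 476" -- roughly the shape of the ~137 m4.4xlarge boots and
	// the cap ending up in the high 400s before the second quota error.
}
```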

It remains in the AtQuota state for the next ten minutes or so. At this point, I notice that the number of new containers has flatlined, and I look at the logs and see the quota error for m4.4xlarge.

 I cancelled the workflow and tweaked it to use smaller m4.2xlarge nodes. 

Then I restarted the workflow. The controller is left alone.

On the second run, the controller boots nodes steadily until hitting an apparently self-imposed cap of around 482 nodes.

During the second run there have been no recent quota errors or 503 errors, and most or all of the nodes from the previous run have been shut down, yet the dispatcher seems to hit a self-imposed ceiling slightly above the point where it previously hit quota.

The expectation is that it would cautiously continue to attempt to boot nodes, but instead it seems to be stuck.

I guess this is the "wait for a node to go away before increasing quota" logic? Except this batch of nodes is going to continue to be used for a while (and when a task finishes, there's a queued task ready to replace it, so it's unlikely that any nodes will be shut down for several hours).
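For the record, here is a sketch of what I'm guessing that logic looks like; the names and numbers are made up and this is not the actual dispatcher code. The point is just that once a quota error pins the cap, nothing raises it again until some instance is actually shut down, and with a full queue that won't happen for hours.

```go
// Illustrative-only sketch of a "wait for a node to go away before
// raising the cap" rule. Not the real dispatcher code.
package main

import "fmt"

type pool struct {
	maxConcurrency int  // effective cap, carried over between workflow runs
	instances      int  // running + booting
	pinnedByQuota  bool // set by a quota error, cleared when an instance goes away
}

// onQuotaError pins the cap at roughly the current instance count.
func (p *pool) onQuotaError() {
	p.maxConcurrency = p.instances
	p.pinnedByQuota = true
}

// maybeRaiseCap is called when the queue wants more instances than the
// cap allows; while pinned, it refuses to probe above the old ceiling.
func (p *pool) maybeRaiseCap() {
	if !p.pinnedByQuota {
		p.maxConcurrency = p.instances + 1
	}
}

// onInstanceShutdown is, in this guess, the only event that unpins the cap.
func (p *pool) onInstanceShutdown() {
	p.instances--
	p.pinnedByQuota = false
}

func main() {
	// End of the first run: a quota error at ~484 instances pins the cap.
	p := &pool{maxConcurrency: 2000, instances: 484}
	p.onQuotaError()

	// Second run (smaller node type, dispatcher left alone): nodes boot
	// freely up to the leftover cap, then the queue wants more.
	p.instances = 482
	p.maybeRaiseCap()
	fmt.Println("cap:", p.maxConcurrency) // still 484 -- stuck, even with no new quota errors

	// Only when some instance finally goes away does the cap move again,
	// and in this workload that could take hours.
	p.onInstanceShutdown()
	p.maybeRaiseCap()
	fmt.Println("cap:", p.maxConcurrency) // tracks instances+1 again, so it can keep ratcheting upward
}
```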
