Bug #20950

Updated by Peter Amstutz 8 months ago

I ran a workflow that started a bunch of m4.4xlarge nodes. At some point, it hit quota with "not able to get any more m4.4xlarge".

I stopped the workflow and tweaked it to use m4.2xlarge nodes.

Despite no recent quota errors or 503 errors, the dispatcher seems to hit a self-imposed ceiling slightly above the point where it had previously hit quota.

The expectation is that it would cautiously continue to attempt to boot nodes, but instead it seems to be stuck.

I guess this is the "wait for a node to go away before increasing quota" logic? Except this batch of nodes is going to continue to be used for a while (and when a task finishes, there's a queued task ready to replace it, so it's unlikely that any nodes will be shut down for several hours).
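
To make the difference between the observed and the expected behavior concrete, here is a rough toy model in Python. It is not the dispatcher's actual code: `CeilingModel`, its method names, and the 60-second `probe_interval` are all made up for illustration. Policy A only frees a slot when a node goes away, which is what I suspect is happening. Policy B additionally allows one cautious create attempt after a quiet period with no quota errors, which is what I expected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CeilingModel:
    """Toy model of a self-imposed instance ceiling (hypothetical, not dispatcher code)."""

    probe_interval: float = 60.0          # assumed quiet period (seconds) before one probe
    ceiling: Optional[int] = None         # None means no ceiling is in effect
    last_quota_error: float = float("-inf")
    last_probe: float = float("-inf")

    def on_quota_error(self, running: int, now: float) -> None:
        # The provider refused a create request: cap at what is currently running.
        self.ceiling = running
        self.last_quota_error = now

    def on_create_success(self, running: int) -> None:
        # A create succeeded, so capacity exists up to at least this count.
        if self.ceiling is not None and running > self.ceiling:
            self.ceiling = running

    def may_create_wait_only(self, running: int) -> bool:
        # Policy A: never exceed the ceiling; only a node going away frees a slot.
        return self.ceiling is None or running < self.ceiling

    def may_create_with_probe(self, running: int, now: float) -> bool:
        # Policy B: same as A, plus one cautious attempt per quiet interval.
        if self.may_create_wait_only(running):
            return True
        quiet = now - max(self.last_quota_error, self.last_probe)
        if quiet >= self.probe_interval:
            self.last_probe = now
            return True
        return False
```

With a long-running batch where every finished task is immediately replaced, `running` never drops below `ceiling`, so policy A never returns `True` again, while policy B raises the ceiling one successful probe at a time.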