Bug #20950 (closed): dispatch cloud won't probe over quota

Added by Peter Amstutz 8 months ago. Updated 8 months ago.

Status:
Closed
Priority:
Normal
Assigned To:
Category:
Crunch
Story points:
-

Description

I ran a workflow that initially starts a bunch of c4.large nodes to run workflows. MaxInstances starts at 2000, and we know that this subnet is sized for about 2048 addresses.

At some point, it hits quota with "not able to get any more c4.large nodes".

At this point, max concurrent containers drops from 2000 to 396. There are 339 running containers, and about 54 nodes are booting.

Because the limit on running containers came down, a bunch of the booting c4.large nodes are aborted.

After a minute, the AtQuota state clears and it starts to boot m4.4xlarge nodes (because it has decided to stop booting supervisor nodes and to start booting worker nodes; so far so good).

Over the course of the next ten minutes it boots about 137 m4.4xlarge nodes, and the max concurrent containers adjusts upward to 484. Then it hits AtQuota again, this time because it can't get any more m4.4xlarge nodes.
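The cap behavior described so far can be modeled roughly like this. This is a hypothetical sketch, not the actual arvados-dispatch-cloud code: the class and method names are invented, and the exact clamping and growth rules are assumptions made to match the numbers observed above.

```python
# Hypothetical model of the concurrency-cap behavior described in this
# report. NOT the real arvados-dispatch-cloud implementation; names and
# rules are invented for illustration. Assumed behavior: on a cloud quota
# error the cap is clamped to roughly the number of instances that exist
# right now, and while boots keep succeeding the cap creeps back up
# toward MaxInstances.

MAX_INSTANCES = 2000  # configured MaxInstances from the report


class ConcurrencyCap:
    def __init__(self, max_instances=MAX_INSTANCES):
        self.max_instances = max_instances
        self.limit = max_instances  # start wide open

    def on_quota_error(self, running, booting):
        # Clamp to what we actually have. Booting instances above the
        # lowered cap have no container they are allowed to run, so the
        # dispatcher aborts them -- the aborted c4.large nodes above.
        self.limit = running + booting

    def on_success(self):
        # Cautiously raise the cap again while boots are succeeding.
        self.limit = min(self.max_instances, self.limit + 1)


cap = ConcurrencyCap()
cap.on_quota_error(running=339, booting=54)
print(cap.limit)  # clamped near the live instance count: 393 here

for _ in range(91):
    cap.on_success()
print(cap.limit)  # crept back up to 484 as more boots succeeded
```

The exact figures (396 in the report vs. 393 in this toy model) differ because the real clamping rule is unknown; the sketch only illustrates the down-then-up shape of the cap.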

It remains in AtQuota state for the next ten minutes or so. At this point, I notice that the number of new containers has flatlined, and I look at the logs and see the quota error for m4.4xlarge.

I cancelled the workflow and tweaked it to use smaller m4.2xlarge nodes.

Then I restarted the workflow; the controller was left alone.

On the second run, the controller boots nodes steadily until hitting an apparently self-imposed cap of around 482 nodes.

During the second run there were no quota errors, and most or all of the nodes from the previous run had been shut down.

The expectation is that it should cautiously continue attempting to boot nodes, but instead it seems to be stuck.
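The expected recovery might look something like the following sketch: once the last quota error has aged out, the dispatcher probes by raising its cap one step and trying another boot, instead of staying pinned at the old cap. Again, this is hypothetical; the class name, method names, and quiet-period length are all invented for illustration.

```python
# Hypothetical sketch of the *expected* recovery behavior: after a quota
# error ages out (here, a fixed quiet period), the dispatcher probes by
# raising its cap slightly and attempting another boot, rather than
# staying stuck. NOT the real arvados-dispatch-cloud code; names and the
# quiet period are invented.

QUOTA_QUIET_PERIOD = 60.0  # seconds to wait after a quota error (assumed)


class QuotaProber:
    def __init__(self, limit):
        self.limit = limit
        self.last_quota_error = None

    def note_quota_error(self, now):
        self.last_quota_error = now

    def may_probe(self, now):
        # No quota error seen recently -> safe to try growing again.
        return (self.last_quota_error is None
                or now - self.last_quota_error >= QUOTA_QUIET_PERIOD)

    def maybe_grow(self, now):
        if self.may_probe(now):
            self.limit += 1  # cautious single-step probe
        return self.limit


p = QuotaProber(limit=482)
p.note_quota_error(now=0.0)
print(p.maybe_grow(now=30.0))   # still inside quiet period: stays 482
print(p.maybe_grow(now=120.0))  # quiet period passed: probes up to 483
```

The bug reported here is that no such probing appears to happen: the cap stays at ~482 even with no recent quota errors.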
