Bug #20950

closed

dispatch cloud won't probe over quota

Added by Peter Amstutz 8 months ago. Updated 8 months ago.

Status: Closed
Priority: Normal
Assigned To: Tom Clegg
Category: Crunch
Story points: -

Description

I ran a workflow that initially starts a bunch of c4.large nodes to run workflows. MaxInstances starts at 2000, and we know that this subnet is sized for about 2048 addresses.

At some point, it hits quota with "not able to get any more c4.large nodes".

At this point, max concurrent containers drops from 2000 to 396. There are 339 running containers, and it is also booting about 54 nodes.

Because the max concurrent containers limit came down, a bunch of the booting c4.large nodes are aborted.
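
A minimal sketch of the clamping behavior this looks like (hypothetical names and logic, not the actual arvados-dispatch-cloud code): on a cloud quota error, the dynamic concurrency limit is pulled down to roughly the current number of running plus booting containers, and booting instances that the scheduler no longer wants under the reduced limit are shut down before they ever run anything.

<pre>
// Hypothetical sketch, not the real arvados-dispatch-cloud code.
package main

import "fmt"

type pool struct {
	maxConcurrency int // dynamic concurrency limit (starts at MaxInstances)
	running        int // containers currently running
	booting        int // instances still being created / booted
}

// onQuotaError clamps the limit to current usage, on the theory that the
// cloud provider will not hand out any more instances right now.
func (p *pool) onQuotaError() {
	p.maxConcurrency = p.running + p.booting
}

// unneededBooting reports how many in-flight boots should be aborted, given
// how many instances the scheduler still wants under the reduced limit.
func (p *pool) unneededBooting(stillWanted int) int {
	if n := p.running + p.booting - stillWanted; n > 0 {
		return n
	}
	return 0
}

func main() {
	p := &pool{maxConcurrency: 2000, running: 339, booting: 54}
	p.onQuotaError() // limit drops from 2000 to 393, close to the 396 observed above
	// With the limit reduced, the scheduler wants fewer supervisor nodes
	// (350 here is purely illustrative), so some booting c4.large instances
	// are aborted.
	fmt.Println("new limit:", p.maxConcurrency, "aborted boots:", p.unneededBooting(350))
}
</pre>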

After a minute, the AtQuota state clears, and it starts to boot m4.4xlarge nodes (because it has decided to stop trying to boot supervisor nodes and to start booting worker nodes instead; so far so good).

Over the course of the next ten minutes it boots about 137 m4.4xlarge nodes and the max concurrent containers adjusts upward to 484. Then it hits AtQuota again, this time not being able to get any more m4.4xlarge nodes.
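
The upward adjustment from 396 to 484 looks like the reverse of the clamping above. A hedged sketch, again with made-up names rather than the real dispatcher logic: while no quota error is in effect and usage is bumping against the limit, the limit is allowed to creep up so the pool can try a few more instances than it currently has.

<pre>
// Hypothetical sketch of the limit creeping back up between quota errors;
// not the actual dispatcher logic.
package main

import "fmt"

type pool struct {
	maxConcurrency int
	running        int
	booting        int
	atQuota        bool // true while a recent quota error is still "fresh"
}

// maybeRaiseLimit lets the limit grow a little past current usage whenever
// we are no longer at quota and usage has caught up with the limit.
func (p *pool) maybeRaiseLimit(headroom int) {
	if p.atQuota {
		return
	}
	if p.running+p.booting >= p.maxConcurrency {
		p.maxConcurrency = p.running + p.booting + headroom
	}
}

func main() {
	p := &pool{maxConcurrency: 396, running: 390, booting: 6}
	for i := 0; i < 5; i++ {
		p.maybeRaiseLimit(20)        // try 20 more instances each time we fill up
		p.running = p.maxConcurrency // pretend the new instances all booted
		p.booting = 0
		fmt.Println("limit now:", p.maxConcurrency)
	}
}
</pre>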

It remains in the AtQuota state for the next ten minutes or so. At this point, I notice that the number of new containers has flatlined, and I look at the logs and see the quota error for m4.4xlarge.

I cancelled the workflow and tweaked it to use smaller m4.2xlarge nodes.

Then I restarted the workflow. The controller itself was left alone (not restarted).

On the second run, the controller boots nodes steadily until hitting an apparently self-imposed cap of around 482 nodes.

During the second run, there were no quota errors, and most/all of the nodes from the previous run had been shut down.

The expectation is that it should cautiously continue to attempt to boot nodes, but instead it seems to be stuck.
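
What "cautiously continue" could look like, as a hedged sketch (hypothetical names and values; any real fix may differ): rather than staying pinned at the clamped limit indefinitely, the dispatcher lets the quota error expire after some cooldown and then periodically tries to create one instance above the current limit. If that succeeds the limit ratchets up; if it fails with another quota error, the cooldown clock restarts.

<pre>
// Hypothetical sketch of "cautiously probing" above a quota-imposed limit;
// not the actual arvados-dispatch-cloud implementation.
package main

import (
	"fmt"
	"time"
)

type pool struct {
	maxConcurrency int
	quotaErrorTime time.Time // when we last saw a quota error
}

const quotaErrorTTL = 10 * time.Minute // assumed cooldown before probing again

// canProbe reports whether enough time has passed since the last quota
// error that it is worth trying one instance above the current limit.
func (p *pool) canProbe(now time.Time) bool {
	return now.Sub(p.quotaErrorTime) > quotaErrorTTL
}

// tryCreate attempts one extra instance; on success the limit ratchets up,
// on failure the cooldown clock restarts.
func (p *pool) tryCreate(now time.Time, create func() error) {
	if !p.canProbe(now) {
		return
	}
	if err := create(); err != nil {
		p.quotaErrorTime = now // still at quota; wait another TTL
		return
	}
	p.maxConcurrency++
}

func main() {
	p := &pool{maxConcurrency: 484, quotaErrorTime: time.Now().Add(-15 * time.Minute)}
	p.tryCreate(time.Now(), func() error { return nil }) // pretend the cloud said yes
	fmt.Println("limit after successful probe:", p.maxConcurrency)
}
</pre>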

Actions #1

Updated by Peter Amstutz 8 months ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz 8 months ago

  • Category set to Crunch
  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2023-09-13 sprint to Development 2023-09-27 sprint
Actions #4

Updated by Peter Amstutz 8 months ago

  • Status changed from In Progress to New
Actions #5

Updated by Peter Amstutz 8 months ago

  • Assigned To set to Tom Clegg
Actions #6

Updated by Peter Amstutz 8 months ago

A major discrepancy between the first run and the second run:

On the first run, it shows about 200 m4.4xlarge nodes queued once it hits quota.

On the second run, there are no queued m4.2xlarge nodes.

This would be consistent with each workflow runner having exactly one subprocess, but I'm pretty sure the workflow starts by submitting two containers per sample -- so I'm re-running the workflow to verify the behavior now.

(after re-running the workflow)

Yes, what I'm seeing in Workbench is that there are 589 intermediate steps queued, all of which should be using m4.2xlarge nodes.

However, the monitoring says there are 0 queued m4.2xlarge nodes.

All of the workflow runners (c4.large nodes) are accounted for, either Locked or Running.

In Grafana, there are 166 locked m4.2xlarge nodes, but 0 queued. In Workbench, there are 682 queued intermediate steps.

The number of locked m4.2xlarge nodes does seem to be going up.

Actions #7

Updated by Peter Amstutz 8 months ago

  • Description updated (diff)
Actions #8

Updated by Peter Amstutz 8 months ago

I figured it out.

Due to a typo in the workflow input parameters, the first two container requests submitted for each sample were actually duplicates, and duplicate container requests are satisfied by a single container. So in fact the container queue had exactly a 1:1 ratio of workflow runners to workers, and it wasn't requesting more nodes because it didn't need more nodes.

Oops.

Actions #9

Updated by Peter Amstutz 8 months ago

  • Status changed from New to Closed