Bug #20511

closed

High number of "aborted" boot outcomes

Added by Peter Amstutz 20 days ago. Updated about 11 hours ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
05/25/2023
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

I don't know how to interpret this but with arvados-dispatch-cloud running a large job (MaxInstances=400) I am seeing a trend of roughly two "aborted" instances for every "successful" instance (arvados_dispatchcloud_boot_outcomes metric) -- in other words the "aborted" line is growing twice as fast as the "successful" line.

I'm wondering if this is related to #20457 and some kind of churn at the top of the queue.

edit: I'm looking at the code and it looks like "aborted" might just mean the node was shut down intentionally, is that right? (this feels like a bad choice of terminology since "aborted" is usually used to mean terminating from an error condition).

I'm trying to understand why the numbers are out of balance, shouldn't there be 1 shutdown for every 1 successful startup?


Subtasks 1 (0 open, 1 closed)

Task #20549: Review 20511-aborted-boot (Resolved, Tom Clegg, 05/25/2023)

Actions #1

Updated by Peter Amstutz 20 days ago

  • Category set to Crunch
  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 20 days ago

  • Description updated (diff)
Actions #3

Updated by Tom Clegg 17 days ago

shouldn't there be 1 shutdown for every 1 successful startup?

"Boot outcome" is meant to track what happens after we create an instance:
  • succeeded -- boot probe succeeded
  • failed -- reached boot timeout before boot probe succeeded
  • aborted -- we decided we didn't need the instance and shut it down before either of the above happened
  • disappeared -- instance vanished in the cloud before any of the above happened

Each created instance counts toward exactly one of these metrics.
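The accounting Tom describes can be sketched as follows. This is an illustrative simplification, not the actual arvados-dispatch-cloud source: the real code increments a Prometheus counter vector (the arvados_dispatchcloud_boot_outcomes metric) rather than a map, but the invariant is the same -- each created instance is reported exactly once, under exactly one outcome.

```go
package main

import "fmt"

// bootOutcome is the final disposition of one created instance.
type bootOutcome string

const (
	outcomeSucceeded   bootOutcome = "success"     // boot probe succeeded
	outcomeFailed      bootOutcome = "failure"     // boot timeout reached first
	outcomeAborted     bootOutcome = "aborted"     // shut down intentionally before either
	outcomeDisappeared bootOutcome = "disappeared" // vanished in the cloud before either
)

type pool struct {
	bootOutcomes map[bootOutcome]int
}

// reportBootOutcome records the final outcome for one instance.
// Because every instance passes through here exactly once, the four
// counters always sum to the total number of instances created.
func (p *pool) reportBootOutcome(o bootOutcome) {
	if p.bootOutcomes == nil {
		p.bootOutcomes = map[bootOutcome]int{}
	}
	p.bootOutcomes[o]++
}

func main() {
	p := &pool{}
	p.reportBootOutcome(outcomeSucceeded)
	p.reportBootOutcome(outcomeAborted)
	p.reportBootOutcome(outcomeAborted) // the 2:1 aborted:success ratio observed above
	fmt.Println(p.bootOutcomes[outcomeAborted], p.bootOutcomes[outcomeSucceeded]) // 2 1
}
```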

Actions #4

Updated by Tom Clegg 17 days ago

This does sound like it could be entirely caused by queue churn -- see #20457#note-30.

Actions #6

Updated by Peter Amstutz 14 days ago

  • Assigned To set to Tom Clegg
Actions #7

Updated by Peter Amstutz 10 days ago

  • Status changed from New to In Progress
Actions #8

Updated by Peter Amstutz 10 days ago

So the fix from https://dev.arvados.org/issues/20457#note-34 may have helped; I'm not sure.

However, I'm still seeing a lot of aborted bootups.

This is a problem for a couple of reasons:

  1. time/money wasted on requesting a node that doesn't get used
  2. time spent shutting down a node and booting a new one, when we could potentially have scheduled something immediately on the node that was already booting/idle
  3. more churn in cloud API activity, which potentially means hitting cloud API throttling, which slows everything down

The current design favors running higher-priority containers over using available resources, but I think we should reconsider that: as we try to hit high throughput, it's potentially a lot of time being wasted.

Actions #9

Updated by Peter Amstutz 10 days ago

It was suggested at standup that this could be a stats monitoring error: something else, like failed starts due to quota errors, could be getting counted as aborted instances.

Actions #10

Updated by Peter Amstutz 9 days ago

So, after deploying a new dispatcher that prefers to run locked containers (20457-queue-churn), I am still seeing a very high number of "aborted" boot outcomes, almost a 2:1 ratio as reported earlier.

As mentioned, it's unclear if these are real aborted starts or a measurement error.

If they are real aborted starts, that's unfortunate: that's potentially a lot of wasted time. It is highly likely there is a lot of churn at the top of the queue, since there is both a large backlog of workflow runners waiting to run and a large number of running workflows submitting new steps, and the steps of running workflows have priority over the workflows waiting to run.

Actions #11

Updated by Peter Amstutz 9 days ago

I looked at the run_queue code, here's a theory about what's happening.

  1. a bunch of nodes are in booting state
  2. we get a cloud quota error. we're now in "quota error" state
  3. these nodes are not booted yet, so they are not eligible to have containers run on them
  4. because we are in "over quota" state, all the "idle" workers are shut down -- but this includes both idle and booting!

So as a result, the booting nodes that would have been used get killed off, despite the fact that the cloud clearly had capacity when those booting nodes were started.

Thoughts:

  1. I don't think we want to kill those booting nodes as if they were idle. They were booted for a reason; I don't know the reasoning behind this behavior, but it seems wrong.
  2. The idle nodes, if not mapped to a container, are certainly eligible for shutdown, but I think throughput would be better if the scheduler made more of an effort to find a container to run on them. There is a note about not starving lower-priority containers, but I think throughput is more important than strict adherence to priority order. (Maybe we can balance it, e.g. limit the lookahead to containers that are close in priority rather than taking just anything.)
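The behavior described in step 4 of the theory above can be sketched like this. The names here are illustrative, not the actual run_queue source; the point is that the over-quota branch treats booting workers exactly like idle ones:

```go
package main

import "fmt"

type workerState int

const (
	stateBooting workerState = iota // instance created, boot probe not yet succeeded
	stateIdle                       // booted, no container assigned
	stateRunning                    // running a container
)

type worker struct {
	id    string
	state workerState
}

// shutdownSurplus sketches the over-quota behavior described above:
// when the pool is in an over-quota condition, every worker not running
// a container is shut down -- including workers that are still booting,
// which is the surprising part. Returns the IDs that were shut down.
func shutdownSurplus(workers []worker, overQuota bool) []string {
	var shut []string
	if !overQuota {
		return shut
	}
	for _, w := range workers {
		// Booting workers are treated as surplus too, even though they
		// were started for a reason and would have been used.
		if w.state == stateIdle || w.state == stateBooting {
			shut = append(shut, w.id)
		}
	}
	return shut
}

func main() {
	ws := []worker{{"a", stateBooting}, {"b", stateIdle}, {"c", stateRunning}}
	// The booting worker "a" is killed along with the idle worker "b".
	fmt.Println(shutdownSurplus(ws, true)) // [a b]
}
```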
Actions #12

Updated by Tom Clegg 8 days ago

  • Target version changed from Development 2023-05-24 sprint to Development 2023-06-07
Actions #13

Updated by Peter Amstutz 8 days ago

I ran another test with "trace" log level.

  1. It looks like the dispatcher is creating a transient 503 condition itself: it seems to have tried to lock 72 containers simultaneously, when the request limit is 64.
  2. This pins maxConcurrency at 787 (was 0 before, which I guess just means uncapped)
  3. next runQueue has Containers: 950, Processes: 788, maxConcurrency: 787
  4. Then we get "not starting: already at maxConcurrency"
  5. Then we get "unlock because pool capacity is used by higher priority containers" (but only unlocks 1 container apparently)
  6. Because maxConcurrency was exceeded, everything else in the queue is "overQuota"
  7. This also means everything that was "idle" gets shut down
  8. After a minute it tries to bump maxConcurrency up to 866
  9. It immediately sandbags the API server again with a surge of lock requests
  10. This causes it to readjust back down to 786, because it uses (running-1)

This makes me think something similar is happening on xngs1, each time it gets a 503 error back, all the idle nodes get deleted.

The peaks of "not allocated, over quota" are at 1 minute intervals, which seems to be the interval that it tries to increase maxConcurrency.

The one thing I still can't account for is how the spike in the "not allocated over quota" metric happens.

Actions #14

Updated by Peter Amstutz 8 days ago

Some additional thoughts

Clearly, the solution is to add a throttle to lock requests, and to look over the code to make sure there isn't anything else the dispatcher could be doing that would result in an uncapped number of concurrent API requests.

The conditions for this to happen seem to be some combination of container queue size (the dispatcher can send out a certain number of simultaneous lock requests before overwhelming the API server request queue) and API server busyness with other work.

This happens so quickly that it falls below the sampling rate of the metrics, which made it hard to see.

This possibly also explains the irregular surges in activity on xngs1 the other day: the dispatcher would try to lock a bunch of containers, which would sometimes (but not always) overload the request queue, and that is just enough to tip everything else into a crisis state for a minute or two.

Instead of aggressively shutting down nodes when we're in an over-quota state, I think we should leave them alone and just let the idle timeout expire naturally. If the over-quota state relaxes soon enough, the nodes will be ready to go, instead of having been terminated only to be re-started 1 or 2 minutes later (and then having to wait 1 to 2 minutes to boot again, as well as drawing down our cloud API rate limiter for instance starts). The current behavior results in a yo-yo: the max-capacity state relaxes, we start booting some nodes, then we get another 503 error, then all the nodes we just booted get killed, rinse and repeat. Leaving them up would at least slow down that cycle. (I think this is particularly egregious because the 503-induced max-capacity state lasts only 1 minute; if it were longer, it might make more sense to delete nodes we are not going to use for several minutes.)

This also seems like more evidence that we should add some kind of "surge protector" to controller so that a surge in activity on a specific API call or resource doesn't push out everything else. It might be useful to follow the model of AWS, which applies request rate limits to individual API requests.

Actions #15

Updated by Tom Clegg 7 days ago

20511-aborted-boot @ 40dcd44173ea6d6cdc53206766875151eac7b0fd -- developer-run-tests: #3663
  • arvados.Client request limiter increases capacity in max 10% increments up to double the known-achievable concurrency level (this was the intended behavior but an incorrect comparison prevented the 10% limit from working)
  • arvados.Client starts with a limit of 8 concurrent outgoing requests and raises it from there, rather than starting with unlimited concurrency until the first 503 error
  • When used by an internal service (like dispatcher), arvados.Client limits its outgoing request concurrency to 25% of the cluster's MaxConcurrentRequests config
  • arvados-dispatch-cloud does not shut down instances or unlock containers based on the dynamic maxConcurrency limit. (It still shuts down excess instances if it can't start new instances due to provider quota or configured max instances, though, because in that case the excess instances are preventing it from creating the instances that are needed.)
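The limiter behavior described in the first two bullets can be sketched roughly as follows. This is a simplification under stated assumptions, not the actual arvados.Client request limiter (which also has to track in-flight requests and honor Retry-After): start at a small fixed limit, grow by at most 10% at a time, never exceed twice the highest concurrency known to have succeeded, and back off below the observed concurrency on a 503.

```go
package main

import "fmt"

// adaptiveLimit sketches the capacity adjustment described above.
type adaptiveLimit struct {
	limit      int // current max concurrent outgoing requests
	achievable int // highest concurrency known to succeed without a 503
}

// newAdaptiveLimit starts at 8 concurrent requests rather than
// unlimited concurrency, per the second bullet above.
func newAdaptiveLimit() *adaptiveLimit {
	return &adaptiveLimit{limit: 8, achievable: 8}
}

// onSuccess raises the limit by at most 10% (at least 1), capped at
// double the known-achievable concurrency level.
func (a *adaptiveLimit) onSuccess(concurrent int) {
	if concurrent > a.achievable {
		a.achievable = concurrent
	}
	next := a.limit + a.limit/10
	if next == a.limit {
		next = a.limit + 1
	}
	if ceiling := 2 * a.achievable; next > ceiling {
		next = ceiling
	}
	a.limit = next
}

// on503 drops the limit to just below the concurrency that triggered
// the error (the "(running-1)" readjustment seen in the trace above).
func (a *adaptiveLimit) on503(concurrent int) {
	a.limit = concurrent - 1
	if a.limit < 1 {
		a.limit = 1
	}
}

func main() {
	a := newAdaptiveLimit()
	a.onSuccess(8)
	fmt.Println(a.limit) // 9: grows gently instead of jumping straight to unlimited
}
```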
Actions #16

Updated by Peter Amstutz 7 days ago

Tom Clegg wrote in #note-15:

20511-aborted-boot @ 40dcd44173ea6d6cdc53206766875151eac7b0fd -- developer-run-tests: #3663
  • arvados.Client request limiter increases capacity in max 10% increments up to double the known-achievable concurrency level (this was the intended behavior but an incorrect comparison prevented the 10% limit from working)
  • arvados.Client starts with a limit of 8 concurrent outgoing requests and raises it from there, rather than starting with unlimited concurrency until the first 503 error
  • When used by an internal service (like dispatcher), arvados.Client limits its outgoing request concurrency to 25% of the cluster's MaxConcurrentRequests config
  • arvados-dispatch-cloud does not shut down instances or unlock containers based on the dynamic maxConcurrency limit. (It still shuts down excess instances if it can't start new instances due to provider quota or configured max instances, though, because in that case the excess instances are preventing it from creating the instances that are needed.)

This continues to treat "overmaxsuper" as the same as an "overquota" condition, which I don't think is right.

It does make sense to unlock supervisors in this situation; however, I don't think it makes sense to shut down idle/booting nodes as if we were at cloud quota.

Actions #17

Updated by Peter Amstutz 7 days ago

In other words, what I'm saying is that the "for it, n := range unalloc" block should only be conditional on "len(overquota) > 0".
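As a sketch of the proposed change, assuming the loop looks roughly like the run_queue code being discussed (the variable names come from the comment above; everything else here is illustrative): the shutdown loop should be gated on a genuine over-quota condition only, not on the supervisor limit.

```go
package main

import "fmt"

// shutdownUnalloc is an illustrative sketch, not the actual run_queue
// source. unalloc maps an instance type to its count of unallocated
// workers; overquota holds containers blocked by a genuine cloud-quota
// condition; overmaxsuper holds supervisors held back only by the
// MaxSupervisors limit. It returns the instance types shut down, one
// entry per worker.
func shutdownUnalloc(unalloc map[string]int, overquota, overmaxsuper []string) []string {
	var shut []string
	// Proposed: only shut down unallocated workers when we are truly
	// over cloud quota. Being over the supervisor limit (overmaxsuper)
	// is not a reason to kill idle/booting nodes.
	if len(overquota) > 0 {
		for it, n := range unalloc {
			for i := 0; i < n; i++ {
				shut = append(shut, it)
			}
		}
	}
	return shut
}

func main() {
	unalloc := map[string]int{"t3.large": 2}
	// Over the supervisor limit only: nothing should be shut down.
	fmt.Println(len(shutdownUnalloc(unalloc, nil, []string{"wf1"}))) // 0
}
```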

Actions #18

Updated by Tom Clegg 7 days ago

Agreed, fixed.

While testing that fix I stumbled on another MaxSupervisors bug (introduced with the "queue churn" fix) that may help explain some things: we stopped counting supervisor containers against the maxsuper limit properly once they entered Running state. This would have allowed us to start too many supervisor processes if Queued supervisors had higher priority than Running ones. Fixed here.

20511-aborted-boot @ 924951b429e4311d51935555fa5605825622be0c -- developer-run-tests: #3668
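The MaxSupervisors bug Tom describes can be illustrated like this. This is a hypothetical simplification of the scheduler's counting, not the actual source: the fix is that supervisor containers must keep counting against the limit after they enter Running state, not only while Queued/Locked.

```go
package main

import "fmt"

type ctr struct {
	supervisor bool // e.g. a workflow runner
	running    bool // has entered Running state
}

// canStartSupervisor sketches the fixed behavior (illustrative names):
// count every supervisor in the queue, including those already Running,
// against the maxSupervisors limit before admitting another one. The
// bug was equivalent to skipping the running ones in this count.
func canStartSupervisor(queue []ctr, maxSupervisors int) bool {
	n := 0
	for _, c := range queue {
		if c.supervisor { // running and not-yet-running supervisors both count
			n++
		}
	}
	return n < maxSupervisors
}

func main() {
	queue := []ctr{{supervisor: true, running: true}, {supervisor: true, running: true}}
	// Two supervisors already running against a limit of 2: no more allowed.
	fmt.Println(canStartSupervisor(queue, 2)) // false
}
```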

Actions #19

Updated by Peter Amstutz 6 days ago

Tom Clegg wrote in #note-18:

Agreed, fixed.

While testing that fix I stumbled on another MaxSupervisors bug (introduced with the "queue churn" fix) that may help explain some things: we stopped counting supervisor containers against the maxsuper limit properly once they entered Running state. This would have allowed us to start too many supervisor processes if Queued supervisors had higher priority than Running ones. Fixed here.

20511-aborted-boot @ 924951b429e4311d51935555fa5605825622be0c -- developer-run-tests: #3668

Retried the FUSE test here developer-run-tests-services-fuse: #3714

That should be completely unrelated to these changes so I guess it's just bad luck?

Good call on sorting running containers before locked containers before queued containers.

This LGTM.

Actions #20

Updated by Peter Amstutz 2 days ago

  • Status changed from In Progress to Resolved
Actions #21

Updated by Peter Amstutz about 11 hours ago

  • Release set to 64