Bug #19973 (closed)

Dispatcher responds to 503 errors by limiting container concurrency

Added by Tom Clegg about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Crunch
Target version: 2023-03-01 sprint
Story points: 1.0
Release relationship: Auto

Description

When controller/RailsAPI returns 503, it is quite likely the dispatcher itself is causing an overload condition, and retrying is counterproductive -- especially when the operation being retried is something like "lock", which, if it succeeds, will be followed by a lot more load on controller/RailsAPI from the ensuing instances / crunch-run processes.

Proposed mitigation (a rough Go sketch follows after this list):
  • when the dispatcher sees a 503 response from controller/RailsAPI, it reduces its internal upper limit on how many concurrent containers it should try to run by 1/2 or 3/4
  • when the dispatcher hasn't seen a 503 response for N seconds, it increases that upper limit by 1, up to a configured maximum
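
Below is a minimal sketch of the proposed scheme, for illustration only -- it is not the branch that was eventually merged (see notes #13 and #14), and all type, field, and function names are hypothetical:

    // containerThrottle is a hypothetical illustration of the proposed
    // mitigation: halve the concurrency limit on every 503, and raise it by 1
    // after N seconds without a 503, up to a configured maximum.
    package dispatch

    import (
        "sync"
        "time"
    )

    type containerThrottle struct {
        mtx           sync.Mutex
        limit         int           // current upper limit on concurrent containers
        configuredMax int           // configured hard maximum
        quietPeriod   time.Duration // the "N seconds" without a 503
        last503       time.Time
        lastRaise     time.Time
    }

    func newContainerThrottle(max int, quiet time.Duration) *containerThrottle {
        return &containerThrottle{limit: max, configuredMax: max, quietPeriod: quiet}
    }

    // Report503 is called whenever controller/RailsAPI returns 503; it halves
    // the limit (the description above also allows reducing to 1/4).
    func (t *containerThrottle) Report503() {
        t.mtx.Lock()
        defer t.mtx.Unlock()
        t.last503 = time.Now()
        if t.limit > 1 {
            t.limit /= 2
        }
    }

    // MaxConcurrent is consulted before starting another container. If no 503
    // has been seen for quietPeriod, the limit creeps back up by 1 per
    // quietPeriod, capped at configuredMax.
    func (t *containerThrottle) MaxConcurrent() int {
        t.mtx.Lock()
        defer t.mtx.Unlock()
        now := time.Now()
        if now.Sub(t.last503) > t.quietPeriod && now.Sub(t.lastRaise) > t.quietPeriod && t.limit < t.configuredMax {
            t.limit++
            t.lastRaise = now
        }
        return t.limit
    }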

Subtasks 1 (0 open, 1 closed)

Task #20114: Review 19973-dispatch-throttle (Resolved, Tom Clegg, 02/16/2023)

Related issues

Related to Arvados - Feature #19972: Go arvados.Client retry with backoff (Resolved, Tom Clegg, 03/08/2023)
Related to Arvados - Feature #19984: Go arvados.Client responds to 503 errors by limiting outgoing connection concurrency (Resolved, Tom Clegg, 02/21/2023)
#1

Updated by Tom Clegg about 1 year ago

  • Related to Feature #19972: Go arvados.Client retry with backoff added
#2

Updated by Brett Smith about 1 year ago

Between this and #19972, I think both are good to do, and I would personally prefer to do #19972 first. A challenge with this one is that it seems difficult for the dispatcher to estimate how much API traffic any given container will generate: how many collections does it read? In serial or parallel? How much parallelism? Is it just going to create a bunch of child container requests right off the bat? Without information like this, I feel like it's going to be challenging to find a good way to throttle or limit container spawns. I can imagine certain workloads where the proposed mitigation doesn't improve cluster health; it just moves the overload problem from the dispatcher to the running containers.

#3

Updated by Tom Clegg about 1 year ago

  • Related to Feature #19984: Go arvados.Client responds to 503 errors by limiting outgoing connection concurrency added
#4

Updated by Tom Clegg about 1 year ago

  • Story points set to 1.0
#5

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Future to To be scheduled
#6

Updated by Tom Clegg about 1 year ago

Currently the dispatcher relentlessly schedules as many containers as it can, even when seeing 503 errors, which are a nearly certain sign that scheduling more containers is counterproductive. That continues until it eventually empties the queue by exhausting the limit of lock/unlock cycles, each of which involves running crunch-run to failure on a cloud instance. That will still be true with #19972.

The idea here is that the dispatcher can behave much better in this situation -- without being able to predict the RailsAPI load coming from a container -- just by not adding any new load at all while 503s are happening.

#7

Updated by Peter Amstutz about 1 year ago

  • Target version changed from To be scheduled to 2023-02-15 sprint
  • Assigned To set to Tom Clegg
#8

Updated by Peter Amstutz about 1 year ago

  • Release set to 57
#9

Updated by Tom Clegg about 1 year ago

  • Target version changed from 2023-02-15 sprint to 2023-03-01 sprint
#10

Updated by Tom Clegg about 1 year ago

19973-dispatch-throttle @ bfb9e29c250bcfb34a6b1813ca46953503ca05e6 -- developer-run-tests: #3492
wb1 retry developer-run-tests-apps-workbench-integration: #3758

This includes #19984.
  • SDK detects 503 responses and pauses/limits outgoing requests accordingly
  • dispatchcloud notices whether the SDK has detected 503s lately, and adjusts its container concurrency limit accordingly (see the hypothetical sketch after this list)
  • in both cases, if there are never any 503s, concurrency is unlimited as before
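
Purely as a hypothetical illustration of that interaction -- none of the identifiers below are the real Arvados API, they just restate the bullets above in code form:

    // Hypothetical sketch: before deciding how many containers to run, the
    // scheduler asks the SDK client whether it has seen 503s recently and
    // caps container concurrency accordingly.
    package dispatchcloud

    import "time"

    // limiterState is what this sketch assumes the SDK exposes: the time of
    // the most recent 503 response it observed.
    type limiterState interface {
        LastServiceUnavailable() time.Time
    }

    type scheduler struct {
        client limiterState
    }

    // maxConcurrency returns the cap on concurrent containers, where 0 means
    // "no limit" -- the behavior when no 503 has ever been seen.
    func (sch *scheduler) maxConcurrency(running int) int {
        last := sch.client.LastServiceUnavailable()
        switch {
        case last.IsZero():
            return 0 // never saw a 503: unlimited, as before
        case time.Since(last) < time.Minute:
            // 503s are recent: hold at the current level rather than
            // adding new load
            if running < 1 {
                return 1
            }
            return running
        default:
            return 0 // quiet again: back to unlimited
        }
    }
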
#11

Updated by Tom Clegg about 1 year ago

  • Status changed from New to In Progress
#12

Updated by Brett Smith about 1 year ago

Tom Clegg wrote in #note-10:

19973-dispatch-throttle @ bfb9e29c250bcfb34a6b1813ca46953503ca05e6 -- developer-run-tests: #3492

The branch looks good to me, thanks. I have a couple of questions that stem from me being a Go newbie, not branch comments. If you have a few minutes to help with these I'd appreciate it, if not no big deal.

  • It took me embarrassingly long to figure out Client.requestLimiter had both name and type requestLimiter. (I got unlucky that the limiter.go came after client.go in the diff.) I've seen tuple-ish structs that only list their contained types, but not a construction like this where a short declaration appears in the middle of full name+type declarations. What's this called?
  • What's the reasoning behind writing tests in multiple blocks in a single method over multiple test methods? I can imagine why you might want to do it if you do a complicated bit of one-time setup that all the test blocks use. IMO the setup here isn't anywhere close to that, but YMMV.
#13

Updated by Brett Smith about 1 year ago

Note for posterity, for any future readers doing code archaeology on this ticket: the branch does not implement the mitigation in the description. Instead:

  • When a request reports a 503, concurrent requests are limited to the current number of concurrent requests - 1.
  • When a request reports success (200..400, though note that http.Client should automatically handle 3xx responses, so we don't expect those to be reported), the limit is increased by 10%, capped at (2 * the maximum number of concurrent requests seen without a 503).

This is expected and understood.

#14

Updated by Tom Clegg about 1 year ago

Brett Smith wrote in #note-13:

  • When a request reports a 503, concurrent requests are limited to the current number of concurrent requests - 1.

...to the current concurrency limit ÷ 2, or (on the first 503 when there is no concurrency limit in effect) the current number of concurrent requests ÷ 2.

Also, for the sake of completeness,

During the 1s interval immediately following a 503 response,
  • any additional 503 responses do not reduce the concurrency limit any further
  • any new outgoing requests are delayed until the 1s interval ends

IOW, upon seeing any number of 503 responses in quick succession, we pause and reduce the concurrency limit by half.
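
For future readers, here is a condensed paraphrase of that behavior in code -- this is not the actual SDK source, and the type and method names are only illustrative:

    // Condensed paraphrase of the limiter behavior described in notes #13 and
    // #14 (not the actual SDK source; names are illustrative).
    package sdk

    import (
        "sync"
        "time"
    )

    type limiterSketch struct {
        mtx      sync.Mutex
        cond     *sync.Cond
        inFlight int64
        limit    int64     // 0 = no limit established yet
        maxSeen  int64     // highest concurrency reached without a 503
        pauseEnd time.Time // end of the 1s quiet interval after a 503
    }

    func newLimiterSketch() *limiterSketch {
        rl := &limiterSketch{}
        rl.cond = sync.NewCond(&rl.mtx)
        return rl
    }

    // Acquire delays new outgoing requests while a post-503 pause is in
    // effect, then blocks until the concurrency limit (if any) permits
    // another request.
    func (rl *limiterSketch) Acquire() {
        rl.mtx.Lock()
        defer rl.mtx.Unlock()
        for {
            if wait := time.Until(rl.pauseEnd); wait > 0 {
                rl.mtx.Unlock()
                time.Sleep(wait)
                rl.mtx.Lock()
                continue
            }
            if rl.limit == 0 || rl.inFlight < rl.limit {
                break
            }
            rl.cond.Wait()
        }
        rl.inFlight++
    }

    // Report adjusts the limit according to the response status code.
    func (rl *limiterSketch) Report(code int) {
        rl.mtx.Lock()
        defer rl.mtx.Unlock()
        switch {
        case code == 503:
            if time.Now().Before(rl.pauseEnd) {
                return // further 503s within the 1s interval are ignored
            }
            if rl.limit == 0 {
                rl.limit = rl.inFlight // first 503: start from current concurrency
            }
            rl.limit /= 2
            if rl.limit < 1 {
                rl.limit = 1
            }
            rl.pauseEnd = time.Now().Add(time.Second)
        case code >= 200 && code < 400:
            if rl.inFlight > rl.maxSeen {
                rl.maxSeen = rl.inFlight
            }
            if rl.limit > 0 {
                // raise the limit ~10%, capped at 2 x the highest
                // concurrency seen without a 503
                raised := rl.limit + rl.limit/10 + 1
                if ceiling := 2 * rl.maxSeen; raised > ceiling {
                    raised = ceiling
                }
                if raised > rl.limit {
                    rl.limit = raised
                }
            }
        }
        rl.cond.Broadcast()
    }

    // Release is called when a request finishes.
    func (rl *limiterSketch) Release() {
        rl.mtx.Lock()
        rl.inFlight--
        rl.mtx.Unlock()
        rl.cond.Broadcast()
    }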

#15

Updated by Tom Clegg about 1 year ago

Brett Smith wrote in #note-12:

  • It took me embarrassingly long to figure out Client.requestLimiter had both name and type requestLimiter. (I got unlucky that the limiter.go came after client.go in the diff.) I've seen tuple-ish structs that only list their contained types, but not a construction like this where a short declaration appears in the middle of full name+type declarations. What's this called?

It's an embedded struct. Nearly the same as saying "requestLimiter requestLimiter" with the additional behavior that the requestLimiter's methods also get attached to Client.

I didn't actually intend to export the Acquire/Release/Report methods to external users of Client, though. Fixed.
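
For anyone else new to Go who finds this ticket, here is a tiny standalone example of the embedding pattern (generic names, not the actual Arvados types). Method promotion is also why the exported Acquire/Release/Report methods ended up visible on Client before the fix mentioned above:

    // Tiny standalone example of struct embedding and method promotion.
    package main

    import "fmt"

    type requestLimiter struct{ max int }

    func (rl *requestLimiter) Acquire() { fmt.Println("acquire, max =", rl.max) }

    type Client struct {
        requestLimiter // embedded: the field's name and type are both requestLimiter
        BaseURL        string
    }

    func main() {
        c := Client{requestLimiter: requestLimiter{max: 4}, BaseURL: "https://example.test"}
        c.Acquire()                // promoted method, callable as if Client defined it
        c.requestLimiter.Acquire() // explicit access via the embedded field
    }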

  • What's the reasoning behind writing tests in multiple blocks in a single method over multiple test methods? I can imagine why you might want to do it if you do a complicated bit of one-time setup that all the test blocks use. IMO the setup here isn't anywhere close to that, but YMMV.

I was thinking of it as taking a single requestLimiter through a sequence of different weather patterns. But there didn't end up being any useful difference between that and separate tests. So I've rearranged it as separate tests.

19973-dispatch-throttle @ 87380299e967cf8921ffbfd98c52d0f2a13613e1 -- developer-run-tests: #3506

#16

Updated by Brett Smith about 1 year ago

Tom Clegg wrote in #note-15:

19973-dispatch-throttle @ 87380299e967cf8921ffbfd98c52d0f2a13613e1 -- developer-run-tests: #3506

Thanks for taking the extra time, this looks even better. Please merge.

#17

Updated by Tom Clegg about 1 year ago

  • Status changed from In Progress to Resolved