Bug #19973
closedDispatcher responds to 503 errors by limiting container concurrency
Added by Tom Clegg almost 2 years ago. Updated almost 2 years ago.
Description
When controller/RailsAPI returns 503, it is quite likely the dispatcher itself is causing an overload condition, and retrying is counterproductive -- especially when the operation being retried is something like "lock", which, if it succeeds, will be followed by a lot more load on controller/RailsAPI from the ensuing instances / crunch-run processes.
Proposed mitigation:- when dispatcher sees a 503 response from controller/RailsAPI, it reduces by 1/2 or 3/4 its internal upper limit on how many concurrent containers it should try to run
- when dispatcher hasn't seen a 503 response for N seconds, it increases its upper limit by 1, up to a configured maximum
Updated by Tom Clegg almost 2 years ago
- Related to Feature #19972: Go arvados.Client retry with backoff added
Updated by Brett Smith almost 2 years ago
Between this and #19972, I think both are good to do, and I would personally prefer to do #19972 first. A challenge with this one is it seems like it would be difficult for the dispatcher to estimate how much API traffic any given container will generate: how many collections does it read? In serial or parallel? How much parallelism? Is it just going to create a bunch of child container requests right off the bat? Without information like this, I feel like it's going to be challenging to find a good way to throttle or limit container spawns. I can imagine certain workloads where the proposed mitigation doesn't improve cluster health, it just moves the overload problem from the dispatcher to the running containers.
Updated by Tom Clegg almost 2 years ago
- Related to Feature #19984: Go arvados.Client responds to 503 errors by limiting outgoing connection concurrency added
Updated by Peter Amstutz almost 2 years ago
- Target version changed from Future to To be scheduled
Updated by Tom Clegg almost 2 years ago
Currently the dispatcher relentlessly schedules as many containers as it can, even when seeing 503 errors which are a nearly certain sign that scheduling more containers is counterproductive. That continues until it eventually empties the queue by exhausting the limit of lock/unlock cycles, each of which involves running crunch-run to failure on a cloud instance. That will still be true with #19972.
The idea here is that dispatcher can behave much better in this situation -- without being able to predict the railsapi load coming from a container -- just by not adding any new load at all during times when 503s are happening.
Updated by Peter Amstutz almost 2 years ago
- Target version changed from To be scheduled to 2023-02-15 sprint
- Assigned To set to Tom Clegg
Updated by Tom Clegg almost 2 years ago
- Target version changed from 2023-02-15 sprint to 2023-03-01 sprint
Updated by Tom Clegg almost 2 years ago
19973-dispatch-throttle @ bfb9e29c250bcfb34a6b1813ca46953503ca05e6 -- developer-run-tests: #3492
wb1 retry developer-run-tests-apps-workbench-integration: #3758
- SDK detects 503 responses and pauses/limits outgoing requests accordingly
- dispatchcloud notices whether the SDK has detected 503s lately, and adjusts its container concurrency limit accordingly
- in both cases, if there are never any 503s, concurrency is unlimited as before
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-10:
19973-dispatch-throttle @ bfb9e29c250bcfb34a6b1813ca46953503ca05e6 -- developer-run-tests: #3492
The branch looks good to me, thanks. I have a couple of questions that stem from me being a Go newbie, not branch comments. If you have a few minutes to help with these I'd appreciate it, if not no big deal.
- It took me embarrassingly long to figure out
Client.requestLimiter
had both name and typerequestLimiter
. (I got unlucky that thelimiter.go
came afterclient.go
in the diff.) I've seen tuple-ish structs that only list their contained types, but not a construction like this where a short declaration appears in the middle of full name+type declarations. What's this called? - What's the reasoning behind writing tests in multiple blocks in a single method over multiple test methods? I can imagine why you might want to do it if you do a complicated bit of one-time setup that all the test blocks use. IMO the setup here isn't anywhere close to that, but YMMV.
Updated by Brett Smith almost 2 years ago
Note for posterity for any future readers who are doing code archaeology and reading this ticket: the branch does not implement the mitigation in the description. Instead:
- When a request reports a 503, concurrent requests are limited to the current number of concurrent requests - 1.
- When a request reports success (200..400, though note http.Client should automatically handle 3xx requests so we don't expect those to be reported), the limit is increased 10%, capped at (2 * the maximum number of requests without a 503).
This is expected and understood.
Updated by Tom Clegg almost 2 years ago
Brett Smith wrote in #note-13:
- When a request reports a 503, concurrent requests are limited to the current number of concurrent requests - 1.
...to the current concurrency limit ÷ 2, or (on the first 503 when there is no concurrency limit in effect) the current number of concurrent requests ÷ 2.
Also, for the sake of completeness,
During the 1s interval immediately following a 503 response,- any additional 503 responses do not reduce the concurrency limit any further
- any new outgoing requests are delayed until the 1s interval ends
IOW, upon seeing any number of 503 responses in quick succession, we pause and reduce the concurrency limit by half.
Updated by Tom Clegg almost 2 years ago
Brett Smith wrote in #note-12:
- It took me embarrassingly long to figure out
Client.requestLimiter
had both name and typerequestLimiter
. (I got unlucky that thelimiter.go
came afterclient.go
in the diff.) I've seen tuple-ish structs that only list their contained types, but not a construction like this where a short declaration appears in the middle of full name+type declarations. What's this called?
It's an embedded struct. Nearly the same as saying "requestLimiter requestLimiter" with the additional behavior that the requestLimiter's methods also get attached to Client.
I didn't actually intend to export the Acquire/Release/Report methods to external users of Client, though. Fixed.
- What's the reasoning behind writing tests in multiple blocks in a single method over multiple test methods? I can imagine why you might want to do it if you do a complicated bit of one-time setup that all the test blocks use. IMO the setup here isn't anywhere close to that, but YMMV.
I was thinking of it as taking a single requestLimiter through a sequence of different weather patterns. But there didn't end up being any useful difference between that and separate tests. So I've rearranged it as separate tests.
19973-dispatch-throttle @ 87380299e967cf8921ffbfd98c52d0f2a13613e1 -- developer-run-tests: #3506
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-15:
19973-dispatch-throttle @ 87380299e967cf8921ffbfd98c52d0f2a13613e1 -- developer-run-tests: #3506
Thanks for taking the extra time, this looks even better. Please merge.
Updated by Tom Clegg almost 2 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|7c013f919b5db9336833dbe855349600449a993d.