Bug #21023
Status: Closed
Go keepclient retry does not wait between retries
Description
In sdk/go/keepclient getOrHead(), when a request fails with a retryable error, we don't wait between retries. We should use a randomized exponential backoff for congestion control similar to what we did with the API client.
This was found when looking at a program that talked to keep directly and made so many keepstore requests that it filled up the request queue, then started getting 503 errors. Because keepclient retried immediately instead of waiting, it did not give the overwhelmed keepstore a chance to drain its own queue and catch up.
Related issues
Updated by Peter Amstutz about 1 year ago
- Related to Feature #19972: Go arvados.Client retry with backoff added
Updated by Peter Amstutz almost 1 year ago
- Target version changed from Development 2023-10-25 sprint to Development 2023-11-08 sprint
Updated by Peter Amstutz 12 months ago
- Target version changed from Development 2023-11-08 sprint to Development 2023-11-29 sprint
Updated by Peter Amstutz 11 months ago
- Target version changed from Development 2023-11-29 sprint to Development 2024-01-03 sprint
Updated by Peter Amstutz 11 months ago
- Target version changed from Development 2024-01-03 sprint to Development 2024-01-17 sprint
Updated by Peter Amstutz 10 months ago
- Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Updated by Peter Amstutz 10 months ago
- Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Updated by Peter Amstutz 10 months ago
- Target version changed from Development 2024-02-14 sprint to Development 2024-01-31 sprint
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Updated by Tom Clegg 8 months ago
21023-keepclient-retry-delay @ 570d9570f6655a6e9ca4365173406e67dd0e0e31 -- developer-run-tests: #4034
21023-keepclient-retry-delay @ 570d9570f6655a6e9ca4365173406e67dd0e0e31 -- developer-run-tests: #4038
(CRAN is failing)
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Updated by Tom Clegg 8 months ago
21023-keepclient-retry-delay @ 22361307cf41f916afd562e7f33fcdaacefe5f9d -- developer-run-tests: #4041
- All agreed upon points are implemented / addressed.
- Adds delay between attempts to read or write a block
- Initial delay defaults to 2s, caller can override via KeepClient.RetryDelay (or keepclient.DefaultRetryDelay if they really want to be awful and change a global)
- After each attempt, delay increases by a random factor between 1.75x and 2x, but maxes out at 10x initial delay
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- n/a
- Code is tested and passing, both automated and manual, what manual testing was done is described
- ✅ Get/put retry tests are updated to check total time to do 3 retries for a few values of RetryDelay
- Documentation has been updated.
- New keepclient.KeepClient field is commented
- Behaves appropriately at the intended scale (describe intended scale).
- Expected to improve behavior at any given scale by a) not giving up as quickly during busy periods and b) thinning out thundering herds via randomized backoff.
- Considered backwards and forwards compatibility issues between client and server.
- ✅ n/a
- Follows our coding standards and GUI style guidelines.
- ✅
I updated the keepstore response codes so "full" causes 507 "Insufficient Storage" instead of 503. It's kind of tangential, but I noticed that the Python client considers 503 retryable and 507 non-retryable, which makes sense, so I figured we should a) do that in the Go code as well, and b) have keepstore actually return that code when full.
keepproxy uses an initial retry delay of 1s. This is kind of arbitrary. Should it use the regular default 2s instead?
Updated by Brett Smith 8 months ago
Tom Clegg wrote in #note-17:
21023-keepclient-retry-delay @ 22361307cf41f916afd562e7f33fcdaacefe5f9d -- developer-run-tests: #4041
After each attempt, delay increases by a random factor between 1.75x and 2x, but maxes out at 10x initial delay
I believe the industry gold standard algorithm for this is "exponential backoff with jitter," which means "when you retry, double your maximum sleep time (up to a cap), then sleep a random time between 0 and your new maximum." Based on previous review, that's what AWS recommends, that's what CloudFlare implements in its Go backoff package, that's what Google implements in the Python googleapiclient.
This algorithm is not that. Where did this algorithm come from? Is there research on it? Is there some specific reason to believe it's better for us than the algorithm that multiple tech companies multiple orders of magnitude larger than us use for a wide variety of applications?
Because of different implementation languages, third-party libraries, etc., I don't expect we'll ever get the entire Arvados stack using the exact same retry algorithm. But I feel fairly strongly that we shouldn't invent our own without very clear justification.
- Follows our coding standards and GUI style guidelines.
- ✅
Can we please remove the "no inline comments" point in the style guide? Nobody likes it, nobody follows it (including this branch), all it does is offer cover for ignoring the rest of the style guide too.
I updated the keepstore response codes so "full" cause 507 "insufficient storage" instead of 503.
I like this cleanup, thank you.
keepproxy uses an initial retry delay of 1s. This is kind of arbitrary. Should it use the regular default 2s instead?
This sounds like a nice cleanup to me too. Thanks.
Updated by Brett Smith 8 months ago
Regarding "how to test with a zero minimum": isn't our usual approach in Go code to have a package-level time.Duration variable that we modify during testing? So it starts as the zero duration for regular use, and the tests set it higher for measuring. That seems fine.
If that's not appropriate for whatever reason, a sufficiently tiny nonzero minimum also seems okay, even 10ms or something. I realize that's bending the algorithm a bit, but it seems like much less of a departure than inventing our own scaling.
Updated by Tom Clegg 8 months ago
21023-keepclient-retry-delay @ bffc439a72f03126359d468329b8d25febc7bc9e -- developer-run-tests: #4045
- Use jitter algorithm described in #note-18
- MinimumRetryDelay defaults to 1 ms, and is modified to aid testing
- keepproxy just uses the default 2 s initial delay
- removed the "no inline comments" point from the style guide
Updated by Brett Smith 8 months ago
Tom Clegg wrote in #note-20:
21023-keepclient-retry-delay @ bffc439a72f03126359d468329b8d25febc7bc9e -- developer-run-tests: #4045
- Use jitter algorithm described in #note-18
- MinimumRetryDelay defaults to 1 ms, and is modified to aid testing
I'm not sure we needed to do both the options in my last comment, but it's fine.
This is good to merge. I have some readability nits, please take what you want, ignore the rest, and merge:
- Commentf("elapsed %v / expect min %v", elapsed, min): I appreciate this comment. I think it could be a little more helpful if the wording was just a little more explicit, like maybe "elapsed %v (below|above) expected (min|max) %v".
- It would be nice if the checkInterval assertions had the same comment.
- In TestDelayCalculator, it would be nice if the different "paragraphs" were different test methods with names that describe what's being tested. That way the explanations in your comments actually show up in any failure logs. You could break out the anonymous defer func and checkInterval as StandaloneSuite methods to keep those DRY.
Thanks.
Updated by Tom Clegg 8 months ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|cdbbd5b29a0f2056ef52f23489a29af6d116a94f.