Project

General

Profile

Actions

Bug #21023

closed

Go keepclient retry does not wait between retries

Added by Peter Amstutz 7 months ago. Updated 6 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Story points:
2.0
Release:
Release relationship:
Auto

Description

In sdk/go/keepclient getOrHead(), when a request fails with a retryable error, we don't wait between retries. We should use a randomized exponential backoff for congestion control similar to what we did with the API client.

This was found when looking at a program that talked to keep directly and made so many keepstore requests that it filled up the request queue, then started getting 503 errors. Because keepclient retried immediately instead of waiting, it did not give the overwhelmed keepstore a chance to drain its own queue and catch up.


Subtasks 1 (0 open1 closed)

Task #21404: Review 21023-keepclient-retry-delayResolvedTom Clegg02/13/2024Actions

Related issues

Related to Arvados - Feature #19972: Go arvados.Client retry with backoffResolvedTom Clegg03/08/2023Actions
Actions #1

Updated by Peter Amstutz 7 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 7 months ago

  • Related to Feature #19972: Go arvados.Client retry with backoff added
Actions #3

Updated by Peter Amstutz 7 months ago

  • Story points set to 2.0
Actions #4

Updated by Peter Amstutz 7 months ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2023-10-25 sprint to Development 2023-11-08 sprint
Actions #6

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2023-11-08 sprint to Development 2023-11-29 sprint
Actions #7

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2023-11-29 sprint to Development 2024-01-03 sprint
Actions #8

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-01-03 sprint to Development 2024-01-17 sprint
Actions #9

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Actions #10

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Actions #11

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-01-31 sprint
Actions #12

Updated by Peter Amstutz 4 months ago

  • Assigned To set to Tom Clegg
Actions #13

Updated by Peter Amstutz 3 months ago

  • Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Actions #14

Updated by Tom Clegg 3 months ago

Actions #15

Updated by Tom Clegg 3 months ago

  • Status changed from New to In Progress
Actions #16

Updated by Peter Amstutz 3 months ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Actions #17

Updated by Tom Clegg 3 months ago

21023-keepclient-retry-delay @ 22361307cf41f916afd562e7f33fcdaacefe5f9d -- developer-run-tests: #4041

  • All agreed upon points are implemented / addressed.
    • Adds delay between attempts to read or write a block
    • Initial delay defaults to 2s, caller can override via KeepClient.RetryDelay (or keepclient.DefaultRetryDelay if they really want to be awful and change a global)
    • After each attempt, delay increases by a random factor between 1.75x and 2x, but maxes out at 10x initial delay
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • ✅ Get/put retry tests are updated to check total time to do 3 retries for a few values of RetryDelay
  • Documentation has been updated.
    • New keepclient.KeepClient field is commented
  • Behaves appropriately at the intended scale (describe intended scale).
    • Expected to improve behavior at any given scale by a) not giving up as quickly during busy periods and b) thinning out thundering herds via randomized backoff.
  • Considered backwards and forwards compatibility issues between client and server.
    • ✅ n/a
  • Follows our coding standards and GUI style guidelines.

I updated the keepstore response codes so "full" cause 507 "insufficient storage" instead of 503. It's kind of tangential, but I noticed that the Python client considers 503 retryable and 507 non-retryable, which makes sense, and figured we should a) do that in the Go code as well, and b) have keepstore actually return that code when full.

keepproxy uses an initial retry delay of 1s. This is kind of arbitrary. Should it use the regular default 2s instead?

Actions #18

Updated by Brett Smith 3 months ago

Tom Clegg wrote in #note-17:

21023-keepclient-retry-delay @ 22361307cf41f916afd562e7f33fcdaacefe5f9d -- developer-run-tests: #4041

After each attempt, delay increases by a random factor between 1.75x and 2x, but maxes out at 10x initial delay

I believe the industry gold standard algorithm for this is "exponential backoff with jitter," which means "when you retry, double your maximum sleep time (up to a cap), then sleep a random time between 0 and your new maximum." Based on previous review, that's what AWS recommends, that's what CloudFlare implements in its Go backoff package, that's what Google implements in the Python googleapiclient.

This algorithm is not that. Where did this algorithm come from? Is there research on it? Is there some specific reason to believe it's better for us than the algorithm that multiple tech companies multiple orders of magnitude larger than us use for a wide variety of applications?

Because of different implementation languages, third-party libraries, etc., I don't expect we'll ever get the entire Arvados stack using the exact same retry algorithm. But I feel fairly strongly that we shouldn't invent our own without very clear justification.

  • Follows our coding standards and GUI style guidelines.

Can we please remove the "no inline comments" point in the style guide? Nobody likes it, nobody follows it (including this branch), all it does is offer cover for ignoring the rest of the style guide too.

I updated the keepstore response codes so "full" cause 507 "insufficient storage" instead of 503.

I like this cleanup, thank you.

keepproxy uses an initial retry delay of 1s. This is kind of arbitrary. Should it use the regular default 2s instead?

This sounds like a nice cleanup to me too. Thanks.

Actions #19

Updated by Brett Smith 3 months ago

Regarding "how to test with a zero minimum": isn't our usual approach in Go code to have a package variable time duration that we modify during testing? So it starts as the zero duration for regular use and the tests set it higher for measuring. That seems fine.

If that's not appropriate for whatever reason, a sufficiently tiny nonzero minimum also seems okay. Like even 10ms or something. I realize that's breaking the algorithm a bit but it seems much less than inventing our own scaling.

Actions #20

Updated by Tom Clegg 3 months ago

21023-keepclient-retry-delay @ bffc439a72f03126359d468329b8d25febc7bc9e -- developer-run-tests: #4045

  • Use jitter algorithm described in #note-18
  • MinimumRetryDelay defaults to 1 ms, and is modified to aid testing
  • keepproxy just uses the default 2 s initial delay
  • removed the "no inline comments" point from the style guide
Actions #21

Updated by Brett Smith 3 months ago

Tom Clegg wrote in #note-20:

21023-keepclient-retry-delay @ bffc439a72f03126359d468329b8d25febc7bc9e -- developer-run-tests: #4045

  • Use jitter algorithm described in #note-18
  • MinimumRetryDelay defaults to 1 ms, and is modified to aid testing

I'm not sure we needed to do both the options in my last comment, but it's fine.

This is good to merge. I have some readability nits, please take what you want, ignore the rest, and merge:

  • Commentf("elapsed %v / expect min %v", elapsed, min)) - I appreciate this comment. I think it could be a little more helpful if the wording was just a little more explicit, like maybe "elapsed %v (below|above) expected (min|max) %v"
  • It would be nice if the checkInterval assertions had the same comment.
  • In TestDelayCalculator, it would be nice if the different "paragraphs" were different test methods with names to describe what's being tested. That way the explanations in your comments actually show up in any failure logs. You could break out the anonymous defer func and checkInterval as StandaloneSuite methods to keep those DRY.

Thanks.

Actions #22

Updated by Tom Clegg 3 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #23

Updated by Peter Amstutz 6 days ago

  • Release set to 70
Actions

Also available in: Atom PDF