Project

General

Profile

Actions

Feature #19972

closed

Go arvados.Client retry with backoff

Added by Peter Amstutz over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Story points:
2.0
Release relationship:
Auto

Description

arvados.Client currently does not perform any retry behavior when a request fails with a 5xx error.

5xx errors should be retried with a random backoff time. Possible behavior (Fibonacci sequence):

1st attempt wait 0-1 seconds
2nd attempt, wait 1-2 seconds
3nd attempt, wait 2-3 seconds
4nd attempt, wait 3-5 seconds
5th attempt, wait 5-8 seconds

Another approach is randomized exponential backoff. As it happens, this is what the Google Python client does:

1st attempt wait 0-1 seconds
2nd attempt, wait 0-2 seconds
3nd attempt, wait 0-4 seconds
4nd attempt, wait 0-8 seconds
5th attempt, wait 0-16 seconds

(in this case, it probably makes sense to set a ceiling in terms of amount of time spent retrying rather than the number of retries).

The idea being that if the API server is overloaded, the clients should all select different retry intervals so that they don't just flood the API server by all retrying at a fixed time.

The caller should be able to control the retry behavior, e.g.,
  • client.RetryWindow = time.Minute means return an error if the request is still failing after 1 minute of retrying on the prevailing schedule
  • client.RetryWindow = 0 means don't retry automatically

A request that uses a caller-provided context should also make sure to notice and return early when the context cancels during a retry delay.


Subtasks 1 (0 open1 closed)

Task #20044: Review 19972-go-client-retryResolvedTom Clegg03/08/2023Actions

Related issues

Related to Arvados - Bug #19973: Dispatcher responds to 503 errors by limiting container concurrencyResolvedTom Clegg02/16/2023Actions
Related to Arvados - Feature #19984: Go arvados.Client responds to 503 errors by limiting outgoing connection concurrencyResolvedTom Clegg02/21/2023Actions
Related to Arvados - Idea #20107: Research retry strategies when SDK API calls return 5xx errorsNewBrett SmithActions
Related to Arvados - Bug #21023: Go keepclient retry does not wait between retriesResolvedTom Clegg02/13/2024Actions
Actions #1

Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg over 1 year ago

  • Related to Bug #19973: Dispatcher responds to 503 errors by limiting container concurrency added
Actions #3

Updated by Lucas Di Pentima over 1 year ago

Probably we could reuse a preexisting library like https://pkg.go.dev/github.com/cloudflare/backoff

Actions #4

Updated by Peter Amstutz over 1 year ago

It's a pretty simple behavior to pull a whole module for (but arguably better than duplicating code). Yes, exponential backoff with jitter algorithm that is described there seems like what we want.

Actions #5

Updated by Peter Amstutz over 1 year ago

  • Story points set to 2.0
  • Target version changed from Future to To be scheduled
Actions #6

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
Actions #7

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
Actions #8

Updated by Peter Amstutz over 1 year ago

  • Target version changed from To be scheduled to 2023-02-15 sprint
Actions #9

Updated by Peter Amstutz about 1 year ago

  • Release set to 57
Actions #10

Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Tom Clegg
Actions #11

Updated by Peter Amstutz about 1 year ago

  • Target version changed from 2023-02-15 sprint to 2023-03-01 sprint
Actions #12

Updated by Peter Amstutz about 1 year ago

  • Assigned To deleted (Tom Clegg)
Actions #13

Updated by Tom Clegg about 1 year ago

  • Related to Feature #19984: Go arvados.Client responds to 503 errors by limiting outgoing connection concurrency added
Actions #14

Updated by Peter Amstutz about 1 year ago

  • Tracker changed from Bug to Feature
Actions #15

Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Tom Clegg
Actions #16

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
Actions #17

Updated by Brett Smith about 1 year ago

  • Related to Idea #20107: Research retry strategies when SDK API calls return 5xx errors added
Actions #18

Updated by Tom Clegg about 1 year ago

  • Status changed from New to In Progress
Actions #19

Updated by Tom Clegg about 1 year ago

19972-go-client-retry @ 9baa7be4c57756cfef3a19139edbc54d398340e0 -- developer-run-tests: #3513

This uses the existing Timeout field as the maximum retry window. Arguably this is how the available time should be used -- if the application is willing to wait that long for a response, it probably doesn't care exactly where in the stack the waiting/retrying is happening, so existing code that uses Timeout probably wants this -- but OTOH it seems somewhat obfuscating. Is it important to let clients choose a shorter timeout for giving up on a single request but also auto-hangup-and-retry a request that takes longer?

TODO:
  • Update dispatcher (and other components as needed) to disable auto-retry
Actions #20

Updated by Peter Amstutz about 1 year ago

  • Target version changed from 2023-03-01 sprint to Development 2023-03-15 sprint
Actions #22

Updated by Brett Smith about 1 year ago

Tom Clegg wrote in #note-21:

19972-go-client-retry @ 4aaff900e146c20c9e586223823e8661ce40d7f8 -- developer-run-tests: #3541

This was a lot less invasive than I was worried it would be, and I really appreciate the design of the library you found. I'm sure it took time to make it this straightforward, thank you.

While I do like the library, it's under the MPL 2.0, which has some copyleft-ish requirements unlike the Apache License 2.0 we use for SDKs. I don't think there's a compatibility problem here, but it does put some additional requirements on us and other users of the SDK beyond the Apache License 2.0. As a matter of policy, are we okay with that?

In client.go, both shouldRetry and checkRetry seem to be unused. Am I missing something? Can we remove them?

Most of the "disable retry" lines look like:

        client.Timeout = 0 // disable auto-retry

… but both our coding guidelines and the Go style linked from there prefer separate comment sentences over inline comments. I think you wrote those guidelines, though, so I'm not gonna insist.

Not a review comment but a note for future readers: go-retryablehttpd uses exponential backoff by default, but you can configure the wait strategy using the Backoff field. If the outcome of #20107 is that we'd rather use jitter, that's implemented in the library too. We just switch the Backoff field to LinearJitterBackoff with MinWait=0.

Actions #23

Updated by Tom Clegg about 1 year ago

Brett Smith wrote in #note-22:

the library you found.

I think Lucas found it... but yes, it did work out pretty well.

While I do like the library, it's under the MPL 2.0 [...] are we okay with that?

(From discussion) given that we already have other hashicorp MPL deps I think the answer is yes. (Plus, separately from this, we should continue to stay on top of licensing issues.)

In client.go, both shouldRetry and checkRetry seem to be unused. Am I missing something? Can we remove them?

Oops, yes. They were earlier versions of what's now in the retryablehttp library and in the inline func in Do. Removed.

prefer separate comment sentences over inline comments.

I personally think the inline form makes more sense for those mini comments, but indeed, the style guide currently says no, so no. Fixed.

19972-go-client-retry @ 47d1bc31bf9ee554665881c562ca2dea9acf9e22 -- developer-run-tests: #3548

Actions #24

Updated by Brett Smith about 1 year ago

Tom Clegg wrote in #note-23:

I personally think the inline form makes more sense for those mini comments

I agree, so I'm game to revise the coding guidelines if you are.

19972-go-client-retry @ 47d1bc31bf9ee554665881c562ca2dea9acf9e22 -- developer-run-tests: #3548

The changes look good to me, but that failed keep-balance test looks like it could actually be relevant? If it's just flaky, please add it to #20208, then go ahead and merge this. Thanks.

Actions #26

Updated by Tom Clegg about 1 year ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #27

Updated by Peter Amstutz 7 months ago

  • Related to Bug #21023: Go keepclient retry does not wait between retries added
Actions

Also available in: Atom PDF