Feature #19972
closedGo arvados.Client retry with backoff
Description
arvados.Client currently does not perform any retry behavior when a request fails with a 5xx error.
5xx errors should be retried with a random backoff time. Possible behavior (Fibonacci sequence):
1st attempt wait 0-1 seconds
2nd attempt, wait 1-2 seconds
3nd attempt, wait 2-3 seconds
4nd attempt, wait 3-5 seconds
5th attempt, wait 5-8 seconds
Another approach is randomized exponential backoff. As it happens, this is what the Google Python client does:
1st attempt wait 0-1 seconds
2nd attempt, wait 0-2 seconds
3nd attempt, wait 0-4 seconds
4nd attempt, wait 0-8 seconds
5th attempt, wait 0-16 seconds
(in this case, it probably makes sense to set a ceiling in terms of amount of time spent retrying rather than the number of retries).
The idea being that if the API server is overloaded, the clients should all select different retry intervals so that they don't just flood the API server by all retrying at a fixed time.
The caller should be able to control the retry behavior, e.g.,client.RetryWindow = time.Minute
means return an error if the request is still failing after 1 minute of retrying on the prevailing scheduleclient.RetryWindow = 0
means don't retry automatically
A request that uses a caller-provided context should also make sure to notice and return early when the context cancels during a retry delay.
Updated by Tom Clegg almost 2 years ago
- Related to Bug #19973: Dispatcher responds to 503 errors by limiting container concurrency added
Updated by Lucas Di Pentima almost 2 years ago
Probably we could reuse a preexisting library like https://pkg.go.dev/github.com/cloudflare/backoff
Updated by Peter Amstutz almost 2 years ago
It's a pretty simple behavior to pull a whole module for (but arguably better than duplicating code). Yes, exponential backoff with jitter algorithm that is described there seems like what we want.
Updated by Peter Amstutz almost 2 years ago
- Story points set to 2.0
- Target version changed from Future to To be scheduled
Updated by Peter Amstutz almost 2 years ago
- Target version changed from To be scheduled to 2023-02-15 sprint
Updated by Peter Amstutz almost 2 years ago
- Target version changed from 2023-02-15 sprint to 2023-03-01 sprint
Updated by Tom Clegg almost 2 years ago
- Related to Feature #19984: Go arvados.Client responds to 503 errors by limiting outgoing connection concurrency added
Updated by Peter Amstutz almost 2 years ago
- Tracker changed from Bug to Feature
Updated by Brett Smith almost 2 years ago
- Related to Idea #20107: Research retry strategies when SDK API calls return 5xx errors added
Updated by Tom Clegg almost 2 years ago
19972-go-client-retry @ 9baa7be4c57756cfef3a19139edbc54d398340e0 -- developer-run-tests: #3513
This uses the existing Timeout
field as the maximum retry window. Arguably this is how the available time should be used -- if the application is willing to wait that long for a response, it probably doesn't care exactly where in the stack the waiting/retrying is happening, so existing code that uses Timeout
probably wants this -- but OTOH it seems somewhat obfuscating. Is it important to let clients choose a shorter timeout for giving up on a single request but also auto-hangup-and-retry a request that takes longer?
- Update dispatcher (and other components as needed) to disable auto-retry
Updated by Peter Amstutz almost 2 years ago
- Target version changed from 2023-03-01 sprint to Development 2023-03-15 sprint
Updated by Tom Clegg almost 2 years ago
19972-go-client-retry @ 4aaff900e146c20c9e586223823e8661ce40d7f8 -- developer-run-tests: #3541
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-21:
19972-go-client-retry @ 4aaff900e146c20c9e586223823e8661ce40d7f8 -- developer-run-tests: #3541
This was a lot less invasive than I was worried it would be, and I really appreciate the design of the library you found. I'm sure it took time to make it this straightforward, thank you.
While I do like the library, it's under the MPL 2.0, which has some copyleft-ish requirements unlike the Apache License 2.0 we use for SDKs. I don't think there's a compatibility problem here, but it does put some additional requirements on us and other users of the SDK beyond the Apache License 2.0. As a matter of policy, are we okay with that?
In client.go
, both shouldRetry
and checkRetry
seem to be unused. Am I missing something? Can we remove them?
Most of the "disable retry" lines look like:
client.Timeout = 0 // disable auto-retry
… but both our coding guidelines and the Go style linked from there prefer separate comment sentences over inline comments. I think you wrote those guidelines, though, so I'm not gonna insist.
Not a review comment but a note for future readers: go-retryablehttpd uses exponential backoff by default, but you can configure the wait strategy using the Backoff
field. If the outcome of #20107 is that we'd rather use jitter, that's implemented in the library too. We just switch the Backoff
field to LinearJitterBackoff
with MinWait=0
.
Updated by Tom Clegg almost 2 years ago
Brett Smith wrote in #note-22:
the library you found.
I think Lucas found it... but yes, it did work out pretty well.
While I do like the library, it's under the MPL 2.0 [...] are we okay with that?
(From discussion) given that we already have other hashicorp MPL deps I think the answer is yes. (Plus, separately from this, we should continue to stay on top of licensing issues.)
In
client.go
, bothshouldRetry
andcheckRetry
seem to be unused. Am I missing something? Can we remove them?
Oops, yes. They were earlier versions of what's now in the retryablehttp library and in the inline func in Do. Removed.
prefer separate comment sentences over inline comments.
I personally think the inline form makes more sense for those mini comments, but indeed, the style guide currently says no, so no. Fixed.
19972-go-client-retry @ 47d1bc31bf9ee554665881c562ca2dea9acf9e22 -- developer-run-tests: #3548
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-23:
I personally think the inline form makes more sense for those mini comments
I agree, so I'm game to revise the coding guidelines if you are.
19972-go-client-retry @ 47d1bc31bf9ee554665881c562ca2dea9acf9e22 -- developer-run-tests: #3548
The changes look good to me, but that failed keep-balance test looks like it could actually be relevant? If it's just flaky, please add it to #20208, then go ahead and merge this. Thanks.
Updated by Tom Clegg almost 2 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|f94ac6e8ad9aec3c781cd71b72fcc5e2c1cedd8d.
Updated by Peter Amstutz over 1 year ago
- Related to Bug #21023: Go keepclient retry does not wait between retries added