Story #20107


Research retry strategies when SDK API calls return 5xx errors

Added by Peter Amstutz about 1 year ago. Updated 9 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date: -
Due date: -
% Done: 0%
Estimated time: -
Story points: -

Related issues

Related to Arvados - Bug #12684: Let user specify a retry strategy on the client object, used for all API calls (Resolved, Brett Smith, 05/09/2023)
Related to Arvados - Feature #19972: Go arvados.Client retry with backoff (Resolved, Tom Clegg, 03/08/2023)
Related to Arvados - Story #8539: [SDKs/FUSE] Better retry defaults (Resolved)
#1

Updated by Peter Amstutz about 1 year ago

  • Status changed from New to In Progress
#2

Updated by Peter Amstutz about 1 year ago

  • Status changed from In Progress to New
#3

Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Tom Clegg
#4

Updated by Peter Amstutz about 1 year ago

  • Assigned To changed from Tom Clegg to Brett Smith
#5

Updated by Brett Smith about 1 year ago

https://pkg.go.dev/github.com/cloudflare/backoff is one starting point, along with its linked reading.

#6

Updated by Brett Smith about 1 year ago

The blog post is from AWS. The fact that AWS wants its customers to do this, and CloudFlare likes it enough to implement it, suggests it should be good enough for us.

The blog post does not give any concrete numbers for time durations. It gives formulas for calculating waits, but never defines or suggests initial values. The main choice is between what it calls "full jitter":

sleep = random_between(0, min(cap, base * 2 ** attempt))

and "decorrelated jitter":

sleep = min(cap, random_between(base, sleep * 3))

See the two graphs at the bottom for how these compare on time and server load. For what it's worth, CloudFlare's backoff package implements "full jitter."
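Taken literally in Python, the two formulas look like this (a sketch for comparison only, not Arvados code; base, cap, and attempt/prev_sleep are the knobs from the post):

```python
import random

def full_jitter(base, cap, attempt):
    # "Full jitter": uniform pick from zero up to the capped exponential.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

def decorrelated_jitter(base, cap, prev_sleep):
    # "Decorrelated jitter": the next sleep depends on the previous sleep,
    # not on the attempt count.
    return min(cap, random.uniform(base, prev_sleep * 3))
```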

CloudFlare's package does have default starting values: 5 minutes for the default interval, and 6 hours for the longest it will wait. But there's no explanation of what applications those numbers are intended for, or how they were chosen.

It seems to me there's a clear trade-off between responsiveness and likelihood of success. Different Arvados clients might even have different needs here: Workbench generally wants as much responsiveness as it can practically get, while Crunch will almost always prioritize eventual success over any level of responsiveness, especially after a long-running, expensive compute job.

With all that in mind, I think my starting suggestions would be:

  • Because different clients will have different priorities, it seems best if the SDKs allow users to tune these parameters themselves.
  • Tuning the timing parameters of step interval and max wait seems more useful than the static "number of retries."
  • Most of our client tools tend to prioritize success over responsiveness, so I think default timing parameters more like CloudFlare's are probably better defaults for our SDKs than the Python client's single-digit seconds. As a starting idea without any experimentation, I think I would suggest at least 10 seconds for the default interval, probably more like 30, and maybe up to a minute or two. Five minutes feels really high for what we're doing.
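As a sketch of the first two bullets, the SDK could expose the timing knobs rather than a fixed retry count (the class and field names here are illustrative only, not an actual Arvados SDK API):

```python
import random

class RetryOptions:
    """Hypothetical user-tunable retry knobs; names are illustrative only."""

    def __init__(self, base=30.0, cap=300.0):
        self.base = base  # default interval, in the tens of seconds as suggested
        self.cap = cap    # longest single wait between attempts

    def next_sleep(self, attempt):
        # Full jitter, per the AWS formulas quoted above.
        return random.uniform(0.0, min(self.cap, self.base * 2 ** attempt))
```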
#7

Updated by Tom Clegg about 1 year ago

An extra thought: it might be useful to allow the caller to adjust max-wait-time over time. For example, in crunch-run and its fuse mount, 15s after startup max=30s might be a good choice, but 12h later max=5m or even 1h might be more appropriate. OTOH, in the worst case, having a bunch of 12h-old containers each waiting 1h for each call to an overloaded API server could waste a lot of money on cloud instances.

#8

Updated by Brett Smith about 1 year ago

  • Tracker changed from Bug to Story

I agree, but we need to be careful talking about "max wait" because there are two kinds. There's a cap on the maximum time to sleep between retries (literally cap in the formulas), and then, optionally, a cap on the total time spent sleeping before you give up and the operation returns an error.

I think it would be good for the SDKs to let you set both. I think the really hard question for us is, what should the default "give up" time be, and specifically, should it be never?

One way to think about that question: in the scenario you posed where lots of expensive compute nodes are idling waiting for the API server to come back, what are the odds that giving up and requiring a rerun is going to be cheaper than continuing to sleep and hope for the best? I'm starting to come around to thinking those odds are much lower than we've thought in the past.
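The distinction can be made concrete with a hypothetical helper (not SDK code) that takes both caps: sleep_cap bounds each individual pause, give_up bounds the total elapsed time before the last error is re-raised, and give_up=None means retry forever:

```python
import random
import time

def call_with_retries(op, base=1.0, sleep_cap=60.0, give_up=3600.0,
                      clock=time.monotonic, sleep=time.sleep):
    # Illustrates the two kinds of "max wait": sleep_cap limits each
    # individual pause; give_up limits total elapsed time before the
    # last error propagates (None = never give up).
    start = clock()
    attempt = 0
    while True:
        try:
            return op()
        except Exception:
            pause = random.uniform(0.0, min(sleep_cap, base * 2 ** attempt))
            if give_up is not None and clock() - start + pause > give_up:
                raise
            sleep(pause)
            attempt += 1
```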

#9

Updated by Peter Amstutz about 1 year ago

Brett Smith wrote in #note-8:

One way to think about that question: in the scenario you posed where lots of expensive compute nodes are idling waiting for the API server to come back, what are the odds that giving up and requiring a rerun is going to be cheaper than continuing to sleep and hope for the best? I'm starting to come around to thinking those odds are much lower than we've thought in the past.

Just to make sure I'm reading this right: you are saying that you think the odds are low that giving up and requiring a rerun will be cheaper than waiting, correct?

I feel like we at least discussed, if not already implemented, a strategy for crunch-run where the time it would wait would be proportional to the amount of time already spent, before giving up. So the more time already invested, the longer it is worth it to wait and see if services come back.
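One way to express that idea as a rule (purely illustrative; the fraction and floor values here are made up, not anything crunch-run implements):

```python
def give_up_budget(runtime_so_far, fraction=0.1, floor=300.0):
    # Willing to wait some fraction of the time already invested, with a
    # floor, so a 12-hour-old container tolerates a longer outage than a
    # freshly started one.
    return max(floor, runtime_so_far * fraction)
```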

#10

Updated by Brett Smith about 1 year ago

Peter Amstutz wrote in #note-9:

Just to make sure I'm reading this right: you are saying that you think the odds are low that giving up and requiring a rerun will be cheaper than waiting. correct?

Right.

#11

Updated by Brett Smith about 1 year ago

  • Related to Bug #12684: Let user specify a retry strategy on the client object, used for all API calls added
#12

Updated by Brett Smith about 1 year ago

  • Related to Feature #19972: Go arvados.Client retry with backoff added
#13

Updated by Brett Smith 12 months ago

Cross-posting from #19972 for the Go SDK: go-retryablehttp uses exponential backoff by default, but you can configure the wait strategy via the Backoff field. If the outcome of this ticket is that we'd rather use jitter, that's implemented in the library too: we just switch the Backoff field to LinearJitterBackoff with MinWait=0.

#14

Updated by Brett Smith 9 months ago

In the Python SDK, googleapiclient.http._retry_request accepts rand and sleep arguments: rand is a zero-argument function returning a float, and sleep is a time.sleep-style callable, used together to compute and perform the pause between retries. We can adjust the retry strategy by passing our own function(s) into this from arvados.api._retry_request.
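Drop-in replacements for those two hooks might look like this (a sketch; capped_sleep and its cap value are made up, and the assumption is that googleapiclient derives each pause from rand() and then hands the result to sleep):

```python
import random
import time

def jitter_rand():
    # Zero-argument jitter source, same contract as random.random.
    return random.random()

def capped_sleep(seconds, cap=60.0, _sleep=time.sleep):
    # time.sleep-style callable that bounds each individual pause.
    pause = min(seconds, cap)
    _sleep(pause)
    return pause
```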

#15

Updated by Brett Smith 9 months ago

  • Related to Story #8539: [SDKs/FUSE] Better retry defaults added