Feature #19972
Updated by Peter Amstutz almost 2 years ago
arvados.Client currently does not retry when a request fails with a 5xx error. 5xx errors should be retried after a randomized backoff delay.

One possible schedule follows the Fibonacci sequence:

* 1st attempt: wait 0-1 seconds
* 2nd attempt: wait 1-2 seconds
* 3rd attempt: wait 2-3 seconds
* 4th attempt: wait 3-5 seconds
* 5th attempt: wait 5-8 seconds

Another approach is randomized exponential backoff. As it happens, this is what the Google Python client does:

* 1st attempt: wait 0-1 seconds
* 2nd attempt: wait 0-2 seconds
* 3rd attempt: wait 0-4 seconds
* 4th attempt: wait 0-8 seconds
* 5th attempt: wait 0-16 seconds

(In this case, it probably makes sense to set a ceiling on the total time spent retrying rather than on the number of retries.)

The idea is that if the API server is overloaded, the clients should all select different retry intervals, so that they don't flood the API server by all retrying at the same moment.

The caller should be able to control the retry behavior, e.g.,

* @client.RetryWindow = time.Minute@ means return an error if the request is still failing after 1 minute of retrying on the prevailing schedule
* @client.RetryWindow = 0@ means don't retry automatically

A request that uses a caller-provided context should also notice and return early when the context is canceled during a retry delay.