Idea #8539

closed

[SDKs/FUSE] Better retry defaults

Added by Peter Amstutz almost 9 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
Story points: -

Description

Currently the Python keepclient has a default num_retries of 0.

The command line tools (including FUSE) use a default num_retries of 3 because they share a common definition of --retries in arvados.commands._util.retry_opt.

The Python Keep client waits 2 seconds before the first retry and doubles the wait on each subsequent retry (exponential backoff). So for a num_retries of 3, it will wait a total of only 2+4+8 = 14 seconds before giving up.
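
To make the arithmetic concrete, here is a minimal sketch of that schedule. The function name backoff_delays and its parameters are made up for illustration and are not part of the Arvados SDK; it only assumes a 2-second first wait that doubles each time, as described above.

```python
def backoff_delays(num_retries, initial_delay=2, factor=2):
    """Return the wait (in seconds) before each retry.

    Hypothetical helper, not an Arvados SDK function: it models a
    first wait of `initial_delay` seconds that grows by `factor`
    on every subsequent retry, with no upper bound.
    """
    return [initial_delay * factor ** i for i in range(num_retries)]


delays = backoff_delays(3)
total_wait = sum(delays)
print(delays, total_wait)
```

With num_retries=0 the list is empty, which matches the "give up immediately" behavior of the SDK default.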

The Go Keep client does not implement any backoff at all, and uses a default retry count of 2.

This means the command line tools give up after ~15 seconds, Python scripts using the SDK will give up immediately, and clients using the Go SDK will give up pretty quickly.

Three suggestions:

1) Choose a conservative default behavior. The SDK should provide a default num_retries that is greater than zero, and it should be adjusted to provide a reasonable window to accommodate a temporary outage (from several minutes to tens of minutes).

2) Harmonize Python and Go SDK retry behavior.

3) Cap the growth of the exponential wait function to 32 or 64 seconds. Otherwise, the client will spend increasing amounts of time waiting for a service that has already been restored, but won't have noticed yet.

For example, implementing (1) and (3) so that the client retries for just over five minutes, with the wait capped at 64 seconds, would imply num_retries=9 (2+4+8+16+32+64+64+64+64 = 318 seconds).
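
The figure can be checked with a short sketch. The names capped_delays and retries_needed are hypothetical, and the 2-second starting wait, doubling factor, and 64-second cap are the values assumed in this ticket rather than anything the SDK currently exposes.

```python
def capped_delays(num_retries, initial_delay=2, factor=2, max_delay=64):
    """Wait (in seconds) before each retry, never exceeding max_delay.

    Hypothetical helper for illustration only.
    """
    return [min(initial_delay * factor ** i, max_delay)
            for i in range(num_retries)]


def retries_needed(window_seconds, initial_delay=2, factor=2, max_delay=64):
    """Smallest num_retries whose total wait covers window_seconds."""
    total = 0
    n = 0
    while total < window_seconds:
        total += min(initial_delay * factor ** n, max_delay)
        n += 1
    return n


five_minutes = 5 * 60
n = retries_needed(five_minutes)
print(n, sum(capped_delays(n)))
```

Under the same assumptions, retries_needed can be called with a larger window (for example 30 * 60) to see what a "tens of minutes" default would look like.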


Related issues 5 (1 open, 4 closed)

Related to Arvados - Bug #7979: Python sdk needs more agressive retries (Closed, 12/08/2015)
Related to Arvados - Bug #7971: Python SDK Keep timeouts on su92l are too agressive (Closed, 12/08/2015)
Related to Arvados - Bug #8148: [FUSE] When we give up trying to write a block, the next operation on the file should fail (New)
Related to Arvados - Bug #12684: Let user specify a retry strategy on the client object, used for all API calls (Resolved, Brett Smith, 05/09/2023)
Related to Arvados - Idea #20107: Research retry strategies when SDK API calls return 5xx errors (Resolved, Brett Smith)
Actions #1

Updated by Peter Amstutz almost 9 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz almost 9 years ago

  • Release set to 12
Actions #3

Updated by Peter Grandi almost 9 years ago

I was asking these questions on IRC and this seems to me a good summary and a good plan. It is indeed very good for mostly-stateless Keepstores to be rebootable semi-transparently, a bit like NFS servers.

Actions #4

Updated by Brett Smith almost 9 years ago

In general, you want to retry an amount proportional to the amount of work you've already put into the effort. If you start an arv put and the network is down, that should just fail immediately so you know there's a problem and you can start fixing it. But if arv put has been running for hours and then the network is down, that's worth retrying for a while.

The lifetime of the process is not sufficient to determine how much effort has already gone into this. The last component of a long-running pipeline should retry for a good while, even if this is the first request it has made.

The ticket was written with Keep in mind, but much of the same rationale applies to the API server itself.

It would be better for the configuration knob to be "retry up to N seconds," rather than literally counting retries.
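
As a rough sketch of what a "retry up to N seconds" knob could look like, the loop below retries until a deadline instead of counting attempts. Everything here is hypothetical: TransientError, retry_until_deadline, and the default delays are placeholders rather than Arvados SDK names, and a real client would need to decide which errors count as transient. The sleep and clock parameters exist only so the loop can be exercised without real waiting.

```python
import time


class TransientError(Exception):
    """Placeholder for errors worth retrying (timeouts, 5xx, and so on)."""


def retry_until_deadline(operation, max_seconds, initial_delay=2.0,
                         max_delay=64.0, sleep=time.sleep,
                         clock=time.monotonic):
    """Call operation() until it succeeds or max_seconds have elapsed.

    The wait doubles after each failure, is capped at max_delay, and is
    shortened so the final attempt lands on the deadline.
    """
    deadline = clock() + max_seconds
    delay = initial_delay
    while True:
        try:
            return operation()
        except TransientError:
            remaining = deadline - clock()
            if remaining <= 0:
                raise  # out of time: surface the last error
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay)
```

A caller that has already invested hours of work could pass a large max_seconds, while an interactive tool could pass a small one so that a network problem is reported promptly.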

Actions #5

Updated by Tom Morris about 6 years ago

  • Release deleted (12)
Actions #7

Updated by Peter Amstutz almost 2 years ago

  • Release set to 60
Actions #8

Updated by Peter Amstutz over 1 year ago

  • Status changed from New to Resolved
Actions #9

Updated by Peter Amstutz over 1 year ago

  • Release deleted (60)
Actions #10

Updated by Brett Smith over 1 year ago

  • Related to Bug #12684: Let user specify a retry strategy on the client object, used for all API calls added
Actions #11

Updated by Brett Smith over 1 year ago

  • Related to Idea #20107: Research retry strategies when SDK API calls return 5xx errors added