Idea #8539
Updated by Peter Amstutz almost 9 years ago
Currently the Python keepclient has a default num_retries of 0.
The command line tools (including FUSE) use a default num_retries of 3 by virtue of using a common definition for --retries in @arvados.command._util.retry_opt@
The Python Keep client's retry behavior is to wait 2 seconds before the first retry and double the wait on each subsequent retry (exponential backoff). So for a num_retries of 3, it will only wait a total of 2+4+8 = 14 seconds before giving up.
The Go Keep client does not implement any backoff at all, and uses a default retry count of 2.
This means the command line tools give up after ~15 seconds, Python scripts using the SDK will give up immediately, and clients using the Go SDK will give up pretty quickly.
Three suggestions:
1) Choose a conservative default behavior. The SDK should provide a default num_retries that is greater than zero, and it should be adjusted to provide a reasonable window to accommodate a temporary outage (from several minutes to tens of minutes).
2) Harmonize Python and Go SDK retry behavior.
3) Cap the growth of the exponential wait function to 32 or 64 seconds. Otherwise, the client will spend increasing amounts of time waiting for a service that has already been restored, but won't have noticed yet.
For example, if we implement (1) and (3), retrying for just over five minutes would imply num_retries=9 (2+4+8+16+32+64+64+64+64 = 318 seconds).
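To check the arithmetic above, here is a minimal sketch of the wait schedule. It assumes the first wait is 2 seconds and doubles on each retry, as described for the Python client. The function @backoff_schedule@ and its parameters @initial_wait@ and @max_wait@ are hypothetical names used for illustration only; they are not part of the Arvados SDK.

```python
def backoff_schedule(num_retries, initial_wait=2, max_wait=None):
    """Return the wait (in seconds) before each retry.

    Hypothetical helper for illustration; not an Arvados SDK function.
    The wait starts at initial_wait and doubles on every retry. If
    max_wait is given, each individual wait is capped at that value.
    """
    waits = []
    wait = initial_wait
    for _ in range(num_retries):
        waits.append(wait if max_wait is None else min(wait, max_wait))
        wait *= 2
    return waits


# Command line default described above: num_retries=3, no cap.
uncapped = backoff_schedule(3)
print(uncapped, sum(uncapped))

# Suggestions (1) and (3) combined: num_retries=9, waits capped at 64 seconds.
capped = backoff_schedule(9, max_wait=64)
print(capped, sum(capped))
```

The first call totals 14 seconds and the second totals 318 seconds, matching the figures in the text. With the cap in place, each additional retry beyond the sixth adds a fixed 64 seconds to the window, so the total window grows linearly in num_retries rather than doubling.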