Feature #5538
closed[SDK] Add retry support to Go arvadosclient
Description
ArvadosClient doesn't retry errors. Add a "Retries" field to ArvadosClient and update ArvadosClient.CallRaw() to perform a similar retry loop as recently implemented in KeepClient.
Retryc.Client.Do()
if
Do()
returns an error (i.e., there was no HTTP response) and method is idempotent (GET or PUT)Do()
succeeds but the HTTP response code is retryable (use the same list as _HTTP_CAN_RETRY in the Python SDK: 408, 409, 422, 423, 500, 502, 503, 504)
Updated by Brett Smith over 9 years ago
- Category set to SDKs
- Target version set to Arvados Future Sprints
Updated by Peter Amstutz about 9 years ago
- Subject changed from [SDK] Add retry support to Go keepclient to [SDK] Add retry support to Go arvadosclient
Updated by Brett Smith about 9 years ago
Moving to Product Backlog because we need more discussion to decide where this goes in Future Sprints.
Updated by Brett Smith about 9 years ago
- Target version deleted (
Arvados Future Sprints)
Updated by Tom Clegg about 9 years ago
In general, POST is not safe to retry automatically after a socket error because we don't know whether the error happened before or after the transaction succeeded.
A safer way to make POST more likely to succeed is to check up front, and avoid using old keep-alive connections for POST requests. For example: before doing a POST or DELETE request, do something like
var MaxIdleConnectionDuration = 30 * time.Second // (this goes in package scope so callers can adjust it)
// (before doing a POST or DELETE, because we won't be able to retry safely after a network error)
if time.Since(c.lastClosedIdles) > MaxIdleConnectionDuration {
c.lastClosedIdles = time.Now()
c.Client.Transport.CloseIdleConnections()
}
This could be added independently of the retry logic. Perhaps in a separate branch? It would be good to try this on some real servers (and cloud networks) and find out whether it really works. Simulating a stale keep-alive connection in a test would be great too, but might be harder.
Updated by Brett Smith about 9 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith about 9 years ago
- Assigned To set to Radhika Chippada
- Target version changed from Arvados Future Sprints to 2015-11-11 sprint
Updated by Radhika Chippada about 9 years ago
- Status changed from New to In Progress
Updated by Radhika Chippada about 9 years ago
Branches 5538-arvadosclient-retry and 5538-close-idle-connections are ready for review.
In branch 5538-arvadosclient-retry, added Scheme to ArvadosClient, which by default is https as before, to facilitate using http protocol in testing.
Updated by Tom Clegg about 9 years ago
5538-close-idle-connections:
Comment typo DELERE -> DELETE
The comment "Before a POST or DELETE, close any idle connections" doesn't seem to say anything that isn't obvious from the code, but perhaps something like this would help explain it:
"A common failure mode is to reuse a keepalive connection that has been terminated (in a way that we can't detect) for being idle too long. POST and DELETE are not safe to retry automatically, so we minimize such failures by always using a new or recently active socket."
Should add a test case -- maybe just do a GET, set ac.lastClosedIdlesAt=time.Now().Add(-time.Minute), and do a POST, so we know the CloseIdleConnections() code doesn't do anything horrible. (OTOH, testing whether the socket actually gets closed would be nice-to-have, but seems harder and not as important.)
The rest LGTM.
Updated by Radhika Chippada about 9 years ago
Tom said: Should add a test case -- maybe just do a GET, set ac.lastClosedIdlesAt=time.Now().Add(-time.Minute), and do a POST,
Added this test and made updates to comments as suggested. Thanks.
Updated by Tom Clegg about 9 years ago
5538-arvadosclient-retry
How about
- if resp.StatusCode == 408 ||
- resp.StatusCode == 409 ||
- resp.StatusCode == 422 ||
- ...
- } else {
+ switch resp.StatusCode {
+ case 408, 409, 422, ...:
+ ...
+ default:
We could use a regular "for" loop for remainingTries to keep the counting all in one place:
- remainingTries := 1 + c.Retries
...
- for remainingTries > 0 {
+ for attempt := 0; attempt <= c.Retries; attempt++ {
...
- remainingTries -= 1
Probably a good idea to set a bool up front and use it in the retry handler as well as the "close idle socket" code when that's merged, since we want every request to be eligible for exactly one of "close idle socket" or "retry after failure":
retryable := false switch method { case "GET", "HEAD", "PUT", "OPTIONS": retryable = true }
(e.g., PATCH should fall into one category or the other, even if we forget to list it. RFC says it's not idempotent.)
The badResp
bool seems to be equivalent to resp == nil
. Maybe just check whether resp == nil
when deciding what to return?
This wasn't specified, but I think we need a delay between attempts, otherwise we'll blast through all of our attempts quickly instead of giving the server/network a chance to start working. Maybe we can set var RetryDelay = 2 * time.Second
so the caller can adjust it if desired, and test cases can set it to zero to avoid wasting time in tests?
I wonder if it would be better to always return the last error, even if it came from Do() and retryable is true? With the current code, you sometimes get the original error from Do() (which you could do type assertions on) but other times (depending on http method) you get a fmt.Errorf(), which has strings but loses the original error types.
tests¶
I think we could cover a lot more test cases by combining FailHandler and FailThenSucceedHandler into a single stub handler that has a slice of responses.
type APIStub struct { respStatus []int respBody []string } ... &APIStub{ respStatus: []int{500,500,200}, respBody: []string{``,``,`{"ok":"ok"}`}, }Then, we can do
- retries=2, status=500,500,500,200 → fail
- retries=2, status=401,200 → fail
- retries=2, status=500,401,200 → fail
- retries=2, status=500,500,200 → succeed
Updated by Tom Clegg about 9 years ago
Radhika Chippada wrote:
Added this test and made updates to comments as suggested. Thanks.
I think we should move this above the loop in the test case, so we're reusing the same arv for both requests:
arv, err := MakeArvadosClient()
Rest LGTM, thanks.
Updated by Radhika Chippada about 9 years ago
db68fd029f681f380f218b9b2f4fec3a315f140c addresses all the feedback from Note #15 (except may be the following).
Probably a good idea to set a bool up front and use it in the retry handler as well as the "close idle socket" code when that's merged, since we want every request to be eligible for exactly one of "close idle socket" or "retry after failure"
I set the bool and used it. However, I did not quite understand 'as well as the "close idle socket" code'. Please take a look at what I have done and let me know if I have missed anything. Thanks.
Updated by Tom Clegg about 9 years ago
5538-arvadosclient-retry @ db68fd0
Use if !retryable {
to decide whether to close idle connections
- 500,500,200 → should be 500
- get 200,500 → 200
- create 200,500 → 200
- delete 500,500,200 → 200
- update 500,500,200 → 200
Everything else LGTM, thanks!
Updated by Radhika Chippada about 9 years ago
3ded4c192b1fc655d9fffa225f792055f10f78c9
Use if !retryable { to decide whether to close idle connections
Done
"POST" shouldn't be in the retryable list (but you were right to add "DELETE", which I missed).
Corrected the retryable list
Couple of tests to add
Added these additional tests
Looks like one "create" test is affected: 500,500,200 → should be 500
No, this will be 200, not 500. Even though it is not retryable, the logic from line #221 of arvadosclient applies, not the conditional from line #206. This is because we have a bad status code, not an error during Client.do(). So, this will behave the same as "delete 500,500,200 → 200" and "update 500,500,200 → 200" (with the additional benefit of closing idle connections for POST). I will hold off merging into master until you are able to take a look at this and chime in on this.
Added one more test that simulates http error during server request to test that path as well.
Thanks.
Updated by Tom Clegg about 9 years ago
Radhika Chippada wrote:
Looks like one "create" test is affected: 500,500,200 → should be 500
No, this will be 200, not 500.
Ah, you're right. The test I was trying to think of is "if Do() fails, we must retry GET, and must not retry POST".
The added "error from Do()" tests don't seem to test the retry behavior, though: they only test "fail if Do() always fails" which is true whether or not the client retries. Not immediately obvious how to test this, though. Is there a way to make the API stub return an invalid HTTP response, so Do() fails?
We should be using ErrorMatches
instead of strings.Contains(...),Equals,true
so the failure message is useful...
Updated by Radhika Chippada about 9 years ago
The added "error from Do()" tests don't seem to test the retry behavior, though: they only test "fail if Do() always fails" which is true whether or not the client retries. Not immediately obvious how to test this, though. Is there a way to make the API stub return an invalid HTTP response, so Do() fails?
As we discussed on irc, unfortunately in the case of error, the ServeHttp method on the stub is not executed at all and hence is not possible to check for the attempt count from the stub. This of course is a huge improvement over not having any error tests at all. I did make updates to the stub parameters to use nil response codes and body to make the test less confusing.
We should be using ErrorMatches instead of strings.Contains(...),Equals,true so the failure message is useful...
Done
Updated by Radhika Chippada about 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:5d74e7583ad4b2d5dfd9eec50d625eded3739d1c.
Updated by Radhika Chippada about 9 years ago
Comments regarding 898c73ef5179e3c0cdc088ca369500b40df23b89 by Tom:
- This error producing logic is much better and we are now able to actually verify that retries not/happen as expected. Thanks.
- Please update my previous comment at line# 325 of arvadosclient_test.go ("Use nil responseBody to simulate error during request processing ...")
- Does not hurt to add an explanation that POST are not retryable in case of http errors, and hence the first error code of -1 is expected for "create", 0, -1, []int{-1, 200}, []string{``, `{"ok":"ok"}`},"