Feature #5538

[SDK] Add retry support to Go arvadosclient

Added by Peter Amstutz over 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
SDKs
Target version:
Start date:
11/04/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

ArvadosClient doesn't retry errors. Add a "Retries" field to ArvadosClient and update ArvadosClient.CallRaw() to perform a similar retry loop as recently implemented in KeepClient.

Retry c.Client.Do() if
  • Do() returns an error (i.e., there was no HTTP response) and method is idempotent (GET or PUT)
  • Do() succeeds but the HTTP response code is retryable (use the same list as _HTTP_CAN_RETRY in the Python SDK: 408, 409, 422, 423, 500, 502, 503, 504)

Subtasks

Task #7717: Review branch 5538-close-idle-connectionsResolvedTom Clegg

Task #7685: Review branch 5538-arvadosclient-retryResolvedTom Clegg


Related issues

Related to Arvados - Bug #3910: [SDKs] Client libraries should retry failed API requests after errors like Gateway Timeout (part 2)Closed09/17/2014

Associated revisions

Revision 6b606a65
Added by Radhika Chippada almost 6 years ago

refs #5538
Merge branch '5538-close-idle-connections'

Revision 5d74e758
Added by Radhika Chippada almost 6 years ago

closes #5538
Merge branch '5538-arvadosclient-retry'

Revision 9aef0397
Added by Tom Clegg almost 6 years ago

Merge branch '5538-test-post-retry' refs #5538

History

#1 Updated by Brett Smith over 6 years ago

  • Category set to SDKs
  • Target version set to Arvados Future Sprints

#2 Updated by Peter Amstutz about 6 years ago

  • Subject changed from [SDK] Add retry support to Go keepclient to [SDK] Add retry support to Go arvadosclient

#3 Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)

#4 Updated by Brett Smith about 6 years ago

Moving to Product Backlog because we need more discussion to decide where this goes in Future Sprints.

#5 Updated by Brett Smith about 6 years ago

  • Target version deleted (Arvados Future Sprints)

#6 Updated by Tom Clegg almost 6 years ago

  • Description updated (diff)

#7 Updated by Tom Clegg almost 6 years ago

In general, POST is not safe to retry automatically after a socket error because we don't know whether the error happened before or after the transaction succeeded.

A safer way to make POST more likely to succeed is to check up front, and avoid using old keep-alive connections for POST requests. For example: before doing a POST or DELETE request, do something like


var MaxIdleConnectionDuration = 30 * time.Second // (this goes in package scope so callers can adjust it)

// (before doing a POST or DELETE, because we won't be able to retry safely after a network error)
if time.Since(c.lastClosedIdles) > MaxIdleConnectionDuration {
        c.lastClosedIdles = time.Now()
        c.Client.Transport.CloseIdleConnections()
}

This could be added independently of the retry logic. Perhaps in a separate branch? It would be good to try this on some real servers (and cloud networks) and find out whether it really works. Simulating a stale keep-alive connection in a test would be great too, but might be harder.

#8 Updated by Brett Smith almost 6 years ago

  • Target version set to Arvados Future Sprints

#9 Updated by Brett Smith almost 6 years ago

  • Assigned To set to Radhika Chippada
  • Target version changed from Arvados Future Sprints to 2015-11-11 sprint

#10 Updated by Radhika Chippada almost 6 years ago

  • Story points set to 1.0

#11 Updated by Radhika Chippada almost 6 years ago

  • Status changed from New to In Progress

#12 Updated by Radhika Chippada almost 6 years ago

Branches 5538-arvadosclient-retry and 5538-close-idle-connections are ready for review.

In branch 5538-arvadosclient-retry, added Scheme to ArvadosClient, which by default is https as before, to facilitate using http protocol in testing.

#13 Updated by Tom Clegg almost 6 years ago

5538-close-idle-connections:

Comment typo DELERE -> DELETE

The comment "Before a POST or DELETE, close any idle connections" doesn't seem to say anything that isn't obvious from the code, but perhaps something like this would help explain it:

"A common failure mode is to reuse a keepalive connection that has been terminated (in a way that we can't detect) for being idle too long. POST and DELETE are not safe to retry automatically, so we minimize such failures by always using a new or recently active socket."

Should add a test case -- maybe just do a GET, set ac.lastClosedIdlesAt=time.Now().Add(-time.Minute), and do a POST, so we know the CloseIdleConnections() code doesn't do anything horrible. (OTOH, testing whether the socket actually gets closed would be nice-to-have, but seems harder and not as important.)

The rest LGTM.

#14 Updated by Radhika Chippada almost 6 years ago

Tom said: Should add a test case -- maybe just do a GET, set ac.lastClosedIdlesAt=time.Now().Add(-time.Minute), and do a POST,

Added this test and made updates to comments as suggested. Thanks.

#15 Updated by Tom Clegg almost 6 years ago

5538-arvadosclient-retry

How about

-               if resp.StatusCode == 408 ||
-                       resp.StatusCode == 409 ||
-                       resp.StatusCode == 422 ||
-                       ...
-               } else {
+               switch resp.StatusCode {
+               case 408, 409, 422, ...:
+                       ...
+               default:

We could use a regular "for" loop for remainingTries to keep the counting all in one place:

-       remainingTries := 1 + c.Retries
        ...
-       for remainingTries > 0 {
+       for attempt := 0; attempt <= c.Retries; attempt++ {
                ...
-               remainingTries -= 1

Probably a good idea to set a bool up front and use it in the retry handler as well as the "close idle socket" code when that's merged, since we want every request to be eligible for exactly one of "close idle socket" or "retry after failure":

        retryable := false
        switch method {
        case "GET", "HEAD", "PUT", "OPTIONS":
                retryable = true
        }

(e.g., PATCH should fall into one category or the other, even if we forget to list it. RFC says it's not idempotent.)

The badResp bool seems to be equivalent to resp == nil. Maybe just check whether resp == nil when deciding what to return?

This wasn't specified, but I think we need a delay between attempts, otherwise we'll blast through all of our attempts quickly instead of giving the server/network a chance to start working. Maybe we can set var RetryDelay = 2 * time.Second so the caller can adjust it if desired, and test cases can set it to zero to avoid wasting time in tests?

I wonder if it would be better to always return the last error, even if it came from Do() and retryable is true? With the current code, you sometimes get the original error from Do() (which you could do type assertions on) but other times (depending on http method) you get a fmt.Errorf(), which has strings but loses the original error types.

tests

I think we could cover a lot more test cases by combining FailHandler and FailThenSucceedHandler into a single stub handler that has a slice of responses.

type APIStub struct {
        respStatus []int
        respBody   []string
}
...
&APIStub{
        respStatus: []int{500,500,200},
        respBody: []string{``,``,`{"ok":"ok"}`},
}
Then, we can do
  • retries=2, status=500,500,500,200 → fail
  • retries=2, status=401,200 → fail
  • retries=2, status=500,401,200 → fail
  • retries=2, status=500,500,200 → succeed

#16 Updated by Tom Clegg almost 6 years ago

Radhika Chippada wrote:

Added this test and made updates to comments as suggested. Thanks.

I think we should move this above the loop in the test case, so we're reusing the same arv for both requests:

                arv, err := MakeArvadosClient()

Rest LGTM, thanks.

#17 Updated by Radhika Chippada almost 6 years ago

db68fd029f681f380f218b9b2f4fec3a315f140c addresses all the feedback from Note #15 (except may be the following).

Probably a good idea to set a bool up front and use it in the retry handler as well as the "close idle socket" code when that's merged, since we want every request to be eligible for exactly one of "close idle socket" or "retry after failure"

I set the bool and used it. However, I did not quite understand 'as well as the "close idle socket" code'. Please take a look at what I have done and let me know if I have missed anything. Thanks.

#18 Updated by Tom Clegg almost 6 years ago

5538-arvadosclient-retry @ db68fd0

Use if !retryable { to decide whether to close idle connections

"POST" shouldn't be in the retryable list (but you were right to add "DELETE", which I missed). Looks like one "create" test is affected:
  • 500,500,200 → should be 500
Couple of tests to add:
  • get 200,500 → 200
  • create 200,500 → 200
  • delete 500,500,200 → 200
  • update 500,500,200 → 200

Everything else LGTM, thanks!

#19 Updated by Radhika Chippada almost 6 years ago

3ded4c192b1fc655d9fffa225f792055f10f78c9

Use if !retryable { to decide whether to close idle connections

Done

"POST" shouldn't be in the retryable list (but you were right to add "DELETE", which I missed).

Corrected the retryable list

Couple of tests to add

Added these additional tests

Looks like one "create" test is affected: 500,500,200 → should be 500

No, this will be 200, not 500. Even though it is not retryable, the logic from line #221 of arvadosclient applies, not the conditional from line #206. This is because we have a bad status code, not an error during Client.do(). So, this will behave the same as "delete 500,500,200 → 200" and "update 500,500,200 → 200" (with the additional benefit of closing idle connections for POST). I will hold off merging into master until you are able to take a look at this and chime in on this.

Added one more test that simulates http error during server request to test that path as well.

Thanks.

#20 Updated by Tom Clegg almost 6 years ago

Radhika Chippada wrote:

Looks like one "create" test is affected: 500,500,200 → should be 500

No, this will be 200, not 500.

Ah, you're right. The test I was trying to think of is "if Do() fails, we must retry GET, and must not retry POST".

The added "error from Do()" tests don't seem to test the retry behavior, though: they only test "fail if Do() always fails" which is true whether or not the client retries. Not immediately obvious how to test this, though. Is there a way to make the API stub return an invalid HTTP response, so Do() fails?

We should be using ErrorMatches instead of strings.Contains(...),Equals,true so the failure message is useful...

#21 Updated by Radhika Chippada almost 6 years ago

The added "error from Do()" tests don't seem to test the retry behavior, though: they only test "fail if Do() always fails" which is true whether or not the client retries. Not immediately obvious how to test this, though. Is there a way to make the API stub return an invalid HTTP response, so Do() fails?

As we discussed on irc, unfortunately in the case of error, the ServeHttp method on the stub is not executed at all and hence is not possible to check for the attempt count from the stub. This of course is a huge improvement over not having any error tests at all. I did make updates to the stub parameters to use nil response codes and body to make the test less confusing.

We should be using ErrorMatches instead of strings.Contains(...),Equals,true so the failure message is useful...

Done

#22 Updated by Radhika Chippada almost 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:5d74e7583ad4b2d5dfd9eec50d625eded3739d1c.

#23 Updated by Radhika Chippada almost 6 years ago

Comments regarding 898c73ef5179e3c0cdc088ca369500b40df23b89 by Tom:

  • This error producing logic is much better and we are now able to actually verify that retries not/happen as expected. Thanks.
  • Please update my previous comment at line# 325 of arvadosclient_test.go ("Use nil responseBody to simulate error during request processing ...")
  • Does not hurt to add an explanation that POST are not retryable in case of http errors, and hence the first error code of -1 is expected for "create", 0, -1, []int{-1, 200}, []string{``, `{"ok":"ok"}`},"

Also available in: Atom PDF