Bug #7546
closed[SDK] Go keepclient should retry PUTs on network errors
Description
Keepproxy (which uses keepclient) is failing requests with the error "read tcp 10.26.0.7:25107: use of closed network connection". We have had similar errors in the past which were due to connection reuse in the Go http package. This has the side effect that if it tries to reuse a connection that's been closed, it will return an error instead of trying to reestablish a connection. The keepclient package should handle closed connection errors by retrying the request.
Related issues
Updated by Radhika Chippada about 9 years ago
- Status changed from New to In Progress
Updated by Radhika Chippada about 9 years ago
- Branch 7546-put-retry at commit 08a15a1e6b8e6f2e44f18328c1f6dd25343cffc2 is ready for review.
- I added the server to retry list if "status.statusCode 408 || status.statusCode 429 || status.statusCode >= 500" (using the list in GetOrHead). Do we need to include statusCode == 0 in this list? It appears that using a closed connection results in status.statusCode of 0 at line #205 of uploadToKeepServer method and this may need to be retried?
- One observation as I was working on putReplicas method: The requestId calculation at the beginning of this method uses "locator", which is not yet set. Shouldn't it be using "hash" instead of "locator" here?
requestId := fmt.Sprintf("%x", md5.Sum([]byte(locator+time.Now().String())))[0:8]
Updated by Peter Amstutz about 9 years ago
Radhika Chippada wrote:
- Branch 7546-put-retry at commit 08a15a1e6b8e6f2e44f18328c1f6dd25343cffc2 is ready for review.
- I added the server to retry list if "status.statusCode 408 || status.statusCode 429 || status.statusCode >= 500" (using the list in GetOrHead). Do we need to include statusCode 0 in this list? It appears that using a closed connection results in status.statusCode of 0 at line #205 of uploadToKeepServer method and this may need to be retried?
Yes, it should retry on statusCode 0 because that indicates a network error, which we definitely want to retry.
In addition, it probably should not retry on status code 503, because that's the error returned by keepstore when the keep server is full and cannot accept any more blocks.
- One observation as I was working on putReplicas method: The requestId calculation at the beginning of this method uses "locator", which is not yet set. Shouldn't it be using "hash" instead of "locator" here?
[...]
Yes, that looks like a bug.
Updated by Radhika Chippada about 9 years ago
Peter,
Addressed those comments.
One other observation / thought. The tests are now taking longer to run, which is understandable because now we are retrying. However, I am wondering if it is agreeable to set retry count to 0 for tests where retry is not needed / not going to make any difference one way or the other. Example of one such test seems to be: TestPutWithTooManyFail
Thanks.
Updated by Radhika Chippada about 9 years ago
Updated three tests to use kc.Retries = 0 (where retries are not needed). Now the test run time has come down from current master 49s to 10s.
Updated by Peter Amstutz about 9 years ago
Could you tweak the comment "// Timeout, too many requests, or other server side failure" to mention that error 503 means "keep server full" so we don't want to retry that case. Then please go ahead and merge.
Updated by Radhika Chippada about 9 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:a81ba64fcf67efcdb1323402612ca3fe5abf7b92.