Bug #7546

[SDK] Go keepclient should retry PUTs on network errors

Added by Brett Smith almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
-
Target version:
Start date:
10/14/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Keepproxy (which uses keepclient) is failing requests with the error "read tcp 10.26.0.7:25107: use of closed network connection". We have had similar errors in the past which were due to connection reuse in the Go http package. This has the side effect that if it tries to reuse a connection that's been closed, it will return an error instead of trying to reestablish a connection. The keepclient package should handle closed connection errors by retrying the request.


Subtasks

Task #7559: Review branch: 7546-put-retryResolvedRadhika Chippada


Related issues

Copied from Arvados - Bug #7491: [SDK] Go keepclient should retry on network errorsResolved10/08/2015

Associated revisions

Revision a81ba64f
Added by Radhika Chippada almost 4 years ago

closes #7546
Merge branch '7546-put-retry'

History

#1 Updated by Radhika Chippada almost 4 years ago

  • Status changed from New to In Progress

#2 Updated by Radhika Chippada almost 4 years ago

- Branch 7546-put-retry at commit 08a15a1e6b8e6f2e44f18328c1f6dd25343cffc2 is ready for review.

- I added the server to retry list if "status.statusCode 408 || status.statusCode 429 || status.statusCode >= 500" (using the list in GetOrHead). Do we need to include statusCode == 0 in this list? It appears that using a closed connection results in status.statusCode of 0 at line #205 of uploadToKeepServer method and this may need to be retried?

- One observation as I was working on putReplicas method: The requestId calculation at the beginning of this method uses "locator", which is not yet set. Shouldn't it be using "hash" instead of "locator" here?

 requestId := fmt.Sprintf("%x", md5.Sum([]byte(locator+time.Now().String())))[0:8] 

#3 Updated by Peter Amstutz almost 4 years ago

Radhika Chippada wrote:

- Branch 7546-put-retry at commit 08a15a1e6b8e6f2e44f18328c1f6dd25343cffc2 is ready for review.

- I added the server to retry list if "status.statusCode 408 || status.statusCode 429 || status.statusCode >= 500" (using the list in GetOrHead). Do we need to include statusCode 0 in this list? It appears that using a closed connection results in status.statusCode of 0 at line #205 of uploadToKeepServer method and this may need to be retried?

Yes, it should retry on statusCode 0 because that indicates a network error, which we definitely want to retry.

In addition, it probably should not retry on status code 503, because that's the error returned by keepstore when the keep server is full and cannot accept any more blocks.

- One observation as I was working on putReplicas method: The requestId calculation at the beginning of this method uses "locator", which is not yet set. Shouldn't it be using "hash" instead of "locator" here?

[...]

Yes, that looks like a bug.

#4 Updated by Radhika Chippada almost 4 years ago

Peter,
Addressed those comments.

One other observation / thought. The tests are now taking longer to run, which is understandable because now we are retrying. However, I am wondering if it is agreeable to set retry count to 0 for tests where retry is not needed / not going to make any difference one way or the other. Example of one such test seems to be: TestPutWithTooManyFail

Thanks.

#5 Updated by Radhika Chippada almost 4 years ago

Updated three tests to use kc.Retries = 0 (where retries are not needed). Now the test run time has come down from current master 49s to 10s.

#6 Updated by Peter Amstutz almost 4 years ago

Could you tweak the comment "// Timeout, too many requests, or other server side failure" to mention that error 503 means "keep server full" so we don't want to retry that case. Then please go ahead and merge.

#7 Updated by Radhika Chippada almost 4 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:a81ba64fcf67efcdb1323402612ca3fe5abf7b92.

Also available in: Atom PDF