Bug #7491

[SDK] Go keepclient should retry on network errors

Added by Peter Amstutz about 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
10/08/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

Keepproxy (which uses keepclient) is failing requests with the error "read tcp 10.26.0.7:25107: use of closed network connection". We have had similar errors in the past which were due to connection reuse in the Go http package. This has the side effect that if it tries to reuse a connection that's been closed, it will return an error instead of trying to reestablish a connection. The keepclient package should handle closed connection errors by retrying the request.


Subtasks

Task #7493: Review 7491-keepclient-bugsResolvedPeter Amstutz

Task #7545: Review 7491-go-keepclient-retryResolvedPeter Amstutz


Related issues

Related to Arvados - Bug #7492: [keepproxy] Report upstream server errors as 502 Bad Gateway instead of 404Resolved10/14/2015

Copied to Arvados - Bug #7546: [SDK] Go keepclient should retry PUTs on network errorsResolved10/14/2015

Associated revisions

Revision e55f6e9f
Added by Tom Clegg about 4 years ago

Merge branch '7491-keepclient-bugs' refs #7491

Revision f19e50e9
Added by Peter Amstutz almost 4 years ago

Merge branch '7491-go-keepclient-retry' closes #7491

History

#1 Updated by Peter Amstutz about 4 years ago

  • Description updated (diff)

#3 Updated by Tom Clegg about 4 years ago

7491-keepclient-bugs @ 8ff3305 (fixes for bugs noticed while working on "keepexercise")

In particular "forgot to close Body of a non-200 response" could be related (it's not unusual to see 404s in keep, for example). Perhaps Go leaves the connection alive for a long time (expecting us to Close() it), then eventually times out and reuses it for a future request, only to find that by then the underlying network connection has closed.

#4 Updated by Nico C├ęsar about 4 years ago

extra data: when this happens:

2015-10-08_19:57:07.51104 2015/10/08 19:57:07 168.61.219.240 GET /ae344ba0e1c17b543a204ed0331c6c76+67108864+A92ea0097c6eaaba8268f1c48d3ad15a6237cb9da@56293cd6 200 67108864 
62878888 http://keep12.qr1hi.arvadosapi.com:25107/ae344ba0e1c17b543a204ed0331c6c76+67108864+A92ea0097c6eaaba8268f1c48d3ad15a6237cb9da@56293cd6 Reader failed checksum

I see that the TCP socket isn't up anymore
from:

keepproxy.qr1hi:/home/nico# ss -a -n | grep 10.26.0.7:25107
tcp    ESTAB      0      0             10.26.0.12:49250         10.26.0.7:25107 

to

keepproxy.qr1hi:/home/nico# ss -a -n | grep 10.26.0.7:25107
1!keepproxy.qr1hi:/home/nico#

#5 Updated by Peter Amstutz about 4 years ago

  • Target version set to Arvados Future Sprints

#6 Updated by Peter Amstutz about 4 years ago

  • Story points set to 1.0

#7 Updated by Peter Amstutz about 4 years ago

  • Assigned To set to Peter Amstutz

#8 Updated by Tom Clegg about 4 years ago

Preliminary comments on 7491-go-keepclient-retry

  • Probably better to try all keep services in probe order, then go back and try the retryable ones again, like we do in the Python SDK. Slightly better chances of server A coming back to life if we spend a bit of time trying server B, C, D before coming back to it (can also avoid waiting through two of the same slow error). already addressed
  • Should handle 408 and 429 as retryable errors.
  • Move "return different errors" stuff to #7492 where it's needed, instead of creeping it into this branch:
    • Return BlockNotFound only if all servers succeeded and replied exactly 404.
    • If not returning BlockNotFound, return something like fmt.Errorf("%+v", errs) so the caller can see/report what went wrong.

#9 Updated by Tom Clegg about 4 years ago

We should probably retry in putReplicas() as well.

#10 Updated by Tom Clegg about 4 years ago

+1 de-duplicating code, phew

The method!="GET" case needs resp.Body.Close()

The check for if resp.Body != nil is superfluous: When err is nil, resp always contains a non-nil resp.Body.

Any particular reason to change %q back to %s where we include <4K of response body in an error message, or was this just a merge error? I changed it to %q so we wouldn't get overly gnarly error logs if body has something we don't expect (201??).

This should print the actual method used, now that it's not always GET:

log.Printf("DEBUG: GET %s failed: %v", locator, errs)

The "writable" map here ({ks.url: ""}) looks like an error. (3x occurrences) I suppose it isn't actually used here since we're only reading, but in that case it would probably make more sense if it were just nil.

kc.SetServiceRoots(map[string]string{"x": ks.url}, map[string]string{ks.url: ""}, nil)

#11 Updated by Brett Smith about 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-10-28 sprint

#12 Updated by Brett Smith almost 4 years ago

  • Story points changed from 1.0 to 0.5

.5 to finish the GET retry branch. PUT has been split out.

#13 Updated by Tom Clegg almost 4 years ago

7491-go-keepclient-retry LGTM @ 1f37045, thanks

#14 Updated by Peter Amstutz almost 4 years ago

  • Status changed from New to Resolved

Applied in changeset arvados|commit:f19e50e99dc4939199a9b9a4381a775d40453267.

Also available in: Atom PDF