Bug #7491
closed[SDK] Go keepclient should retry on network errors
Description
Keepproxy (which uses keepclient) is failing requests with the error "read tcp 10.26.0.7:25107: use of closed network connection". We have had similar errors in the past which were due to connection reuse in the Go http package. This has the side effect that if it tries to reuse a connection that's been closed, it will return an error instead of trying to reestablish a connection. The keepclient package should handle closed connection errors by retrying the request.
Updated by Tom Clegg over 9 years ago
7491-keepclient-bugs @ 8ff3305 (fixes for bugs noticed while working on "keepexercise")
In particular "forgot to close Body of a non-200 response" could be related (it's not unusual to see 404s in keep, for example). Perhaps Go leaves the connection alive for a long time (expecting us to Close()
it), then eventually times out and reuses it for a future request, only to find that by then the underlying network connection has closed.
Updated by Nico César over 9 years ago
extra data: when this happens:
2015-10-08_19:57:07.51104 2015/10/08 19:57:07 168.61.219.240 GET /ae344ba0e1c17b543a204ed0331c6c76+67108864+A92ea0097c6eaaba8268f1c48d3ad15a6237cb9da@56293cd6 200 67108864 62878888 http://keep12.qr1hi.arvadosapi.com:25107/ae344ba0e1c17b543a204ed0331c6c76+67108864+A92ea0097c6eaaba8268f1c48d3ad15a6237cb9da@56293cd6 Reader failed checksum
I see that the TCP socket isn't up anymore
from:
keepproxy.qr1hi:/home/nico# ss -a -n | grep 10.26.0.7:25107 tcp ESTAB 0 0 10.26.0.12:49250 10.26.0.7:25107
to
keepproxy.qr1hi:/home/nico# ss -a -n | grep 10.26.0.7:25107 1!keepproxy.qr1hi:/home/nico#
Updated by Peter Amstutz over 9 years ago
- Target version set to Arvados Future Sprints
Updated by Tom Clegg over 9 years ago
Preliminary comments on 7491-go-keepclient-retry
Probably better to try all keep services in probe order, then go back and try the retryable ones again, like we do in the Python SDK. Slightly better chances of server A coming back to life if we spend a bit of time trying server B, C, D before coming back to it (can also avoid waiting through two of the same slow error).already addressed- Should handle 408 and 429 as retryable errors.
- Move "return different errors" stuff to #7492 where it's needed, instead of creeping it into this branch:
Return BlockNotFound only if all servers succeeded and replied exactly 404.If not returning BlockNotFound, return something likefmt.Errorf("%+v", errs)
so the caller can see/report what went wrong.
Updated by Tom Clegg over 9 years ago
We should probably retry in putReplicas()
as well.
Updated by Tom Clegg over 9 years ago
+1 de-duplicating code, phew
The method!="GET"
case needs resp.Body.Close()
The check for if resp.Body != nil
is superfluous: When err is nil, resp always contains a non-nil resp.Body.
Any particular reason to change %q
back to %s
where we include <4K of response body in an error message, or was this just a merge error? I changed it to %q
so we wouldn't get overly gnarly error logs if body has something we don't expect (201??).
This should print the actual method used, now that it's not always GET:
log.Printf("DEBUG: GET %s failed: %v", locator, errs)
The "writable" map here ({ks.url: ""}
) looks like an error. (3x occurrences) I suppose it isn't actually used here since we're only reading, but in that case it would probably make more sense if it were just nil
.
kc.SetServiceRoots(map[string]string{"x": ks.url}, map[string]string{ks.url: ""}, nil)
Updated by Brett Smith over 9 years ago
- Target version changed from Arvados Future Sprints to 2015-10-28 sprint
Updated by Brett Smith over 9 years ago
- Story points changed from 1.0 to 0.5
.5 to finish the GET retry branch. PUT has been split out.
Updated by Tom Clegg over 9 years ago
7491-go-keepclient-retry LGTM @ 1f37045, thanks
Updated by Peter Amstutz over 9 years ago
- Status changed from New to Resolved
Applied in changeset arvados|commit:f19e50e99dc4939199a9b9a4381a775d40453267.