Bug #16134
closed
[controller] handle unreachable federation peer better
Story points: -
Release:
Release relationship: Auto
Description
When an Arvados cluster is configured with an unreachable federation peer, things go south real fast: arvados-controller quickly consumes all the file descriptors it can get, as this log entry shows:
Feb 05 22:00:45 9tee4.arvadosapi.com arvados-controller[22394]: {"PID":22394,"RequestID":"req-tuynvloji3hz9h42b16w","level":"info","msg":"response","remoteAddr":"127.0.0.1:33622","reqBytes":0,"reqForwardedFor":"10.100.32.5","reqHost":"9tee4.arvadosapi.com","reqMethod":"GET","reqPath":"arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65","reqQuery":"","respBody":"{\"errors\":[\"errors: [Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp 54.209.184.185:443: i/o timeout request failed: https://9tee4.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: 502 Bad Gateway: errors: [request failed: https://c97qk.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: 502 Bad Gateway: errors: [Get https://c97qk.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp 10.25.0.6:443: socket: too many open files Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp: lookup 4xphq.arvadosapi.com on 127.0.0.1:53: dial udp 127.0.0.1:53: socket: too many open files Get https://9tee4.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp: lookup 9tee4.arvadosapi.com on 127.0.0.1:53: dial udp 127.0.0.1:53: socket: too many open files] Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86","respBytes":4853,"respStatus":"Bad Gateway","respStatusCode":502,"time":"2020-02-05T22:00:45.131869763Z","timeToStatus":54.670812,"timeTotal":54.670929,"timeWriteBody":0.000117}
Updated by Ward Vandewege almost 5 years ago
- Related to Bug #16133: [controller] add loop prevention to federation lookups in new code path added
Updated by Tom Clegg almost 5 years ago
I suspect the file descriptor issue was really caused by #16133. The problem here is just that a slow peer slows down every request that needs an answer from all remotes (such as getting a collection by PDH that no other peer has).
Updated by Peter Amstutz almost 5 years ago
- Status changed from New to Resolved
Confirmed: this was a side effect that went away when #16133 was fixed.