Bug #16134

Updated by Ward Vandewege 6 months ago

When an Arvados cluster is configured with an unreachable federation peer, things degrade very quickly: arvados-controller consumes all the file descriptors it can get, and requests start failing with "too many open files" errors:

<pre>
Feb 05 22:00:45 9tee4.arvadosapi.com arvados-controller[22394]: {"PID":22394,"RequestID":"req-tuynvloji3hz9h42b16w","level":"info","msg":"response","remoteAddr":"127.0.0.1:33622","reqBytes":0,"reqForwardedFor":"10.100.32.5","reqHost":"9tee4.arvadosapi.com","reqMethod":"GET","reqPath":"arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65","reqQuery":"","respBody":"{\"errors\":[\"errors: [Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp 54.209.184.185:443: i/o timeout request failed: https://9tee4.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: 502 Bad Gateway: errors: [request failed: https://c97qk.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: 502 Bad Gateway: errors: [Get https://c97qk.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp 10.25.0.6:443: socket: too many open files Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp: lookup 4xphq.arvadosapi.com on 127.0.0.1:53: dial udp 127.0.0.1:53: socket: too many open files Get https://9tee4.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp: lookup 9tee4.arvadosapi.com on 127.0.0.1:53: dial udp 127.0.0.1:53: socket: too many open files] Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86","respBytes":4853,"respStatus":"Bad Gateway","respStatusCode":502,"time":"2020-02-05T22:00:45.131869763Z","timeToStatus":54.670812,"timeTotal":54.670929,"timeWriteBody":0.000117}
</pre>
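
To confirm the descriptor growth while reproducing this, one option is to watch how many descriptors the controller process holds compared to its limit. The following is a minimal diagnostic sketch, not part of Arvados: it assumes a Linux host with @/proc@ mounted, and the helper names @count_open_fds@ and @soft_fd_limit@ are hypothetical. Reading another user's process normally requires root or the same UID.

```python
import os


def count_open_fds(pid):
    """Return the number of file descriptors currently open in process `pid`.

    Assumes Linux: each open descriptor appears as an entry in /proc/<pid>/fd.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_fd_limit(pid):
    """Return the soft 'Max open files' limit for `pid`, or None if unlimited.

    Assumes the Linux /proc/<pid>/limits layout:
    Max open files    <soft>    <hard>    files
    """
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None


if __name__ == "__main__":
    # Inspect this script's own process by default; substitute the
    # arvados-controller PID (22394 in the log above) on the affected host.
    pid = os.getpid()
    print(f"pid={pid} open_fds={count_open_fds(pid)} soft_limit={soft_fd_limit(pid)}")
```

Running this periodically against the controller PID while the peer is unreachable should show the open descriptor count climbing toward the soft limit, if the log's "too many open files" errors come from that process.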
