Bug #16134

[controller] handle unreachable federation peer better

Added by Ward Vandewege 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -
Release relationship: Auto

Description

When an Arvados cluster is configured with an unreachable federation peer, requests quickly start failing with 502 Bad Gateway, and arvados-controller soon consumes all the file descriptors it can get:

Feb 05 22:00:45 9tee4.arvadosapi.com arvados-controller[22394]: {"PID":22394,"RequestID":"req-tuynvloji3hz9h42b16w","level":"info","msg":"response","remoteAddr":"127.0.0.1:33622","reqBytes":0,"reqForwardedFor":"10.100.32.5","reqHost":"9tee4.arvadosapi.com","reqMethod":"GET","reqPath":"arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65","reqQuery":"","respBody":"{\"errors\":[\"errors: [Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp 54.209.184.185:443: i/o timeout request failed: https://9tee4.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: 502 Bad Gateway: errors: [request failed: https://c97qk.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: 502 Bad Gateway: errors: [Get https://c97qk.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp 10.25.0.6:443: socket: too many open files Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp: lookup 4xphq.arvadosapi.com on 127.0.0.1:53: dial udp 127.0.0.1:53: socket: too many open files Get https://9tee4.arvadosapi.com/arvados/v1/collections/9f26a86b6030a69ad222cf67d71c9502+65: dial tcp: lookup 9tee4.arvadosapi.com on 127.0.0.1:53: dial udp 127.0.0.1:53: socket: too many open files] Get https://4xphq.arvadosapi.com/arvados/v1/collections/9f26a86","respBytes":4853,"respStatus":"Bad Gateway","respStatusCode":502,"time":"2020-02-05T22:00:45.131869763Z","timeToStatus":54.670812,"timeTotal":54.670929,"timeWriteBody":0.000117}

Related issues

Related to Arvados - Bug #16133: [controller] add loop prevention to federation lookups in new code path (Resolved, 02/06/2020)

History

#1 Updated by Ward Vandewege 5 months ago

  • Description updated

#2 Updated by Ward Vandewege 5 months ago

  • Related to Bug #16133: [controller] add loop prevention to federation lookups in new code path added

#3 Updated by Tom Clegg 5 months ago

I suspect the file descriptor issue was really caused by #16133. The problem here is just that a slow peer slows down every request that needs an answer from all remotes (for example, getting a collection by PDH that no other peer has).

#4 Updated by Peter Amstutz 5 months ago

  • Status changed from New to Resolved

Confirmed: this was a side effect that went away when #16133 was fixed.

#5 Updated by Peter Amstutz 5 months ago

  • Release set to 22
