Bug #20425
Updated by Tom Clegg over 1 year ago
When handling a "GET /arvados/v1/collections/{pdh}" request (from keep-web) for a collection that doesn't exist locally, controller tries retrieving it from remote clusters. * If the remote clusters reject the token with 401, keep-web seems to enter a sleep-and-retry loop (possibly auto-retry from #19972) even though ** 401 shouldn't be retryable ** keep-web shouldn't be auto-retrying these remote cluster requests -- at least not for 5 minutes * keep-web log indicates the request stayed alive for 5 minutes even if the client times out and hangs up after 20 seconds, suggesting the request context is not being propagated/checked properly. Encountered while debugging #19889 / #18790 (the relevant collection was trashed). <pre><code <code class="json"> { "ClusterID": "tordo", "PID": 25591, "RequestID": "req-1bo44d9wucz2q1q91ji0", "level": "info", "msg": "response", "remoteAddr": "10.253.0.44:59982", "reqBytes": 0, "reqForwardedFor": "12.34.56.78, 127.0.0.1", "reqHost": "2eae77cc83224941210ecef3e1e1dcd6-596.internal", "reqMethod": "OPTIONS", "reqPath": "arvados/v1/containers/tordo-dz642-dp0uvk7d13zo1t1/log/", "reqQuery": "", "respBody": "request failed: https://tordo.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596: 502 Bad Gateway: errors: [request failed: https://jutro.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=jutro-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v) request failed: https://pirca.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=pirca-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v)]\n", "respBytes": 757, "respStatus": "Internal Server Error", "respStatusCode": 500, "time": "2023-04-10T19:54:28.189688357Z", "timeToStatus": 300.01849, "timeTotal": 300.018498, "timeWriteBody": 8e-06 } </code></pre> It looks like the reason 401 was handled as retryable is that controller returns 500 when the remote clusters all respond 401 (token not accepted). * In this specific case, controller should return 404 * In general, when controller does a "try all remotes" and all remotes respond 4xx, controller should respond 4xx (422?) rather than 5xx. </code>