Actions
Bug #20425
closedInappropriate auto-retry when looking up collection by PDH
Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-
Release:
Release relationship:
Auto
Description
When handling a "GET /arvados/v1/collections/{pdh}" request (from keep-web) for a collection that doesn't exist locally, controller tries retrieving it from remote clusters.
- If the remote clusters reject the token with 401, keep-web seems to enter a sleep-and-retry loop (possibly auto-retry from #19972) even though
- 401 shouldn't be retryable
- keep-web shouldn't be auto-retrying these remote cluster requests -- at least not for 5 minutes
- keep-web log indicates the request stayed alive for 5 minutes even if the client times out and hangs up after 20 seconds, suggesting the request context is not being propagated/checked properly.
Encountered while debugging #19889 / #18790 (the relevant collection was trashed).
{
"ClusterID": "tordo",
"PID": 25591,
"RequestID": "req-1bo44d9wucz2q1q91ji0",
"level": "info",
"msg": "response",
"remoteAddr": "10.253.0.44:59982",
"reqBytes": 0,
"reqForwardedFor": "12.34.56.78, 127.0.0.1",
"reqHost": "2eae77cc83224941210ecef3e1e1dcd6-596.internal",
"reqMethod": "OPTIONS",
"reqPath": "arvados/v1/containers/tordo-dz642-dp0uvk7d13zo1t1/log/",
"reqQuery": "",
"respBody": "request failed: https://tordo.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596: 502 Bad Gateway: errors: [request failed: https://jutro.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=jutro-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v) request failed: https://pirca.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=pirca-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v)]\n",
"respBytes": 757,
"respStatus": "Internal Server Error",
"respStatusCode": 500,
"time": "2023-04-10T19:54:28.189688357Z",
"timeToStatus": 300.01849,
"timeTotal": 300.018498,
"timeWriteBody": 8e-06
}
It looks like the reason 401 was handled as retryable is that controller returns 500 when the remote clusters all respond 401 (token not accepted).
- In this specific case, controller should return 404
- In general, when controller does a "try all remotes" and all remotes respond 4xx, controller should respond 4xx (422?) rather than 5xx.
Your branch for this ticket should revert 38bae05cc. That test should work as before after your changes.
Related issues
Actions