Actions
Bug #20425
openInappropriate auto-retry when looking up collection by PDH
Start date:
Due date:
% Done:
0%
Estimated time:
Story points:
-
Description
When handling a "GET /arvados/v1/collections/{pdh}" request (from keep-web) for a collection that doesn't exist locally, controller tries retrieving it from remote clusters.
- If the remote clusters reject the token with 401, keep-web seems to enter a sleep-and-retry loop (possibly auto-retry from #19972) even though
- 401 shouldn't be retryable
- keep-web shouldn't be auto-retrying these remote cluster requests -- at least not for 5 minutes
- keep-web log indicates the request stayed alive for 5 minutes even if the client times out and hangs up after 20 seconds, suggesting the request context is not being propagated/checked properly.
Encountered while debugging #19889 / #18790 (the relevant collection was trashed).
{
"ClusterID": "tordo",
"PID": 25591,
"RequestID": "req-1bo44d9wucz2q1q91ji0",
"level": "info",
"msg": "response",
"remoteAddr": "10.253.0.44:59982",
"reqBytes": 0,
"reqForwardedFor": "12.34.56.78, 127.0.0.1",
"reqHost": "2eae77cc83224941210ecef3e1e1dcd6-596.internal",
"reqMethod": "OPTIONS",
"reqPath": "arvados/v1/containers/tordo-dz642-dp0uvk7d13zo1t1/log/",
"reqQuery": "",
"respBody": "request failed: https://tordo.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596: 502 Bad Gateway: errors: [request failed: https://jutro.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=jutro-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v) request failed: https://pirca.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=pirca-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v)]\n",
"respBytes": 757,
"respStatus": "Internal Server Error",
"respStatusCode": 500,
"time": "2023-04-10T19:54:28.189688357Z",
"timeToStatus": 300.01849,
"timeTotal": 300.018498,
"timeWriteBody": 8e-06
}
It looks like the reason 401 was handled as retryable is that controller returns 500 when the remote clusters all respond 401 (token not accepted).
- In this specific case, controller should return 404
- In general, when controller does a "try all remotes" and all remotes respond 4xx, controller should respond 4xx (422?) rather than 5xx.
Your branch for this ticket should revert 38bae05cc. That test should work as before after your changes.
Related issues
Updated by Peter Amstutz about 1 month ago
I feel like we ran into something like this in the past, don't know if it was fixed, where certain federation error conditions were reported incorrectly.
Updated by Brett Smith 23 days ago
- Related to Bug #12684: Let user specify a retry strategy on the client object, used for all API calls added
Updated by Tom Clegg 17 days ago
- Description updated (diff)
Controller logs show the requests coming from keep-web at these times:
$ zgrep req-vmb9quhf11z51pmznm3v /var/log/arvados/20230410-arvados-controller.log.11.gz | grep '"request"' | jq .time "2023-04-10T19:49:28.187992569Z" "2023-04-10T19:49:29.237081095Z" "2023-04-10T19:49:31.279749327Z" "2023-04-10T19:49:35.320735146Z" "2023-04-10T19:49:43.374702382Z" "2023-04-10T19:49:59.416408965Z" "2023-04-10T19:50:29.459708263Z" "2023-04-10T19:50:59.502499590Z" "2023-04-10T19:51:29.544149362Z" "2023-04-10T19:51:59.589114689Z" "2023-04-10T19:52:29.633098321Z" "2023-04-10T19:52:59.677319460Z" "2023-04-10T19:53:29.720274015Z" "2023-04-10T19:53:59.764863571Z"
Actions