Bug #20425
closedInappropriate auto-retry when looking up collection by PDH
Description
- If the remote clusters reject the token with 401, keep-web seems to enter a sleep-and-retry loop (possibly auto-retry from #19972) even though
- 401 shouldn't be retryable
- keep-web shouldn't be auto-retrying these remote cluster requests -- at least not for 5 minutes
- keep-web log indicates the request stayed alive for 5 minutes even if the client times out and hangs up after 20 seconds, suggesting the request context is not being propagated/checked properly.
Encountered while debugging #19889 / #18790 (the relevant collection was trashed).
{
"ClusterID": "tordo",
"PID": 25591,
"RequestID": "req-1bo44d9wucz2q1q91ji0",
"level": "info",
"msg": "response",
"remoteAddr": "10.253.0.44:59982",
"reqBytes": 0,
"reqForwardedFor": "12.34.56.78, 127.0.0.1",
"reqHost": "2eae77cc83224941210ecef3e1e1dcd6-596.internal",
"reqMethod": "OPTIONS",
"reqPath": "arvados/v1/containers/tordo-dz642-dp0uvk7d13zo1t1/log/",
"reqQuery": "",
"respBody": "request failed: https://tordo.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596: 502 Bad Gateway: errors: [request failed: https://jutro.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=jutro-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v) request failed: https://pirca.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=pirca-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v)]\n",
"respBytes": 757,
"respStatus": "Internal Server Error",
"respStatusCode": 500,
"time": "2023-04-10T19:54:28.189688357Z",
"timeToStatus": 300.01849,
"timeTotal": 300.018498,
"timeWriteBody": 8e-06
}
It looks like the reason 401 was handled as retryable is that controller returns 500 when the remote clusters all respond 401 (token not accepted).
- In this specific case, controller should return 404
- In general, when controller does a "try all remotes" and all remotes respond 4xx, controller should respond 4xx (422?) rather than 5xx.
Your branch for this ticket should revert 38bae05cc. That test should work as before after your changes.
Updated by Peter Amstutz over 1 year ago
I feel like we ran into something like this in the past, don't know if it was fixed, where certain federation error conditions were reported incorrectly.
Updated by Brett Smith over 1 year ago
- Related to Bug #12684: Let user specify a retry strategy on the client object, used for all API calls added
Updated by Tom Clegg over 1 year ago
- Description updated (diff)
Controller logs show the requests coming from keep-web at these times:
$ zgrep req-vmb9quhf11z51pmznm3v /var/log/arvados/20230410-arvados-controller.log.11.gz | grep '"request"' | jq .time "2023-04-10T19:49:28.187992569Z" "2023-04-10T19:49:29.237081095Z" "2023-04-10T19:49:31.279749327Z" "2023-04-10T19:49:35.320735146Z" "2023-04-10T19:49:43.374702382Z" "2023-04-10T19:49:59.416408965Z" "2023-04-10T19:50:29.459708263Z" "2023-04-10T19:50:59.502499590Z" "2023-04-10T19:51:29.544149362Z" "2023-04-10T19:51:59.589114689Z" "2023-04-10T19:52:29.633098321Z" "2023-04-10T19:52:59.677319460Z" "2023-04-10T19:53:29.720274015Z" "2023-04-10T19:53:59.764863571Z"
Updated by Tom Clegg over 1 year ago
20425-fed-pdh-retry @ e473db8ff60533953ac69f97f2f8c018e487671e
No fix yet, just a failing test checking for a 4xx response when all upstreams return 4xx.
Updated by Tom Clegg over 1 year ago
20425-fed-pdh-retry @ fb5de708b8e1466d2e541a618b9eca7c92a7d6f2 -- developer-run-tests: #3695
When looking up a collection by PDH, return 422 (not 502) if local returns 404, all remotes return non-retryable errors, and not all errors are 404.
Updated by Tom Clegg over 1 year ago
20425-fed-pdh-retry @ 5240e4ddb64c6064a58ded1673bed84d95288a3a -- developer-run-tests: #3696
developer-run-tests-apps-workbench-integration: #3992
Decrease timeout from 5m to 1m for keep-web → controller calls.
(Cancelling them when the caller hangs up would be ideal, but is not that easy, because of the way sessions are shared by concurrent keep-web transactions.)
Updated by Peter Amstutz over 1 year ago
- Target version changed from Future to Development 2023-07-05 sprint
Updated by Brett Smith over 1 year ago
Tom Clegg wrote in #note-10:
20425-fed-pdh-retry @ 5240e4ddb64c6064a58ded1673bed84d95288a3a -- developer-run-tests: #3696
developer-run-tests-apps-workbench-integration: #3992
This looks good to me, please merge. Sorry for the delay.
Updated by Tom Clegg over 1 year ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|aa555c48fce3c4fa849e672cf6c0613eba0fc235.