Project

General

Profile

Actions

Bug #20425

closed

Inappropriate auto-retry when looking up collection by PDH

Added by Tom Clegg over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Story points:
-
Release relationship:
Auto

Description

When handling a "GET /arvados/v1/collections/{pdh}" request (from keep-web) for a collection that doesn't exist locally, controller tries retrieving it from remote clusters.
  • If the remote clusters reject the token with 401, keep-web seems to enter a sleep-and-retry loop (possibly auto-retry from #19972) even though
    • 401 shouldn't be retryable
    • keep-web shouldn't be auto-retrying these remote cluster requests -- at least not for 5 minutes
  • keep-web log indicates the request stayed alive for 5 minutes even if the client times out and hangs up after 20 seconds, suggesting the request context is not being propagated/checked properly.

Encountered while debugging #19889 / #18790 (the relevant collection was trashed).

{
  "ClusterID": "tordo",
  "PID": 25591,
  "RequestID": "req-1bo44d9wucz2q1q91ji0",
  "level": "info",
  "msg": "response",
  "remoteAddr": "10.253.0.44:59982",
  "reqBytes": 0,
  "reqForwardedFor": "12.34.56.78, 127.0.0.1",
  "reqHost": "2eae77cc83224941210ecef3e1e1dcd6-596.internal",
  "reqMethod": "OPTIONS",
  "reqPath": "arvados/v1/containers/tordo-dz642-dp0uvk7d13zo1t1/log/",
  "reqQuery": "",
  "respBody": "request failed: https://tordo.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596: 502 Bad Gateway: errors: [request failed: https://jutro.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=jutro-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v) request failed: https://pirca.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=pirca-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v)]\n",
  "respBytes": 757,
  "respStatus": "Internal Server Error",
  "respStatusCode": 500,
  "time": "2023-04-10T19:54:28.189688357Z",
  "timeToStatus": 300.01849,
  "timeTotal": 300.018498,
  "timeWriteBody": 8e-06
}
It looks like the reason 401 was handled as retryable is that controller returns 500 when the remote clusters all respond 401 (token not accepted).
  • In this specific case, controller should return 404
  • In general, when controller does a "try all remotes" and all remotes respond 4xx, controller should respond 4xx (422?) rather than 5xx.

Your branch for this ticket should revert 38bae05cc. That test should work as before after your changes.


Subtasks 1 (0 open1 closed)

Task #20636: Review 20425-fed-pdh-retryResolvedTom Clegg06/09/2023Actions

Related issues

Related to Arvados - Bug #12684: Let user specify a retry strategy on the client object, used for all API callsResolvedBrett Smith05/09/2023Actions
Actions #1

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz over 1 year ago

I feel like we ran into something like this in the past, don't know if it was fixed, where certain federation error conditions were reported incorrectly.

Actions #3

Updated by Brett Smith over 1 year ago

  • Related to Bug #12684: Let user specify a retry strategy on the client object, used for all API calls added
Actions #4

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)

Controller logs show the requests coming from keep-web at these times:

$ zgrep req-vmb9quhf11z51pmznm3v /var/log/arvados/20230410-arvados-controller.log.11.gz | grep '"request"' | jq .time
"2023-04-10T19:49:28.187992569Z" 
"2023-04-10T19:49:29.237081095Z" 
"2023-04-10T19:49:31.279749327Z" 
"2023-04-10T19:49:35.320735146Z" 
"2023-04-10T19:49:43.374702382Z" 
"2023-04-10T19:49:59.416408965Z" 
"2023-04-10T19:50:29.459708263Z" 
"2023-04-10T19:50:59.502499590Z" 
"2023-04-10T19:51:29.544149362Z" 
"2023-04-10T19:51:59.589114689Z" 
"2023-04-10T19:52:29.633098321Z" 
"2023-04-10T19:52:59.677319460Z" 
"2023-04-10T19:53:29.720274015Z" 
"2023-04-10T19:53:59.764863571Z" 
Actions #5

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
Actions #6

Updated by Brett Smith over 1 year ago

  • Description updated (diff)
Actions #7

Updated by Tom Clegg over 1 year ago

20425-fed-pdh-retry @ e473db8ff60533953ac69f97f2f8c018e487671e

No fix yet, just a failing test checking for a 4xx response when all upstreams return 4xx.

Actions #8

Updated by Tom Clegg over 1 year ago

  • Status changed from New to In Progress
Actions #9

Updated by Tom Clegg over 1 year ago

20425-fed-pdh-retry @ fb5de708b8e1466d2e541a618b9eca7c92a7d6f2 -- developer-run-tests: #3695

When looking up a collection by PDH, return 422 (not 502) if local returns 404, all remotes return non-retryable errors, and not all errors are 404.

Actions #10

Updated by Tom Clegg over 1 year ago

20425-fed-pdh-retry @ 5240e4ddb64c6064a58ded1673bed84d95288a3a -- developer-run-tests: #3696
developer-run-tests-apps-workbench-integration: #3992

Decrease timeout from 5m to 1m for keep-web → controller calls.

(Cancelling them when the caller hangs up would be ideal, but is not that easy, because of the way sessions are shared by concurrent keep-web transactions.)

Actions #11

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Future to Development 2023-07-05 sprint
Actions #12

Updated by Brett Smith over 1 year ago

Tom Clegg wrote in #note-10:

20425-fed-pdh-retry @ 5240e4ddb64c6064a58ded1673bed84d95288a3a -- developer-run-tests: #3696
developer-run-tests-apps-workbench-integration: #3992

This looks good to me, please merge. Sorry for the delay.

Actions #13

Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Tom Clegg
Actions #14

Updated by Tom Clegg over 1 year ago

  • Status changed from In Progress to Resolved
Actions #15

Updated by Peter Amstutz about 1 year ago

  • Release set to 66
Actions

Also available in: Atom PDF