Project

General

Profile

Actions

Bug #20425

closed

Inappropriate auto-retry when looking up collection by PDH

Added by Tom Clegg 12 months ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Story points:
-
Release relationship:
Auto

Description

When handling a "GET /arvados/v1/collections/{pdh}" request (from keep-web) for a collection that doesn't exist locally, controller tries retrieving it from remote clusters.
  • If the remote clusters reject the token with 401, keep-web seems to enter a sleep-and-retry loop (possibly auto-retry from #19972) even though
    • 401 shouldn't be retryable
    • keep-web shouldn't be auto-retrying these remote cluster requests -- at least not for 5 minutes
  • keep-web log indicates the request stayed alive for 5 minutes even if the client times out and hangs up after 20 seconds, suggesting the request context is not being propagated/checked properly.

Encountered while debugging #19889 / #18790 (the relevant collection was trashed).

{
  "ClusterID": "tordo",
  "PID": 25591,
  "RequestID": "req-1bo44d9wucz2q1q91ji0",
  "level": "info",
  "msg": "response",
  "remoteAddr": "10.253.0.44:59982",
  "reqBytes": 0,
  "reqForwardedFor": "12.34.56.78, 127.0.0.1",
  "reqHost": "2eae77cc83224941210ecef3e1e1dcd6-596.internal",
  "reqMethod": "OPTIONS",
  "reqPath": "arvados/v1/containers/tordo-dz642-dp0uvk7d13zo1t1/log/",
  "reqQuery": "",
  "respBody": "request failed: https://tordo.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596: 502 Bad Gateway: errors: [request failed: https://jutro.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=jutro-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v) request failed: https://pirca.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=pirca-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v)]\n",
  "respBytes": 757,
  "respStatus": "Internal Server Error",
  "respStatusCode": 500,
  "time": "2023-04-10T19:54:28.189688357Z",
  "timeToStatus": 300.01849,
  "timeTotal": 300.018498,
  "timeWriteBody": 8e-06
}
It looks like the reason 401 was handled as retryable is that controller returns 500 when the remote clusters all respond 401 (token not accepted).
  • In this specific case, controller should return 404
  • In general, when controller does a "try all remotes" and all remotes respond 4xx, controller should respond 4xx (422?) rather than 5xx.

Your branch for this ticket should revert 38bae05cc. That test should work as before after your changes.


Subtasks 1 (0 open1 closed)

Task #20636: Review 20425-fed-pdh-retryResolvedTom Clegg06/09/2023Actions

Related issues

Related to Arvados - Bug #12684: Let user specify a retry strategy on the client object, used for all API callsResolvedBrett Smith05/09/2023Actions
Actions

Also available in: Atom PDF