Project

General

Profile

Bug #20425

Updated by Tom Clegg over 1 year ago

When handling a "GET /arvados/v1/collections/{pdh}" request (from keep-web) for a collection that doesn't exist locally, controller tries retrieving it from remote clusters. 
 * If the remote clusters reject the token with 401, keep-web seems to enter a sleep-and-retry loop (possibly auto-retry from #19972) even though 
 ** 401 shouldn't be retryable 
 ** keep-web shouldn't be auto-retrying these remote cluster requests -- at least not for 5 minutes 
 * keep-web log indicates the request stayed alive for 5 minutes even if the client times out and hangs up after 20 seconds, suggesting the request context is not being propagated/checked properly. 

 Encountered while debugging #19889 / #18790 (the relevant collection was trashed). 

 <pre><code <code class="json"> 
 { 
   "ClusterID": "tordo", 
   "PID": 25591, 
   "RequestID": "req-1bo44d9wucz2q1q91ji0", 
   "level": "info", 
   "msg": "response", 
   "remoteAddr": "10.253.0.44:59982", 
   "reqBytes": 0, 
   "reqForwardedFor": "12.34.56.78, 127.0.0.1", 
   "reqHost": "2eae77cc83224941210ecef3e1e1dcd6-596.internal", 
   "reqMethod": "OPTIONS", 
   "reqPath": "arvados/v1/containers/tordo-dz642-dp0uvk7d13zo1t1/log/", 
   "reqQuery": "", 
   "respBody": "request failed: https://tordo.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596: 502 Bad Gateway: errors: [request failed: https://jutro.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=jutro-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v) request failed: https://pirca.arvadosapi.com/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=tordo-: 401 Unauthorized: //railsapi.internal/arvados/v1/collections/2eae77cc83224941210ecef3e1e1dcd6+596?forwarded_for=pirca-tordo-: 401 Unauthorized: Not logged in (req-vmb9quhf11z51pmznm3v)]\n", 
   "respBytes": 757, 
   "respStatus": "Internal Server Error", 
   "respStatusCode": 500, 
   "time": "2023-04-10T19:54:28.189688357Z", 
   "timeToStatus": 300.01849, 
   "timeTotal": 300.018498, 
   "timeWriteBody": 8e-06 
 } 
 </code></pre> 

 It looks like the reason 401 was handled as retryable is that controller returns 500 when the remote clusters all respond 401 (token not accepted). 
 * In this specific case, controller should return 404 
 * In general, when controller does a "try all remotes" and all remotes respond 4xx, controller should respond 4xx (422?) rather than 5xx. 
 </code>

Back