Bug #21617

closed

Timeout error reading content from collection on a remote cluster

Added by Tom Clegg 9 months ago. Updated 9 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Keep
Story points: -
Release relationship: Auto

Description

In a 3-way federation with login cluster z1111:
  • a collection stored on z1111 can be read from z2222 (e.g., workbench.z2222/collections/z1111-4zz18-...)
  • a collection stored on z2222 cannot be read from z1111 (timeout)
  • a collection stored on z2222 cannot be read from z3333 (timeout)

It looks like the intermediate cluster's keepstore process cannot retrieve the list of keep services from the cluster where the data is stored ("failed to validate remote token") -- this auto-retries in the background for a while, then eventually blockReadRemote gives up.

Manual testing, with jutro/tordo/pirca playing the roles of z1111/z2222/z3333, indicates the same problem existed before and after #2960 was merged and deployed to tordo.


Subtasks 1 (0 open, 1 closed)

Task #21619: Review 21617-fed-content (Resolved, Tom Clegg, 03/29/2024)

Related issues 1 (0 open, 1 closed)

Related to Arvados - Bug #20750: collection sharing tokens shouldn't leak account info of the person sharing (user/currrent) (Resolved, Brett Smith, 08/24/2023)
Actions #1

Updated by Tom Clegg 9 months ago

This starts working if keepstore stops sending the "xxx" placeholder token with the "GET /arvados/v1/keep_services/accessible" request. In the context of that API call:
  • When z2222's keepstore sends a bad token "xxx" to z1111, z1111's controller's OIDC authorizer discovers it's not a valid OIDC token and passes it through to rails, where ApiClientAuthorization.validate() discovers it's not a valid token in the database and returns nil, and the middleware continues processing the request as an anonymous/unauthenticated request. The keep_services/accessible endpoint does not require authentication, so the request succeeds.
  • When z1111 or z3333's keepstore sends a bad token "xxx" to z2222, z2222's controller passes it through to rails, where ApiClientAuthorization.validate() calls back to z1111 (its login cluster) to check the token, receives an HTTP 401 response, and raises an exception. The middleware catches that exception and returns an error to the caller.

IOW, at endpoints that don't require authentication, a login cluster accepts bad tokens but a satellite cluster does not.
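
The asymmetry described above can be illustrated with a small model. This is only a sketch: the function names, the in-memory token set, and the string results are hypothetical and are not Arvados code. It assumes the only relevant difference between the two kinds of cluster is what happens when the token turns out to be invalid.

```python
class RemoteTokenRejected(Exception):
    """Hypothetical stand-in for the exception raised on a 401 from the login cluster."""


def validate_on_login_cluster(token, login_db):
    # A login cluster looks the token up in its own database;
    # an unknown token simply yields None.
    return token if token in login_db else None


def validate_on_satellite(token, login_db):
    # A satellite calls back to its login cluster to check the token.
    # A rejection (modeled here as "not in login_db") raises an exception.
    if token not in login_db:
        raise RemoteTokenRejected(token)
    return token


def handle_request(validate, token, requires_auth):
    # Models the middleware before the fix: an exception raised during
    # validation fails the request, even at an unauthenticated endpoint.
    try:
        auth = validate(token)
    except RemoteTokenRejected:
        return "error"
    if auth is None and requires_auth:
        return "unauthorized"
    return "ok"


login_db = {"valid-token"}  # tokens known to the login cluster (illustrative)

# keep_services/accessible does not require authentication.
on_login = handle_request(
    lambda t: validate_on_login_cluster(t, login_db), "xxx", requires_auth=False
)
on_satellite = handle_request(
    lambda t: validate_on_satellite(t, login_db), "xxx", requires_auth=False
)
print(on_login, on_satellite)  # prints "ok error"
```

In this model the same placeholder token and the same unauthenticated endpoint give different results depending only on which cluster receives the request, which matches the behavior observed in the report.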

It looks like this behavior changed in #20750 when refactoring the rails token-checking middleware.

Actions #2

Updated by Tom Clegg 9 months ago

21617-fed-content @ 3c210fe96edb1c345850e1eb35c93f98d205f843 -- developer-run-tests: #4102

This changes the remote token-checking behavior (when a token indicates that it's from a remote cluster, and when a LoginCluster is configured) to match the local token-checking behavior: when a remote cluster rejects a token, instead of failing the request immediately, we continue checking other provided tokens ("reader token"), and fail later with 401 only if no valid tokens are found and the API endpoint being accessed actually requires authentication.
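
The changed behavior can be sketched roughly as follows. This is an illustrative model, not the code in the branch: `RemoteTokenRejected`, `first_valid_token`, `handle_request`, and the token strings are all hypothetical names chosen for the example.

```python
class RemoteTokenRejected(Exception):
    """Hypothetical stand-in for a remote cluster rejecting a token."""


def first_valid_token(tokens, validate):
    # Try each provided token in order. A remote rejection no longer
    # aborts the request; it just moves on to the next token.
    for token in tokens:
        try:
            if validate(token) is not None:
                return token
        except RemoteTokenRejected:
            continue
    return None


def handle_request(tokens, validate, requires_auth):
    # Fail with 401 only if no token is valid AND the endpoint
    # actually requires authentication.
    if first_valid_token(tokens, validate) is None and requires_auth:
        return 401
    return 200


def validate(token):
    # Illustrative validator: the remote cluster rejects everything
    # except one known-good reader token.
    if token != "good-reader-token":
        raise RemoteTokenRejected(token)
    return token


print(handle_request(["xxx"], validate, requires_auth=False))  # prints 200
print(handle_request(["xxx"], validate, requires_auth=True))   # prints 401
print(handle_request(["xxx", "good-reader-token"], validate, requires_auth=True))  # prints 200
```

The first call corresponds to keepstore sending the "xxx" placeholder to keep_services/accessible: the token is rejected, but the request still succeeds because the endpoint does not require authentication.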

Actions #3

Updated by Tom Clegg 9 months ago

  • Related to Bug #20750: collection sharing tokens shouldn't leak account info of the person sharing (user/currrent) added
Actions #5

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Actions #6

Updated by Brett Smith 9 months ago

Tom Clegg wrote in #note-2:

21617-fed-content @ 3c210fe96edb1c345850e1eb35c93f98d205f843 -- developer-run-tests: #4102

I'll just make my usual nit and say, I think it would be nicer if the tests added to remote_user_test.rb were separate. This way the test name could explain what's important about the different login cluster settings and API endpoints, and why they're worth testing separately. But otherwise this lgtm, thanks.

Actions #7

Updated by Tom Clegg 9 months ago

  • Status changed from In Progress to Resolved
Actions #8

Updated by Peter Amstutz 9 months ago

  • Release set to 69