Bug #21617
closed
Timeout error reading content from collection on a remote cluster
Added by Tom Clegg 8 months ago.
Updated 8 months ago.
Release relationship:
Auto
Description
In a 3-way federation with login cluster z1111:
- a collection stored on z1111 can be read from z2222 (e.g., workbench.z2222/collections/z1111-4zz18-...)
- a collection stored on z2222 cannot be read from z1111 (timeout)
- a collection stored on z2222 cannot be read from z3333 (timeout)
It looks like the intermediate cluster's keepstore process cannot retrieve the list of keep services from the cluster where the data is stored ("failed to validate remote token"). The retrieval is retried automatically in the background for a while, then eventually blockReadRemote gives up.
Manual testing, with jutro/tordo/pirca playing the roles of z1111/z2222/z3333, indicates the same problem existed before and after #2960 was merged and deployed to tordo.
This starts working if keepstore stops sending the "xxx" placeholder token with the "GET /arvados/v1/keep_services/accessible" request. In the context of that API call:
- When z2222's keepstore sends a bad token "xxx" to z1111, z1111's controller's OIDC authorizer discovers it's not a valid OIDC token and passes it through to rails. There, ApiClientAuthorization.validate() discovers it's not a valid token in the database and returns nil, so the middleware continues processing the request as an anonymous/unauthenticated request. The keep_services/accessible endpoint does not require authentication, so the request succeeds.
- When z1111 or z3333's keepstore sends a bad token "xxx" to z2222, z2222's controller passes it through to rails, where ApiClientAuthorization.validate() calls back to z1111 (its login cluster) to check the token, receives an HTTP 401 response, and raises an exception. The middleware catches that exception and returns an error to the caller.
IOW, at endpoints that don't require authentication, a login cluster accepts bad tokens but a satellite cluster does not.
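To make the asymmetry concrete, here is a toy model of the two paths described above. This is only a sketch, not Arvados code: `RemoteTokenRejected`, `validate_local`, `validate_remote`, and `handle_request_before_fix` are hypothetical names, the token database is a plain set, and the result is reduced to "ok" or "error".

```python
class RemoteTokenRejected(Exception):
    """Hypothetical: the login cluster answered 401 for this token."""


def validate_local(token, known_tokens):
    # Login cluster path: look the token up locally.
    # An unknown token yields None instead of raising.
    return token if token in known_tokens else None


def validate_remote(token, login_cluster_tokens):
    # Satellite cluster path: ask the login cluster about the token.
    # A rejection surfaces as an exception.
    if token not in login_cluster_tokens:
        raise RemoteTokenRejected(token)
    return token


def handle_request_before_fix(token, validate, requires_auth):
    """Model of the middleware behavior described in this ticket."""
    try:
        user = validate(token)
    except RemoteTokenRejected:
        # The exception becomes an error even on unauthenticated endpoints.
        return "error"
    if user is None and requires_auth:
        return "error"
    return "ok"


known = {"valid-token"}

# Same bad token, same endpoint that does not require authentication.
login_result = handle_request_before_fix(
    "xxx", lambda t: validate_local(t, known), requires_auth=False)
satellite_result = handle_request_before_fix(
    "xxx", lambda t: validate_remote(t, known), requires_auth=False)

print(login_result, satellite_result)
```

In this model the only difference between the two calls is whether a rejected token comes back as nil/None or as an exception, which is enough to flip the outcome.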
It looks like this behavior changed in #20750 when refactoring the rails token-checking middleware.
21617-fed-content @ 3c210fe96edb1c345850e1eb35c93f98d205f843 -- developer-run-tests: #4102
This changes the remote token-checking behavior (when a token indicates that it's from a remote cluster, and when a LoginCluster is configured) to match the local token-checking behavior: when a remote cluster rejects a token, instead of failing the request immediately, we continue checking other provided tokens ("reader token"), and fail later with 401 only if no valid tokens are found and the API endpoint being accessed actually requires authentication.
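A rough sketch of the intended control flow, for illustration only. The real change is in the rails middleware; `first_valid_token`, `handle_request_after_fix`, and `fake_validate` are hypothetical names, and `PermissionError` merely stands in for "the remote cluster rejected this token".

```python
def first_valid_token(tokens, validate):
    """Return the first token that validates, or None if none do."""
    for token in tokens:
        try:
            if validate(token):
                return token
        except PermissionError:
            # A remote cluster rejected this token. Do not fail the
            # request yet; keep checking the remaining tokens.
            continue
    return None


def handle_request_after_fix(tokens, validate, requires_auth):
    """Fail with 401 only if no token is valid AND auth is required."""
    valid = first_valid_token(tokens, validate)
    if valid is None and requires_auth:
        return 401
    return 200


def fake_validate(token):
    # Hypothetical stand-in for a remote token check.
    if token == "good-reader-token":
        return True
    raise PermissionError("remote cluster rejected token")


# Bad placeholder token, endpoint does not require authentication.
print(handle_request_after_fix(["xxx"], fake_validate, requires_auth=False))
# Bad token only, endpoint requires authentication.
print(handle_request_after_fix(["xxx"], fake_validate, requires_auth=True))
# Bad token followed by a valid reader token.
print(handle_request_after_fix(["xxx", "good-reader-token"], fake_validate, requires_auth=True))
```

The point of deferring the failure is that the decision to return 401 moves from "a token was rejected" to "nothing valid was presented and the endpoint needs it".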
- Related to Bug #20750: collection sharing tokens shouldn't leak account info of the person sharing (user/current) added
- Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Tom Clegg wrote in #note-2:
21617-fed-content @ 3c210fe96edb1c345850e1eb35c93f98d205f843 -- developer-run-tests: #4102
I'll just make my usual nit and say, I think it would be nicer if the tests added to remote_user_test.rb were separate. This way the test name could explain what's important about the different login cluster settings and API endpoints, and why they're worth testing separately. But otherwise this lgtm, thanks.
- Status changed from In Progress to Resolved