Bug #21617
Timeout error reading content from collection on a remote cluster
Description
- a collection stored on z1111 can be read from z2222 (e.g., workbench.z2222/collections/z1111-4zz18-...)
- a collection stored on z2222 cannot be read from z1111 (timeout)
- a collection stored on z2222 cannot be read from z3333 (timeout)
It looks like the intermediate cluster's keepstore process cannot retrieve the list of keep services from the cluster where the data is stored ("failed to validate remote token") -- this auto-retries in the background for a while, then eventually blockReadRemote gives up.
Manual testing, with jutro/tordo/pirca playing the roles of z1111/z2222/z3333, indicates the same problem existed before and after #2960 was merged and deployed to tordo.
Updated by Tom Clegg 8 months ago
- When z2222's keepstore sends a bad token "xxx" to z1111, z1111's controller's OIDC authorizer discovers it's not a valid OIDC token and passes it through to rails, where ApiClientAuthorization.validate() discovers it's not a valid token in the database and returns nil, and the middleware continues processing the request as an anonymous/unauthenticated request. The keep_services/accessible endpoint does not require authentication, so the request succeeds.
- When z1111 or z3333's keepstore sends a bad token "xxx" to z2222, z2222's controller passes it through to rails, where ApiClientAuthorization.validate() calls back to z1111 (its login cluster) to check the token, receives an http 401 response, and raises an exception. The middleware catches that exception and returns an error to the caller.
IOW, at endpoints that don't require authentication, a login cluster accepts bad tokens but a satellite cluster does not.
It looks like this behavior changed in #20750 when refactoring the rails token-checking middleware.
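The asymmetry described above can be sketched as a toy model. This is not the Arvados rails code: the function names, the `RemoteRejected` exception, and the dict-based token stores are all hypothetical stand-ins, used only to show why the same bad token gives two different results at an endpoint that needs no authentication.

```python
class RemoteRejected(Exception):
    """Hypothetical stand-in for the HTTP 401 a satellite gets from its login cluster."""


def validate_on_login_cluster(token, local_db):
    # Unknown token: return None, so the caller treats the request as anonymous.
    return local_db.get(token)


def validate_on_satellite(token, local_db, login_cluster_db):
    if token in local_db:
        return local_db[token]
    # Not known locally: ask the login cluster (modeled here as a dict lookup).
    user = login_cluster_db.get(token)
    if user is None:
        raise RemoteRejected("login cluster answered 401")
    return user


def handle_request(validate, token, requires_auth):
    """Return an HTTP-like status code, modeling the behavior before the fix."""
    try:
        user = validate(token)
    except RemoteRejected:
        # The exception becomes an error response, even for public endpoints.
        return 401
    if user is None and requires_auth:
        return 401
    return 200


# Bad token "xxx" at an endpoint that does not require authentication:
login_status = handle_request(
    lambda t: validate_on_login_cluster(t, {}), "xxx", requires_auth=False)
satellite_status = handle_request(
    lambda t: validate_on_satellite(t, {}, {}), "xxx", requires_auth=False)
```

In this model the login cluster answers 200 and the satellite answers 401, which corresponds to keep_services/accessible succeeding on z1111 and failing on z2222.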
Updated by Tom Clegg 8 months ago
21617-fed-content @ 3c210fe96edb1c345850e1eb35c93f98d205f843 -- developer-run-tests: #4102
This changes the remote token-checking behavior (when a token indicates that it's from a remote cluster, and when a LoginCluster is configured) to match the local token-checking behavior: when a remote cluster rejects a token, instead of failing the request immediately, we continue checking other provided tokens ("reader token"), and fail later with 401 only if no valid tokens are found and the API endpoint being accessed actually requires authentication.
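A minimal sketch of the revised decision flow, under stated assumptions: `authorize`, `check_token`, and the use of `PermissionError` to represent a remote rejection are hypothetical names chosen for illustration, not the identifiers used in the branch.

```python
def authorize(tokens, check_token, requires_auth):
    """Return (status, user) after trying every supplied token in order.

    check_token(token) returns a user, returns None for an unknown token,
    or raises PermissionError when a remote cluster rejects the token.
    """
    for token in tokens:
        try:
            user = check_token(token)
        except PermissionError:
            # Remote rejection: treat this token as invalid and keep going
            # instead of failing the whole request immediately.
            continue
        if user is not None:
            return 200, user
    # No valid token found: fail only if the endpoint needs authentication.
    if requires_auth:
        return 401, None
    return 200, None


def remote_check(token):
    # Hypothetical checker: only the token "reader" is accepted.
    if token == "reader":
        return "reader-user"
    raise PermissionError("remote cluster rejected token")


public_result = authorize(["xxx"], remote_check, requires_auth=False)
private_result = authorize(["xxx"], remote_check, requires_auth=True)
fallback_result = authorize(["xxx", "reader"], remote_check, requires_auth=True)
```

Here a rejected token no longer blocks a public endpoint, still yields 401 at a protected one, and does not prevent a later valid token from being used.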
Updated by Tom Clegg 8 months ago
- Related to Bug #20750: collection sharing tokens shouldn't leak account info of the person sharing (user/current) added
Updated by Peter Amstutz 8 months ago
- Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Updated by Brett Smith 8 months ago
Tom Clegg wrote in #note-2:
21617-fed-content @ 3c210fe96edb1c345850e1eb35c93f98d205f843 -- developer-run-tests: #4102
I'll just make my usual nit and say, I think it would be nicer if the tests added to remote_user_test.rb were separate. This way the test name could explain what's important about the different login cluster settings and API endpoints, and why they're worth testing separately. But otherwise this lgtm, thanks.
Updated by Tom Clegg 8 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|75cf6882418dc594e3ada42e433ccccd25435cac.