Project

General

Profile

Actions

Bug #20750

closed

collection sharing tokens shouldn't leak account info of the person sharing (user/currrent)

Added by Peter Amstutz 10 months ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Story points:
3.0
Release relationship:
Auto

Description

Serving files through keep web requires the scopes documented at #20249

However it would be much better if it did not require

"GET /arvados/v1/users/current"

Because that means the sharing link can be used to leak personal information about the person sharing it -- their name, email address, any profile information stored on the user record, etc.

Either the relevant keep-web (or controller) requests should not require users/current, or we should introduce a new API call which returns only the minimum information and use that.

If the primary use of the endpoint is to determine either if the token is valid, or get just the user uuid that is associated with the token, we can already do that with api_client_authorization/current.

Update

The problem is specific to login cluster federation.

api_client_authorization#validate invokes users/current on the login cluster to get the user record.

So it is not just a matter of "just" checking that the token is valid.

Potentially we could rearrange the validation to check the token first, then check if the user record already exists in the local db, then only if the user is unknown or inactive do we try to query for the user record.


Subtasks 1 (0 open1 closed)

Task #20816: Review 20750-sharing-token-scopesResolvedBrett Smith08/24/2023Actions

Related issues

Related to Arvados - Bug #19933: Create working sharing URLs in login federationClosedPeter AmstutzActions
Related to Arvados - Bug #21617: Timeout error reading content from collection on a remote clusterResolvedTom CleggActions
Precedes Arvados - Idea #20927: When validating a remote token, translate transient network errors to a 5xx statusNew08/25/202308/25/2023Actions
Actions #1

Updated by Peter Amstutz 10 months ago

  • Related to Bug #19933: Create working sharing URLs in login federation added
Actions #3

Updated by Peter Amstutz 10 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 10 months ago

  • Subject changed from collection sharing tokens shouldn't leak account info of the person sharing to collection sharing tokens shouldn't leak account info of the person sharing (user/currrent)
Actions #5

Updated by Peter Amstutz 10 months ago

  • Description updated (diff)

Here

Actions #6

Updated by Peter Amstutz 10 months ago

  • Target version changed from Future to Development 2023-08-02 sprint
Actions #7

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2023-08-02 sprint to Development 2023-08-16
Actions #8

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2023-08-16 to Development 2023-08-30
Actions #9

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2023-08-30 to Development 2023-08-16
Actions #10

Updated by Peter Amstutz 9 months ago

  • Story points set to 3.0
Actions #11

Updated by Peter Amstutz 9 months ago

  • Assigned To set to Brett Smith
Actions #12

Updated by Peter Amstutz 9 months ago

From discussion:

want to rearrange the validate method to

  1. get /arvados/v1/api_client_authorizations/current first
  2. if this doesn't return a 200 result, fail
  3. test if a local database record exists with owner_uuid from the token record
  4. examine the scopes and see if it's possible to fetch user/current, and if so, fetch it
  5. if it's not possible to fetch the remote user, check if there is a local user record already
  6. if there is no remote record and no local user record, fail
  7. if there is a remote record, create/update the local record
  8. if there is no remote record but there is a local record, proceed without updating the local record

second idea from chat:

Make GET /arvados/v1/api_client_authorizations/current a special case that isn't subject to scopes, so if you have a valid token you can always get the token metadata, even if it isn't listed in scopes. Combined with the main fix proposed here, this would make sharing work in login cluster configurations without requiring changes to documentation or code.

Actions #13

Updated by Peter Amstutz 9 months ago

  • Related to deleted (Bug #19933: Create working sharing URLs in login federation)
Actions #14

Updated by Peter Amstutz 9 months ago

  • Blocks Bug #19933: Create working sharing URLs in login federation added
Actions #15

Updated by Peter Amstutz 9 months ago

  • Release set to 66
Actions #16

Updated by Brett Smith 9 months ago

  • Status changed from New to In Progress

I have a branch with "tokens can always fetch themselves" implemented, tested, and documented, and I know what I want to do for the rest of the implementation, I'm just less sure how to test it.

Actions #17

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2023-08-16 to Development 2023-08-30
Actions #18

Updated by Brett Smith 9 months ago

20750-sharing-token-scopes @ cd669fe0345a9cbc09973ae957b97a0715364488 - developer-run-tests: #3783

My implementation is slightly different from the steps sketched out. Instead of examining scopes to decide whether or not to fetch the remote user record, we always try to fetch it, and handle any resulting error as needed. This meets the functional requirements with less code, and it means we also do the right thing in other error cases like intermittent network failure.

So the steps are now more like:

  1. Get the current token (no longer subject to scopes)
  2. If that fails, propagate the error
  3. Try to load a local user record from the token's owner_uuid
  4. Try to load a remove user record
  5. [omitted, was already kinda redundant with step 3]
  6. If we got neither a local nor remote user record, fail
  7. If we did not get a remote user record, use the existing local record as-is, without updating it
  8. If we got both, proceed with creating/updating the local record as before
Actions #19

Updated by Brett Smith 9 months ago

FAIL: integration_test.go:883: IntegrationSuite.TestListUsers

user token: "v2/z1111-gj3su-69clx38l9gx0e48/qyli4tcc18e3u2mrmj8douorcim2ho4fbvee50o61780g8dua" 
user UUID: "z1111-tpzed-0ijasr0s7xwtaxs" 
integration_test.go:948:
    c.Assert(err, check.ErrorMatches, `.*401 Unauthorized.*`)
... error string = "request failed: https://127.0.0.33:36909/arvados/v1/users/current: 502 Bad Gateway: //railsapi.internal/arvados/v1/users/current: 502 Bad Gateway" 
... regex string = ".*401 Unauthorized.*"

This is a real failure. The test deactivates a user on their home cluster, then makes an API request using their token on another cluster. Under the old logic, ApiClientAuthorization::validate would fail and return 401. Now it succeeds and lets the request through because everything looks good in the local database, but then the actual remote API request fails, and that turns into 502. Looking at it to see if there's somewhere else we can propagate status codes.

Actions #20

Updated by Brett Smith 9 months ago

Brett Smith wrote in #note-19:

Under the old logic, ApiClientAuthorization::validate would fail and return 401. Now it succeeds and lets the request through because everything looks good in the local database, but then the actual remote API request fails, and that turns into 502.

This was misdiagnosed. The real problem was that the exception did not get propagated into an HTTP error the way I expected because it never went through ApplicationController. The failing call to ApiClientAuthorization::validate was in CurrentApiToken#call, which happens before we ever get into the ActionDispatch stack.

Discussed at standup and agreed it made sense to add exception handling here along the same lines of what ApplicationController#render_error does. There are other places in API server that call ApiClientAuthorization::validate, so having it raised exceptions that can be handled well in both places is still useful.

20750-sharing-token-scopes @ afac7abd6f7ee614cd1bbe7413d196f7c149387e - developer-run-tests: #3792

wb1 tests that passed prior to the one-character documentation bugfix commit: developer-run-tests-apps-workbench-integration: #4099

Actions #21

Updated by Tom Clegg 8 months ago

Does a network problem while checking the remote cluster, say "no route to host", return a 5xx? (I'm suspicious of the code that classifies errors as "remote" based on http_status.)

What happens when the remote cluster is older than this version? I'm guessing the only thing that would break is a token that used to work because it explicitly had .../users/current in scope. Is this worth mentioning in upgrade notes?

While looking for (and not finding) places where we might have generated or recommended such tokens, I found that the exception added here in source:services/api/app/models/api_client_authorization.rb@afac7abd6f7ee614cd1bbe7413d196f7c149387e#L82 is similar to scope-checking / exceptions that already exist in source:services/api/app/middlewares/arvados_api_token.rb@main#L28 and I'm wondering whether we should be removing some redundant/duplicated code here. Or did you already look at that and decide not to?

(It looks like the middleware code is trying to only allow api_client_authorizations/current when the client is a remote cluster looking to validate a salted token, while the new code here is trying to allow any client to do that.)

Actions #22

Updated by Peter Amstutz 8 months ago

  • Blocks deleted (Bug #19933: Create working sharing URLs in login federation)
Actions #23

Updated by Peter Amstutz 8 months ago

  • Related to Bug #19933: Create working sharing URLs in login federation added
Actions #24

Updated by Peter Amstutz 8 months ago

  • Target version changed from Development 2023-08-30 to Development 2023-09-13 sprint
Actions #25

Updated by Brett Smith 8 months ago

Tom Clegg wrote in #note-21:

Does a network problem while checking the remote cluster, say "no route to host", return a 5xx? (I'm suspicious of the code that classifies errors as "remote" based on http_status.)

You're right, this is mistaken, I got it confused with some other dynamic language HTTP library. Ruby's httpclient raises the socket errors like ENETUNREACH like you would expect.

I am happy to rebase and correct the commit messages if that works for you. I am also happy to make a ticket to improve our error handling here. I would rather not do the actual improvement in this branch just because it's more scope. Most of the time, these errors will be handled the same way they are in the main branch today: they'll be caught by the generic rescue block and the method will return nil to indicate no authorization. The only exception is: if getting the current user fails this way, but we have a cached user in the database, the new code will fall back to using that instead. That seems okay both from a security perspective and from a ticket scope perspective.

What happens when the remote cluster is older than this version? I'm guessing the only thing that would break is a token that used to work because it explicitly had .../users/current in scope. Is this worth mentioning in upgrade notes?

I don't believe anything should break. The branch strictly relaxes the previous (undocumented) requirements. The only caveat is that you may not get the full benefit of this change until all API servers in a federation have been upgraded. I feel like that is (or at least should be?) standard expectations for federation and doesn't need a special call out.

The old logic to validate a remote token was, both /users/current and /api_client_authorizations/current had to succeed. This implicitly means that for any remote token to be validated, it has to be scoped for both those endpoints.

The new logic to validate a remote token is, /api_client_authorizations/current must succeed (and this no longer requires an explicit token scope), and a user record must be available (via /users/current if nothing else).

Any token that could be remotely validated under the old rules can still be validated under the new rules. It may have some no-longer-necessary scopes, but that's it.

If you have a token without /api_client_authorizations/current in scope, and you try to authorize to a post-20750 server, and it tries to remotely validate it against a pre-20750 server, the only change I think you'll see is that the error code changes from 401 (the generic "couldn't validate your token" code returned pre-branch) to 403 (reflecting that your token is valid but does not have the reuqired scope, now forwarded from the remote server to the client thanks to the middleware changes in this branch).

While looking for (and not finding) places where we might have generated or recommended such tokens, I found that the exception added here in source:services/api/app/models/api_client_authorization.rb@afac7abd6f7ee614cd1bbe7413d196f7c149387e#L82 is similar to scope-checking / exceptions that already exist in source:services/api/app/middlewares/arvados_api_token.rb@main#L28 and I'm wondering whether we should be removing some redundant/duplicated code here. Or did you already look at that and decide not to?

There is some functional overlap here, and maybe a refactoring opportunity, but they don't quite do the same thing. The code currently in main basically gates whether or not this RailsAPI server considers the request's remote parameter: it only does so for these auth-related API endpoints. My code adds a similar check to ApiClientAuthorization#scopes_allow_request? because I believe that's the most reliable way to implement this story's functional requirement of "token can always GET its current itself." Note this requirement holds for both local and remote requests, so it's not obvious how to hook into something remote-specific.

I have not thought about this deeply, but, the current conditional you're pointing to seems to be one of a theme I noticed in ApiClientAuthorizations::validate, where we do a lot more to check the validity and consistency of received data than most Arvados code does. On the one hand, I get it: this is security-sensitive code, more checks prevent accidental authorization where none should occur. But they're weirdly inconsistent: we check the types of some API results but not others; for consistent cluster IDs in some API results but not others; etc. To me this check feels like that: it's fine as it goes, but it also feels kind of "random" without being part of a larger consistency enforcement story.

Actions #26

Updated by Tom Clegg 8 months ago

Brett Smith wrote in #note-25:

I am happy to rebase and correct the commit messages if that works for you

I don't object to that, although I'd be more inclined to work on comments than commit messages.

The old logic to validate a remote token was, both /users/current and /api_client_authorizations/current had to succeed. This implicitly means that for any remote token to be validated, it has to be scoped for both those endpoints.

Ah. This is the part I hadn't taken into account. I was thinking that a token that allowed /users/current but not /api_client_authorizations/current would stop working. But given this point, it already wouldn't have worked, so there's no issue.

There is some functional overlap here, and maybe a refactoring opportunity, but they don't quite do the same thing. [...]

This all makes sense / answers my question.

[...] fine as it goes, but it also feels kind of "random" without being part of a larger consistency enforcement story.

Agree, I don't think we need to be creeping into that effort here.

LGTM, thanks for the additional explanations.

Actions #27

Updated by Brett Smith 8 months ago

  • Precedes Idea #20927: When validating a remote token, translate transient network errors to a 5xx status added
Actions #28

Updated by Brett Smith 8 months ago

Tom Clegg wrote in #note-26:

I don't object to that, although I'd be more inclined to work on comments than commit messages.

Added in 73931b1077dbf1f6cb195dfc427b1a7a6fe89a33 before merge.

Actions #29

Updated by Brett Smith 8 months ago

  • Status changed from In Progress to Resolved
Actions #30

Updated by Tom Clegg about 1 month ago

  • Related to Bug #21617: Timeout error reading content from collection on a remote cluster added
Actions

Also available in: Atom PDF