Feature #13493

Federated record retrieval

Added by Peter Amstutz over 1 year ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
06/28/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release:
Release relationship:
Auto

Description

From Federation implementation roadmap

When retrieving a workflow record for arvados-cwl-runner, API server notices workflow UUID is remote, and fetches it with salted token (instead of looking in local DB)

Implementation notes
  • This is a feature of the new (Go) API server (#13497)
  • Implement as http handler middleware ("proxy to remote cluster if applicable, otherwise fall through to next handler")
Configuration

Subtasks

Task #13660: Review 13493-federation-proxyResolvedTom Clegg

Task #14045: Document federation features / designResolvedPeter Amstutz

Task #14054: Review 13493-document-federationResolvedPeter Amstutz


Related issues

Related to Arvados - Story #9053: [Epic] Port APIs to GoNew

Related to Arvados - Feature #13993: [API] Fetch remote-hosted collection by UUIDResolved09/07/2018

Blocked by Arvados - Story #13497: [API] Initial "arvados-controller" server that proxies API endpoints to Rails serverResolved06/15/2018

Associated revisions

Revision b3b3d214
Added by Tom Clegg about 1 year ago

Merge branch '13493-federation-proxy'

refs #13493

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision ab92b051
Added by Peter Amstutz 12 months ago

Merge branch '13493-document-federation' refs #13493

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Peter Amstutz over 1 year ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#3 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#5 Updated by Tom Clegg over 1 year ago

#6 Updated by Tom Clegg over 1 year ago

  • Blocked by Story #13497: [API] Initial "arvados-controller" server that proxies API endpoints to Rails server added

#7 Updated by Peter Amstutz over 1 year ago

  • Status changed from In Progress to New

#8 Updated by Tom Clegg over 1 year ago

  • Description updated (diff)

#9 Updated by Tom Morris over 1 year ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 2.0

#10 Updated by Tom Morris about 1 year ago

  • Target version changed from Arvados Future Sprints to 2018-07-03 Sprint

#11 Updated by Tom Clegg about 1 year ago

  • Assigned To set to Tom Clegg

#12 Updated by Tom Clegg about 1 year ago

  • Status changed from New to In Progress

#13 Updated by Tom Clegg about 1 year ago

13493-federation-proxy @ 95de13bdab14eb803a4c9e2243df50e1eb0df69a ... https://ci.curoverse.com/job/developer-run-tests/777/
  • Doesn't cover all possible permutations of {issuer of client auth, uuid of object being retrieved, cluster receiving proxyable request}. In the context of using arvados-cwl-runner on cluster A, loading a workflow from cluster B:
    • Works if your token was issued by cluster A (i.e., your user account is also on cluster A)
    • Works if your token was issued by cluster B, and has not been restricted to only work on cluster A (afaik there is no way to get a UI to offer this pair of host/token env vars)
    • Doesn't work if your token was issued by cluster B and has been restricted to only work on cluster A (e.g., by logging in to Workbench B, following a link to Workbench A, and using "current token" in the nav menu)
    • Doesn't work if your token was issued by cluster C (to make this work, API server needs to start issuing v2 tokens)
  • Doesn't support object types other than workflow -- the idea is to use federation to host workflows on cluster A and use them on clusters B..E, even though we haven't yet dealt with all the complications/implications of proxying other kinds of requests

The test suite is a bit counterintuitive: unlike most tests, "zzzzz" is the remote cluster, and "zhome" is the local cluster whose proxy behavior is being tested, so that the remote cluster can be hooked up to the Rails server provided by run-tests.sh, which is always "zzzzz".

#14 Updated by Peter Amstutz about 1 year ago

  • Missing support for looking up remote clusters by DNS.
  • SaltToken() returns ErrObsoleteToken for "plain" tokens, so it depends on the API server giving out "v2/" tokens (or the client converting them to 'v2' tokens) but I don't think the API server does that yet.
  • Decides if a token is salted or not based on only the token length, which seems likely to cause problems in the future. Could we tweak the v2 token format so there is a way to unambiguously indicate a token is salted?
  • Going a bit further, did we ever investigate using something like JSON Web Tokens (JWT) for Arvados tokens? This would provide a standard structured format for token payload and metadata.
  • It appears that if the token was provided in the query string as ?api_token=... it is not filtered out, and will be passed along in addition to the salted token in the Authorization header.

#15 Updated by Tom Clegg about 1 year ago

Peter Amstutz wrote:

  • Missing support for looking up remote clusters by DNS.

Not sure whether/when we'll need this, but that's right, it's not included here.

  • SaltToken() returns ErrObsoleteToken for "plain" tokens, so it depends on the API server giving out "v2/" tokens (or the client converting them to 'v2' tokens) but I don't think the API server does that yet.

That's right, that's where the "cluster C" restriction above comes from.

  • Decides if a token is salted or not based on only the token length, which seems likely to cause problems in the future. Could we tweak the v2 token format so there is a way to unambiguously indicate a token is salted?

We could, although length==40 seems unambiguous. What kind of problems do you have in mind?

  • Going a bit further, did we ever investigate using something like JSON Web Tokens (JWT) for Arvados tokens? This would provide a standard structured format for token payload and metadata.

We could investigate that. But this is probably not the best place to open a discussion. Wiki?

  • It appears that if the token was provided in the query string as ?api_token=... it is not filtered out, and will be passed along in addition to the salted token in the Authorization header.

Indeed. Will fix.

#16 Updated by Tom Clegg about 1 year ago

  • Target version changed from 2018-07-03 Sprint to 2018-07-18 Sprint

#17 Updated by Tom Clegg about 1 year ago

  • Story points changed from 2.0 to 1.0

#18 Updated by Tom Clegg about 1 year ago

13493-federation-proxy @ 64a357242cc7ba9fb877a9b3d622160318098c5c https://ci.curoverse.com/job/developer-run-tests/782/
  • removes ?api_token=...
  • cleanup tests, add "mocked remote" for testing request details like the query string

#19 Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote:

Peter Amstutz wrote:

  • Missing support for looking up remote clusters by DNS.

Not sure whether/when we'll need this, but yes, it's missing.

The API server supports it already, so why wouldn't we do it now?

  • SaltToken() returns ErrObsoleteToken for "plain" tokens, so it depends on the API server giving out "v2/" tokens (or the client converting them to 'v2' tokens) but I don't think the API server does that yet.

That's right, that's where the "cluster C" restriction above comes from.

That doesn't address my comment, if you get a "classic" token from cluster A, you still can't use it for federation unless some as-yet-unwritten client code converts it to "v2/" format first. So it won't actually work?

  • Decides if a token is salted or not based on only the token length, which seems likely to cause problems in the future. Could we tweak the v2 token format so there is a way to unambiguously indicate a token is salted?

We could, although length==40 seems unambiguous. What kind of problems do you have in mind?

Because we don't specify the plain token is required to have a certain length that != 40, it would be legal to have an unsalted token that is 40 characters long. If we're going to rely on a particular behavior we should document it.

  • Going a bit further, did we ever investigate using something like JSON Web Tokens (JWT) for Arvados tokens? This would provide a standard structured format for token payload and metadata.

We could investigate that.

I'll do a little research and see if it makes sense for us.

#20 Updated by Peter Amstutz about 1 year ago

Also: please add a documentation page under "Architecture" that describes the current state of federation features.

#21 Updated by Peter Amstutz about 1 year ago

Also there should be a documentation page describing how to install & configure arvados-controller (even if it is marked EXPERIMENTAL).

#22 Updated by Peter Amstutz about 1 year ago

Comments about testing:

  • Looks like we test GET and PATCH, what about POST, DELETE, or weird methods like OPTIONS?
  • It looks like TestRemoteError tests for connection error, how about a test where the remote returns 404?
  • Maybe errors talking to clusters be returned as "Bad Gateway" not "Internal Server Error" ?

#23 Updated by Tom Clegg about 1 year ago

Peter Amstutz wrote:

  • Missing support for looking up remote clusters by DNS.

The API server supports it already, so why wouldn't we do it now?

The API server supports accepting and validating tokens issued by *.arvadosapi.com. Here, a wildcard would support proxying requests to *.arvadosapi.com. Both are "looking up by DNS" but they are different features.

I guess the reason not to do it now would be that we have other things to do now. It might be better to think about this in the context of #13619.

  • SaltToken() returns ErrObsoleteToken for "plain" tokens, so it depends on the API server giving out "v2/" tokens (or the client converting them to 'v2' tokens) but I don't think the API server does that yet.

That's right, that's where the "cluster C" restriction above comes from.

That doesn't address my comment, if you get a "classic" token from cluster A, you still can't use it for federation unless some as-yet-unwritten client code converts it to "v2/" format first. So it won't actually work?

Aha. I thought "it" in your comment referred to SaltToken(), I didn't know you were thinking of the "token issued by A" scenario.

Indeed, for that to work, we need to either fix the FIXME in saltAuthToken(), or start issuing v2 tokens (and make sure Workbench recognizes/salts them properly).

       if err == auth.ErrObsoleteToken {
               // FIXME: If the token exists in our own database,
               // salt it for the remote. Otherwise, assume it was
               // issued by the remote, and pass it through
               // unmodified.
               token = creds.Tokens[0]

If we're going to rely on a particular behavior we should document it.

Agreed. Added API token format

#24 Updated by Tom Clegg about 1 year ago

Peter Amstutz wrote:

Also there should be a documentation page describing how to install & configure arvados-controller (even if it is marked EXPERIMENTAL).

Yes, this is happening on #13497. Waiting for ops feedback about Installing controller service.

  • Looks like we test GET and PATCH, what about POST, DELETE, or weird methods like OPTIONS?

Will add.

  • It looks like TestRemoteError tests for connection error, how about a test where the remote returns 404?

TestNoAccess tests this (get object with no read access = 404).

  • Maybe errors talking to clusters be returned as "Bad Gateway" not "Internal Server Error" ?

Sounds good.

#25 Updated by Tom Clegg about 1 year ago

13493-federation-proxy @ 66fc74001e777ac7ceff2b02cfc459b1368f42f3 https://ci.curoverse.com/job/developer-run-tests/809/
  • upgrade locally-issued v1 tokens to v2 so they can be salted for proxying (it's still desirable to move to issuing v2 tokens, but I don't want that and the associated Workbench updates to block this -- also, even after we do that, the transparent upgrade will allow old tokens to work as well as new tokens)
  • test OPTIONS method (and updated apiserver to return an empty body instead of "-")

#26 Updated by Peter Amstutz about 1 year ago

s.remoteServer.Server.Handler.ServeHTTP(rec, req) // direct to remote -- can't proxy a create req because no uuid

You could look for workflow.owner_uuid and use the the cluster id from that. (Not going to block the branch on this, but we should think about it.)

        err = db.QueryRowContext(req.Context(), `SELECT uuid FROM api_client_authorizations WHERE api_token=$1 AND (expires_at IS NULL OR expires_at > current_timestamp) LIMIT 1`, aca.APIToken).Scan(&aca.UUID)

This line is really long

   def empty
-    render text: "-" 
+    render text: "" 
   end

This is awesome.

So far this looks good. I'm going to do some manual testing / playing around with arvbox.

#27 Updated by Tom Clegg about 1 year ago

  • Target version changed from 2018-07-18 Sprint to 2018-08-01 Sprint

#28 Updated by Peter Amstutz about 1 year ago

I'm happy to say after a bunch or arvbox hacking (adding support for arvados-controller) I was able to do a federated query between two arvbox instances! Both the "arv" command line and workbench were able to retrieve a record from a federated instance.

LGTM!

#29 Updated by Tom Clegg about 1 year ago

  • Target version changed from 2018-08-01 Sprint to 2018-08-15 Sprint

#30 Updated by Tom Clegg about 1 year ago

Peter Amstutz wrote:

Also: please add a documentation page under "Architecture" that describes the current state of federation features.

The architecture section seems like a good place to explain how federation works under the hood, but would it be more appropriate to put the "feature is available" info in a usage context?

I suppose these also have a dependency on docs for setting up the relevant parts of the cluster config and Rails API config.

#31 Updated by Peter Amstutz about 1 year ago

  • Related to Feature #13993: [API] Fetch remote-hosted collection by UUID added

#32 Updated by Tom Clegg about 1 year ago

  • Assigned To deleted (Tom Clegg)
  • Target version changed from 2018-08-15 Sprint to 2018-09-05 Sprint

#33 Updated by Tom Clegg about 1 year ago

  • Assigned To set to Peter Amstutz

#34 Updated by Peter Amstutz about 1 year ago

  • Target version changed from 2018-09-05 Sprint to 2018-09-19 Sprint

#35 Updated by Peter Amstutz 12 months ago

13493-document-federation @ 5d8eb6ead9e2ca32f424a8385979485a902cf09f

First pass at documenting the mechanics of federation: cluster ids, salted tokens, proxying record & keep block requests.

#36 Updated by Tom Clegg 12 months ago

There's a lot of information here about how things have been implemented under the hood, which is fine if we can keep it up to date.

We should probably mention that most of the federation story isn't actually implemented/supported yet -- just some specific parts, and with certain simplifying assumptions -- and that the behaviors described here should not be interpreted as a stable API.

This seems like a good place to put some (badly needed) sensible rules for choosing a cluster ID.
  • For throw-away testing purposes, use "z****"
  • For experimental/local clusters that will never be referenced by the outside world, use "x****"
  • For long-lived clusters, contact us to get an ID that won't conflict with anyone else in the future

#37 Updated by Peter Amstutz 12 months ago

Tom Clegg wrote:

There's a lot of information here about how things have been implemented under the hood, which is fine if we can keep it up to date.

We should probably mention that most of the federation story isn't actually implemented/supported yet -- just some specific parts, and with certain simplifying assumptions -- and that the behaviors described here should not be interpreted as a stable API.

This seems like a good place to put some (badly needed) sensible rules for choosing a cluster ID.
  • For throw-away testing purposes, use "z****"
  • For experimental/local clusters that will never be referenced by the outside world, use "x****"
  • For long-lived clusters, contact us to get an ID that won't conflict with anyone else in the future

13493-document-federation @ 898cd195f8b5916ed9c3a83dec212e858bfa610f updated taking these comments into account, also adds missing diagram

#38 Updated by Tom Morris 12 months ago

  • Target version changed from 2018-09-19 Sprint to 2018-10-03 Sprint

#39 Updated by Tom Clegg 12 months ago

LGTM, thanks

#40 Updated by Peter Amstutz 12 months ago

  • Status changed from In Progress to Resolved

#41 Updated by Tom Morris 10 months ago

  • Release set to 14

Also available in: Atom PDF