Project

General

Profile

Actions

Feature #13493

closed

Federated record retrieval

Added by Peter Amstutz almost 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
1.0
Release:
Release relationship:
Auto

Description

From Federation implementation roadmap

When retrieving a workflow record for arvados-cwl-runner, API server notices workflow UUID is remote, and fetches it with salted token (instead of looking in local DB)

Implementation notes
  • This is a feature of the new (Go) API server (#13497)
  • Implement as http handler middleware ("proxy to remote cluster if applicable, otherwise fall through to next handler")
Configuration

Subtasks 3 (0 open3 closed)

Task #13660: Review 13493-federation-proxyResolvedTom Clegg06/28/2018Actions
Task #14045: Document federation features / designResolvedPeter Amstutz06/28/2018Actions
Task #14054: Review 13493-document-federationResolvedPeter Amstutz06/28/2018Actions

Related issues

Related to Arvados Epics - Idea #9053: Port API server to GoNewActions
Related to Arvados - Feature #13993: [API] Fetch remote-hosted collection by UUIDResolvedPeter Amstutz09/07/2018Actions
Blocked by Arvados - Idea #13497: [API] Initial "arvados-controller" server that proxies API endpoints to Rails serverResolvedTom Clegg06/15/2018Actions
Actions #1

Updated by Peter Amstutz almost 6 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz almost 6 years ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz almost 6 years ago

  • Description updated (diff)
Actions #5

Updated by Tom Clegg almost 6 years ago

  • Related to Idea #9053: Port API server to Go added
Actions #6

Updated by Tom Clegg almost 6 years ago

  • Blocked by Idea #13497: [API] Initial "arvados-controller" server that proxies API endpoints to Rails server added
Actions #7

Updated by Peter Amstutz almost 6 years ago

  • Status changed from In Progress to New
Actions #8

Updated by Tom Clegg almost 6 years ago

  • Description updated (diff)
Actions #9

Updated by Tom Morris almost 6 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 2.0
Actions #10

Updated by Tom Morris almost 6 years ago

  • Target version changed from Arvados Future Sprints to 2018-07-03 Sprint
Actions #11

Updated by Tom Clegg almost 6 years ago

  • Assigned To set to Tom Clegg
Actions #12

Updated by Tom Clegg almost 6 years ago

  • Status changed from New to In Progress
Actions #13

Updated by Tom Clegg over 5 years ago

13493-federation-proxy @ 95de13bdab14eb803a4c9e2243df50e1eb0df69a ... https://ci.curoverse.com/job/developer-run-tests/777/
  • Doesn't cover all possible permutations of {issuer of client auth, uuid of object being retrieved, cluster receiving proxyable request}. In the context of using arvados-cwl-runner on cluster A, loading a workflow from cluster B:
    • Works if your token was issued by cluster A (i.e., your user account is also on cluster A)
    • Works if your token was issued by cluster B, and has not been restricted to only work on cluster A (afaik there is no way to get a UI to offer this pair of host/token env vars)
    • Doesn't work if your token was issued by cluster B and has been restricted to only work on cluster A (e.g., by logging in to Workbench B, following a link to Workbench A, and using "current token" in the nav menu)
    • Doesn't work if your token was issued by cluster C (to make this work, API server needs to start issuing v2 tokens)
  • Doesn't support object types other than workflow -- the idea is to use federation to host workflows on cluster A and use them on clusters B..E, even though we haven't yet dealt with all the complications/implications of proxying other kinds of requests

The test suite is a bit counterintuitive: unlike most tests, "zzzzz" is the remote cluster, and "zhome" is the local cluster whose proxy behavior is being tested, so that the remote cluster can be hooked up to the Rails server provided by run-tests.sh, which is always "zzzzz".

Actions #14

Updated by Peter Amstutz over 5 years ago

  • Missing support for looking up remote clusters by DNS.
  • SaltToken() returns ErrObsoleteToken for "plain" tokens, so it depends on the API server giving out "v2/" tokens (or the client converting them to 'v2' tokens) but I don't think the API server does that yet.
  • Decides if a token is salted or not based on only the token length, which seems likely to cause problems in the future. Could we tweak the v2 token format so there is a way to unambiguously indicate a token is salted?
  • Going a bit further, did we ever investigate using something like JSON Web Tokens (JWT) for Arvados tokens? This would provide a standard structured format for token payload and metadata.
  • It appears that if the token was provided in the query string as ?api_token=... it is not filtered out, and will be passed along in addition to the salted token in the Authorization header.
Actions #15

Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

  • Missing support for looking up remote clusters by DNS.

Not sure whether/when we'll need this, but that's right, it's not included here.

  • SaltToken() returns ErrObsoleteToken for "plain" tokens, so it depends on the API server giving out "v2/" tokens (or the client converting them to 'v2' tokens) but I don't think the API server does that yet.

That's right, that's where the "cluster C" restriction above comes from.

  • Decides if a token is salted or not based on only the token length, which seems likely to cause problems in the future. Could we tweak the v2 token format so there is a way to unambiguously indicate a token is salted?

We could, although length==40 seems unambiguous. What kind of problems do you have in mind?

  • Going a bit further, did we ever investigate using something like JSON Web Tokens (JWT) for Arvados tokens? This would provide a standard structured format for token payload and metadata.

We could investigate that. But this is probably not the best place to open a discussion. Wiki?

  • It appears that if the token was provided in the query string as ?api_token=... it is not filtered out, and will be passed along in addition to the salted token in the Authorization header.

Indeed. Will fix.

Actions #16

Updated by Tom Clegg over 5 years ago

  • Target version changed from 2018-07-03 Sprint to 2018-07-18 Sprint
Actions #17

Updated by Tom Clegg over 5 years ago

  • Story points changed from 2.0 to 1.0
Actions #18

Updated by Tom Clegg over 5 years ago

13493-federation-proxy @ 64a357242cc7ba9fb877a9b3d622160318098c5c https://ci.curoverse.com/job/developer-run-tests/782/
  • removes ?api_token=...
  • cleanup tests, add "mocked remote" for testing request details like the query string
Actions #19

Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

Peter Amstutz wrote:

  • Missing support for looking up remote clusters by DNS.

Not sure whether/when we'll need this, but yes, it's missing.

The API server supports it already, so why wouldn't we do it now?

  • SaltToken() returns ErrObsoleteToken for "plain" tokens, so it depends on the API server giving out "v2/" tokens (or the client converting them to 'v2' tokens) but I don't think the API server does that yet.

That's right, that's where the "cluster C" restriction above comes from.

That doesn't address my comment, if you get a "classic" token from cluster A, you still can't use it for federation unless some as-yet-unwritten client code converts it to "v2/" format first. So it won't actually work?

  • Decides if a token is salted or not based on only the token length, which seems likely to cause problems in the future. Could we tweak the v2 token format so there is a way to unambiguously indicate a token is salted?

We could, although length==40 seems unambiguous. What kind of problems do you have in mind?

Because we don't specify the plain token is required to have a certain length that != 40, it would be legal to have an unsalted token that is 40 characters long. If we're going to rely on a particular behavior we should document it.

  • Going a bit further, did we ever investigate using something like JSON Web Tokens (JWT) for Arvados tokens? This would provide a standard structured format for token payload and metadata.

We could investigate that.

I'll do a little research and see if it makes sense for us.

Actions #20

Updated by Peter Amstutz over 5 years ago

Also: please add a documentation page under "Architecture" that describes the current state of federation features.

Actions #21

Updated by Peter Amstutz over 5 years ago

Also there should be a documentation page describing how to install & configure arvados-controller (even if it is marked EXPERIMENTAL).

Actions #22

Updated by Peter Amstutz over 5 years ago

Comments about testing:

  • Looks like we test GET and PATCH, what about POST, DELETE, or weird methods like OPTIONS?
  • It looks like TestRemoteError tests for connection error, how about a test where the remote returns 404?
  • Maybe errors talking to clusters be returned as "Bad Gateway" not "Internal Server Error" ?
Actions #23

Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

  • Missing support for looking up remote clusters by DNS.

The API server supports it already, so why wouldn't we do it now?

The API server supports accepting and validating tokens issued by *.arvadosapi.com. Here, a wildcard would support proxying requests to *.arvadosapi.com. Both are "looking up by DNS" but they are different features.

I guess the reason not to do it now would be that we have other things to do now. It might be better to think about this in the context of #13619.

  • SaltToken() returns ErrObsoleteToken for "plain" tokens, so it depends on the API server giving out "v2/" tokens (or the client converting them to 'v2' tokens) but I don't think the API server does that yet.

That's right, that's where the "cluster C" restriction above comes from.

That doesn't address my comment, if you get a "classic" token from cluster A, you still can't use it for federation unless some as-yet-unwritten client code converts it to "v2/" format first. So it won't actually work?

Aha. I thought "it" in your comment referred to SaltToken(), I didn't know you were thinking of the "token issued by A" scenario.

Indeed, for that to work, we need to either fix the FIXME in saltAuthToken(), or start issuing v2 tokens (and make sure Workbench recognizes/salts them properly).

       if err == auth.ErrObsoleteToken {
               // FIXME: If the token exists in our own database,
               // salt it for the remote. Otherwise, assume it was
               // issued by the remote, and pass it through
               // unmodified.
               token = creds.Tokens[0]

If we're going to rely on a particular behavior we should document it.

Agreed. Added API token format

Actions #24

Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

Also there should be a documentation page describing how to install & configure arvados-controller (even if it is marked EXPERIMENTAL).

Yes, this is happening on #13497. Waiting for ops feedback about Installing controller service.

  • Looks like we test GET and PATCH, what about POST, DELETE, or weird methods like OPTIONS?

Will add.

  • It looks like TestRemoteError tests for connection error, how about a test where the remote returns 404?

TestNoAccess tests this (get object with no read access = 404).

  • Maybe errors talking to clusters be returned as "Bad Gateway" not "Internal Server Error" ?

Sounds good.

Actions #25

Updated by Tom Clegg over 5 years ago

13493-federation-proxy @ 66fc74001e777ac7ceff2b02cfc459b1368f42f3 https://ci.curoverse.com/job/developer-run-tests/809/
  • upgrade locally-issued v1 tokens to v2 so they can be salted for proxying (it's still desirable to move to issuing v2 tokens, but I don't want that and the associated Workbench updates to block this -- also, even after we do that, the transparent upgrade will allow old tokens to work as well as new tokens)
  • test OPTIONS method (and updated apiserver to return an empty body instead of "-")
Actions #26

Updated by Peter Amstutz over 5 years ago

s.remoteServer.Server.Handler.ServeHTTP(rec, req) // direct to remote -- can't proxy a create req because no uuid

You could look for workflow.owner_uuid and use the the cluster id from that. (Not going to block the branch on this, but we should think about it.)

        err = db.QueryRowContext(req.Context(), `SELECT uuid FROM api_client_authorizations WHERE api_token=$1 AND (expires_at IS NULL OR expires_at > current_timestamp) LIMIT 1`, aca.APIToken).Scan(&aca.UUID)

This line is really long

   def empty
-    render text: "-" 
+    render text: "" 
   end

This is awesome.

So far this looks good. I'm going to do some manual testing / playing around with arvbox.

Actions #27

Updated by Tom Clegg over 5 years ago

  • Target version changed from 2018-07-18 Sprint to 2018-08-01 Sprint
Actions #28

Updated by Peter Amstutz over 5 years ago

I'm happy to say after a bunch or arvbox hacking (adding support for arvados-controller) I was able to do a federated query between two arvbox instances! Both the "arv" command line and workbench were able to retrieve a record from a federated instance.

LGTM!

Actions #29

Updated by Tom Clegg over 5 years ago

  • Target version changed from 2018-08-01 Sprint to 2018-08-15 Sprint
Actions #30

Updated by Tom Clegg over 5 years ago

Peter Amstutz wrote:

Also: please add a documentation page under "Architecture" that describes the current state of federation features.

The architecture section seems like a good place to explain how federation works under the hood, but would it be more appropriate to put the "feature is available" info in a usage context?

I suppose these also have a dependency on docs for setting up the relevant parts of the cluster config and Rails API config.

Actions #31

Updated by Peter Amstutz over 5 years ago

  • Related to Feature #13993: [API] Fetch remote-hosted collection by UUID added
Actions #32

Updated by Tom Clegg over 5 years ago

  • Assigned To deleted (Tom Clegg)
  • Target version changed from 2018-08-15 Sprint to 2018-09-05 Sprint
Actions #33

Updated by Tom Clegg over 5 years ago

  • Assigned To set to Peter Amstutz
Actions #34

Updated by Peter Amstutz over 5 years ago

  • Target version changed from 2018-09-05 Sprint to 2018-09-19 Sprint
Actions #35

Updated by Peter Amstutz over 5 years ago

13493-document-federation @ 5d8eb6ead9e2ca32f424a8385979485a902cf09f

First pass at documenting the mechanics of federation: cluster ids, salted tokens, proxying record & keep block requests.

Actions #36

Updated by Tom Clegg over 5 years ago

There's a lot of information here about how things have been implemented under the hood, which is fine if we can keep it up to date.

We should probably mention that most of the federation story isn't actually implemented/supported yet -- just some specific parts, and with certain simplifying assumptions -- and that the behaviors described here should not be interpreted as a stable API.

This seems like a good place to put some (badly needed) sensible rules for choosing a cluster ID.
  • For throw-away testing purposes, use "z****"
  • For experimental/local clusters that will never be referenced by the outside world, use "x****"
  • For long-lived clusters, contact us to get an ID that won't conflict with anyone else in the future
Actions #37

Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

There's a lot of information here about how things have been implemented under the hood, which is fine if we can keep it up to date.

We should probably mention that most of the federation story isn't actually implemented/supported yet -- just some specific parts, and with certain simplifying assumptions -- and that the behaviors described here should not be interpreted as a stable API.

This seems like a good place to put some (badly needed) sensible rules for choosing a cluster ID.
  • For throw-away testing purposes, use "z****"
  • For experimental/local clusters that will never be referenced by the outside world, use "x****"
  • For long-lived clusters, contact us to get an ID that won't conflict with anyone else in the future

13493-document-federation @ 898cd195f8b5916ed9c3a83dec212e858bfa610f updated taking these comments into account, also adds missing diagram

Actions #38

Updated by Tom Morris over 5 years ago

  • Target version changed from 2018-09-19 Sprint to 2018-10-03 Sprint
Actions #39

Updated by Tom Clegg over 5 years ago

LGTM, thanks

Actions #40

Updated by Peter Amstutz over 5 years ago

  • Status changed from In Progress to Resolved
Actions #41

Updated by Tom Morris over 5 years ago

  • Release set to 14
Actions

Also available in: Atom PDF