Feature #13994

[Keepstore] Fetch blocks from federated clusters

Added by Peter Amstutz over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
09/12/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0
Release:
Release relationship:
Auto

Description

Proxy requests for blocks stored on remote clusters.
  • To service a GET request with a remote data hint (and no local signature), skip the usual volume scan and use the usual Keep client logic to fetch the block from the remote cluster indicated by the hint.

See Federated collections.

This issue does not include cache optimizations (e.g., writing a copy of the retrieved data on a local volume).


Subtasks

Task #14175: Review 13994-proxy-remoteResolvedPeter Amstutz


Related issues

Related to Arvados - Feature #13993: [API] Fetch remote-hosted collection by UUIDResolved09/07/2018

Associated revisions

Revision 8c30c649
Added by Tom Clegg about 1 year ago

Merge branch '13994-proxy-remote'

refs #13994

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Peter Amstutz over 1 year ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz over 1 year ago

  • Status changed from In Progress to New

#3 Updated by Peter Amstutz over 1 year ago

  • Related to Feature #13993: [API] Fetch remote-hosted collection by UUID added

#4 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#5 Updated by Peter Amstutz over 1 year ago

For the purposes of completeness.

Another caching strategy that was discussed was for the API server to sign blocks with both a local signature and remote signature, and keepstore wolud return a locally stored block (if available) or retrieve from remote store.

  • The API server can efficiently determine if a collection (by PDH) is known and readable by the user. So, arvados-controller could look at the remote collection record PDH and check if a collection with the same PDH is already available locally. However this means to get the benefit of caching requires explicitly copying the collection, and there's no benefit without a local collection with the same PDH.
  • A (new) lookup table in the API server could map blocks to collection PDH. This would make it possible to determine at a block level which blocks are reable by the user. This helps if a remote collection shares some blocks with local collections but are not exactly the same. There are no cache benefits if none of the blocks are known to the API server.
  • The mapping of (cluster, block, signature) could be stored in the postgres database instead of locally to keepstore.

#6 Updated by Peter Amstutz over 1 year ago

  • Description updated (diff)

#7 Updated by Tom Morris over 1 year ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 3.0
  • Release set to 14

#8 Updated by Tom Clegg over 1 year ago

  • Description updated (diff)

#10 Updated by Tom Morris over 1 year ago

  • Assigned To set to Tom Clegg
  • Target version changed from Arvados Future Sprints to 2018-09-19 Sprint

#11 Updated by Tom Clegg about 1 year ago

13994-proxy-remote @ 55f6178b9a9a0d165e952eeec9a04d0234299397

As discussed offline, we're acknowledging this won't be especially useful until API starts handing out V2 tokens (#14196).

TODO: document "deploy your /etc/arvados/config.yml on keepstore nodes" in install + upgrading pages.

#12 Updated by Peter Amstutz about 1 year ago

            remote, ok := cluster.RemoteClusters[remoteID]

It occurred to me that this won't work for the case of finding clusters by DNS, as opposed to explicitly configured ones. Shall we introduce cluster.GetRemoteCluster(remoteID) instead of cluster.RemoteClusters[remoteID] ?

#13 Updated by Tom Clegg about 1 year ago

  • Status changed from New to In Progress

Sure. Fetching by PDH also won't work with clusters that aren't explicitly configured by DNS. I think at this point we should stay focused on the cases where the remotes know about one another in advance.

#14 Updated by Peter Amstutz about 1 year ago

Took a while to track this one down (manual testing) -

It turns out that the API server signs blocks using the only last part of the v2 token (specifically, current_api_client_authorization.api_token only contains the secret part), but keepstore checks signatures using the entire v2 token (uuid and all). As a result the remote keepstore returns 403 Forbidden. We need to choose which signing scheme is correct (and make sure it is documented.) Personally I think it would be more consistent if it continued to use the entire token string, which means fixing the API server.

Also, the 403 Forbidden (returned by the remote keepstore) gets turned into a 404 by the local keepstore, which is misleading. For debugging sanity, it should probably return 403.

#15 Updated by Tom Clegg about 1 year ago

Yes, signing with the entire string seems simpler -- it means many clients don't need to understand the token format at all. I think this is just a matter of remembering the supplied token in Thread.current (source:services/api/app/middlewares/arvados_api_token.rb) and using that instead of a_c_a.api_token when signing (source:services/api/app/models/collection.rb).

#16 Updated by Tom Clegg about 1 year ago

13994-proxy-remote @ 8e7c30852f1cf244ae3c58e93acea705739e8625 https://ci.curoverse.com/view/Developer/job/developer-run-tests/896/
  • api uses entire v2 token to make/check signatures

#17 Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote:

13994-proxy-remote @ 8e7c30852f1cf244ae3c58e93acea705739e8625 https://ci.curoverse.com/view/Developer/job/developer-run-tests/896/
  • api uses entire v2 token to make/check signatures

tools/keep-block-check test is failing which seems like an awful cooincidence

#19 Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote:

13994-proxy-remote @ d9224c40587a8d3617e7be01f3bb7f801c4b52e4 https://ci.curoverse.com/view/Developer/job/developer-run-tests/897/
  • revisit async RefreshServiceDiscovery change

Tests are passing, except a PhantomJS error.

#20 Updated by Peter Amstutz about 1 year ago

In ddc180fa5200d9d8fac59cc5041d7d452b68e6a2

In the previous code, in what cases would the "default" branch of select have been activated, and why is that no longer necessary in the new commit?

If ent.clear is closed, does this panic? (That would prevent wg.Done() from being called).

#21 Updated by Tom Clegg about 1 year ago

The default case of a select runs if none of the channels are ready -- in this case, if the channel (size=1) already had a value in it, a refresh was already pending, so there was no need to request another.

The previous change tried to unblock the update function, adding the select and thereby changing the semantics from "request a refresh, and block if needed to ensure previous cache will not be used any more" to just "request a refresh". But the keep-block-check tests were relying on the second part, so this reverts to the blocking version.

If someone closed ent.clear then yes, sending to the closed channel would panic, and the program would exit without calling Done().

#22 Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote:

The default case of a select runs if none of the channels are ready -- in this case, if the channel (size=1) already had a value in it, a refresh was already pending, so there was no need to request another.

The previous change tried to unblock the update function, adding the select and thereby changing the semantics from "request a refresh, and block if needed to ensure previous cache will not be used any more" to just "request a refresh". But the keep-block-check tests were relying on the second part, so this reverts to the blocking version.

If someone closed ent.clear then yes, sending to the closed channel would panic, and the program would exit without calling Done().

Since it is in a goroutine, I don't believe it would cause the program to exit, wouldn't it crash the goroutine, so then the caller would be deadlocked? How about this?

        wg.Add(1)
        go func() {
            defer wg.Done()
            ent.clear <- struct{}{}
        }()

(Sorry to harp on this, I think I understand what you're trying to do but not exactly why it was needed for this particular branch.)

#23 Updated by Tom Clegg about 1 year ago

#24 Updated by Peter Amstutz about 1 year ago

Tom Clegg wrote:

An unrecovered panic really does crash the program. https://golang.org/ref/spec#Handling_panics https://play.golang.org/p/iUzMTzUckzF

Got it.

This LGTM, please merge.

#25 Updated by Tom Clegg about 1 year ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF