Peter Amstutz, 08/15/2018 07:52 PM

h1. Federated collections

In a federation, a client on cluster A can read a collection that is hosted on cluster B. Cluster A pulls the metadata and file content from cluster B as needed. The client's behavior is exactly the same as it is for collections hosted on cluster A.

* Read collection by UUID
* Read collection by PDH
* Update collection by UUID (not covered here yet; needs a strategy for writing the data through to the remote cluster)

h2. Differences from federated workflow retrieval

If the collection is requested from cluster A with @GET /arvados/v1/collections/{uuid}@, cluster A can proxy the request to cluster B, using the same approach as for workflows in #13493.

If the collection is requested from cluster A with @GET /arvados/v1/collections/{pdh}@, and cluster A does not have a matching collection, it can scan remote clusters until it finds one.
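
The two lookup paths above can be sketched as a single routing decision. This is a minimal illustration, not Arvados code: @pick_cluster@ and the @has_pdh@ callback are hypothetical names, and the sketch assumes that the first five characters of a UUID identify the cluster that hosts the object.

```python
import re

# Assumed identifier shapes: a UUID is <cluster>-<type>-<15 chars>,
# a PDH is <32 hex digits>+<size>.
UUID_RE = re.compile(r"^([a-z0-9]{5})-[a-z0-9]{5}-[a-z0-9]{15}$")
PDH_RE = re.compile(r"^[0-9a-f]{32}\+[0-9]+$")


def pick_cluster(ident, local_id, remote_ids, has_pdh):
    """Return the ID of the cluster that should serve `ident`, or None.

    has_pdh(cluster_id, pdh) is a hypothetical callback that reports
    whether the given cluster has a collection with that PDH.
    """
    m = UUID_RE.match(ident)
    if m:
        # By UUID: the prefix names the hosting cluster directly.
        owner = m.group(1)
        return owner if owner == local_id or owner in remote_ids else None
    if PDH_RE.match(ident):
        # By PDH: try the local cluster first, then scan the remotes in order.
        for cluster_id in [local_id] + list(remote_ids):
            if has_pdh(cluster_id, ident):
                return cluster_id
        return None
    raise ValueError("not a collection UUID or PDH: %r" % ident)
```

If the returned ID differs from the local cluster ID, the request would be proxied to that cluster.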

Once the collection is retrieved, the client also needs to read the data blocks. Without some additional mechanism, this won't work: the local keepstore servers will reject the blob signatures provided by the remote cluster, and they generally won't have the requested data anyway.

h2. Remote data hints

If cluster A uses a salted token to retrieve a collection from cluster B, cluster B provides a signed manifest:

<pre>
. acbd18db4cc2f85cedef654fccc4a4d8+3+Aabcdef@12345678 0:3:foo.txt
</pre>

Cluster A propagates cluster B's signature but rewrites it as a remote cluster signature, replacing the @+A@ prefix with @+R@ followed by cluster B's ID (@bbbbb@ in this example):

<pre>
. acbd18db4cc2f85cedef654fccc4a4d8+3+Rbbbbb-abcdef@12345678 0:3:foo.txt
</pre>
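
The rewrite from a local-style signature to a remote hint could look roughly like this. It is a sketch only: @to_remote_signatures@ is a hypothetical helper, and it assumes the @+R{cluster}-{signature}@{expiry}@ form shown in the example rather than a finalized syntax.

```python
import re

# A block locator token: 32 hex digits, "+size", then optional "+hint" parts.
LOCATOR = re.compile(r"^[0-9a-f]{32}\+[0-9]+(\+\S+)*$")
# A signature hint as issued by the cluster that holds the block.
SIGNATURE = re.compile(r"\+A([0-9a-f]+)@([0-9a-f]+)")


def to_remote_signatures(manifest_text, remote_id):
    """Rewrite every +A signature in a manifest as a +R<remote_id>- hint."""
    def rewrite(token):
        if not LOCATOR.match(token):
            return token  # stream names and file segments stay untouched
        return SIGNATURE.sub(
            lambda m: "+R%s-%s@%s" % (remote_id, m.group(1), m.group(2)),
            token,
        )

    lines = []
    for line in manifest_text.split("\n"):
        lines.append(" ".join(rewrite(t) for t in line.split(" ")))
    return "\n".join(lines)
```

Only tokens that look like block locators are rewritten, so a file name that happens to contain @+A...@ is left alone.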

Any keepstore service on cluster A will be able to fetch the block from cluster B:
* Look up bbbbb in the remote cluster list in cluster A's discovery doc
* Look up bbbbb's keepproxy address in bbbbb's discovery doc
* Fetch <code>https://{keepproxy}/acbd18db4cc2f85cedef654fccc4a4d8+3+Aabcdef@12345678</code>
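
The last step, turning a locator with a remote hint into the URL to fetch, might be sketched as below. The two discovery-doc lookups are collapsed into a plain dictionary here; @remote_fetch_url@, @keepproxy_hosts@, and the host name in the usage note are all hypothetical.

```python
import re

# Remote signature hint: +R<cluster id>-<signature>@<expiry>
REMOTE_HINT = re.compile(r"\+R([a-z0-9]{5})-([0-9a-f]+)@([0-9a-f]+)")


def remote_fetch_url(locator, keepproxy_hosts):
    """Build the keepproxy URL for a locator that carries a +R hint.

    keepproxy_hosts maps a remote cluster ID to its keepproxy host name
    (a stand-in for the two discovery-doc lookups).
    """
    m = REMOTE_HINT.search(locator)
    if m is None:
        raise ValueError("locator has no remote signature hint")
    cluster_id, signature, expiry = m.groups()
    host = keepproxy_hosts[cluster_id]  # KeyError for an unknown cluster
    # Restore the signature to the form the remote cluster issued.
    restored = "+A%s@%s" % (signature, expiry)
    remote_locator = locator[:m.start()] + restored + locator[m.end():]
    return "https://%s/%s" % (host, remote_locator)
```

For example, with @{"bbbbb": "keep.bbbbb.example"}@ as the host map, the locator @acbd18db4cc2f85cedef654fccc4a4d8+3+Rbbbbb-abcdef@12345678@ maps to a URL on @keep.bbbbb.example@ ending in @+Aabcdef@12345678@.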

h2. Remote signature hint

Possible syntaxes:
* acbd18db4cc2f85cedef654fccc4a4d8+3+Abbbbb-bcdefa@12345678
* acbd18db4cc2f85cedef654fccc4a4d8+3+Rbbbbb-bcdefa@12345678

The chosen syntax must support having both local and remote signatures on a single locator. This can help a sophisticated (future) controller communicate securely to keepstore, on a per-block or per-collection basis, whether keepstore should skip contacting the remote cluster when returning remote data that also happens to be stored locally.
* acbd18db4cc2f85cedef654fccc4a4d8+3+Abbbbb-bcdefa@12345678+Aabcdef@12345678
* acbd18db4cc2f85cedef654fccc4a4d8+3+Rbbbbb-bcdefa@12345678+Aabcdef@12345678
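
To illustrate why the two signatures need to be distinguishable, here is a small parsing sketch. It assumes the @+R@ variant (the second syntax in each list) and uses a hypothetical @split_signatures@ helper; the syntax itself is still an open choice in this document.

```python
import re

LOCAL_SIG = re.compile(r"\+A([0-9a-f]+)@([0-9a-f]+)")
REMOTE_SIG = re.compile(r"\+R([a-z0-9]{5})-([0-9a-f]+)@([0-9a-f]+)")


def split_signatures(locator):
    """Extract the local and remote signature hints from one locator.

    Returns a dict with:
      "local":  (signature, expiry) or None
      "remote": (cluster_id, signature, expiry) or None
    """
    local = LOCAL_SIG.search(locator)
    remote = REMOTE_SIG.search(locator)
    return {
        "local": local.groups() if local else None,
        "remote": remote.groups() if remote else None,
    }
```

With a distinct @+R@ prefix, each hint can be recognized on its own without inspecting what follows the prefix.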

h2. Optimization: Data cache on cluster A

A keepstore service on cluster A, when proxying a GET request to cluster B, has some opportunities to conserve network resources:
# Before proxying, check whether the block exists on a local volume. If so:
## Request a content challenge from the remote cluster to ensure the remote cluster does in fact have the data. (This can be skipped if cluster A trusts cluster B to enforce data access permissions.)
## Return the local copy.
# When passing a proxied response through to the client, write the data to a local volume as well, so it can be returned more efficiently next time.
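
The caching steps above can be sketched as follows. All names are hypothetical: @local_volume@ is a plain dictionary standing in for a keepstore volume, and @fetch_remote@ and @remote_confirms@ stand in for the proxied GET and the content challenge. Falling back to a proxied fetch when the challenge fails is an assumption of this sketch, not something the design specifies.

```python
def get_block(block_hash, local_volume, fetch_remote, remote_confirms,
              trust_remote=False):
    """Return block data, preferring a local copy when that is allowed.

    local_volume:    dict mapping block hash -> bytes (stand-in for a volume)
    fetch_remote:    callable(block_hash) -> bytes, the proxied GET
    remote_confirms: callable(block_hash) -> bool, the content challenge
    trust_remote:    skip the challenge if the remote cluster is trusted
    """
    data = local_volume.get(block_hash)
    if data is not None and (trust_remote or remote_confirms(block_hash)):
        # Step 1: a local copy exists and may be served.
        return data
    # Otherwise proxy the request (assumed fallback if the challenge fails).
    data = fetch_remote(block_hash)
    # Step 2: keep a local copy so the next read avoids the network.
    local_volume[block_hash] = data
    return data
```

The second read of the same block is then served from @local_volume@ and only costs a challenge, or nothing at all when @trust_remote@ is set.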

h2. Optimization: Identical content exists on cluster A

When proxying a "get collection by UUID" request to cluster B, cluster A might notice that the PDH returned by cluster B matches a collection stored on cluster A. In this case, all data blocks are already stored locally, so cluster A can replace cluster B's signatures with its own, and the client will end up reading the blocks from local volumes.

To avoid an information leak, a configuration setting can restrict this optimization to cases where the caller's token has permission to read the existing local collection.
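
The decision described here could be expressed as a small predicate. This is only a sketch: @may_resign_locally@, the @local_by_pdh@ index, and the @require_read_permission@ flag are hypothetical stand-ins for whatever lookup and configuration setting an implementation would use.

```python
def may_resign_locally(remote_pdh, local_by_pdh, caller_can_read,
                       require_read_permission=True):
    """Decide whether cluster A may substitute its own block signatures.

    local_by_pdh:    dict mapping PDH -> list of local collection UUIDs
    caller_can_read: callable(uuid) -> bool for the caller's token
    require_read_permission: the configuration setting that restricts the
        optimization to callers who can already read a matching collection
    """
    matches = local_by_pdh.get(remote_pdh, [])
    if not matches:
        return False  # no identical content stored locally
    if not require_read_permission:
        return True
    return any(caller_can_read(uuid) for uuid in matches)
```

When the predicate returns false, the response would keep the remote signature hints and the blocks would be fetched from cluster B as usual.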

h2. Implementation

* #13993 [API] Fetch remote-hosted collection by UUID
* #13994 [Keepstore] Fetch blocks from federated clusters