Federated collections » History » Version 7
Tom Clegg, 08/09/2018 07:03 PM
h1. Federated collections

In a federation, a client on cluster A can read a collection that is hosted on cluster B. Cluster A pulls the metadata and file content from cluster B as needed. The client's behavior is exactly the same as it is for collections hosted on cluster A.

Cases:
* Read collection by uuid
* Read collection by pdh
* Update collection by uuid (not covered here yet; needs a strategy for writing the data through to the remote cluster)

h2. Differences from federated workflow retrieval

If the collection is requested from cluster A with @GET /arvados/v1/collections/{uuid}@, cluster A can proxy the request to cluster B, using the same approach used for workflows in #13493.

If the collection is requested from cluster A with @GET /arvados/v1/collections/{pdh}@, and cluster A does not have a matching collection, it can scan remote clusters until it finds one.

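The by-PDH lookup could be sketched as follows. This is illustrative only: the function and parameter names are mine, not from the Arvados codebase, and the real controller would use salted tokens and HTTP requests rather than callables.

```python
def get_collection_by_pdh(pdh, local_index, remote_clusters, fetch_remote):
    """Sketch of by-PDH retrieval with remote fallback.

    local_index: dict mapping PDH -> collection record (stand-in for the local DB).
    remote_clusters: iterable of remote cluster IDs from the discovery doc.
    fetch_remote: callable(cluster_id, pdh) -> record or None (stand-in for a proxied request).
    """
    record = local_index.get(pdh)
    if record is not None:
        return record
    # No local match: scan remote clusters until one has the collection.
    for cluster_id in remote_clusters:
        record = fetch_remote(cluster_id, pdh)
        if record is not None:
            return record
    raise KeyError("collection %s not found on any federated cluster" % pdh)
```

Note that, unlike the by-UUID case, the PDH gives no hint about which cluster to ask, so the scan order (and any short-circuiting) is a policy decision.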
Once the collection is retrieved, the client also needs to read the data blocks. Without some additional mechanism, this won't work: the local keepstore servers will reject the blob signatures provided by the remote cluster, and they generally won't have the requested data anyway.

h2. Remote data hints

If cluster A uses a salted token to retrieve a collection from cluster B, cluster B provides a signed manifest:

<pre>
. acbd18db4cc2f85cedef654fccc4a4d8+3+Aabcdef@12345678 0:3:foo.txt
</pre>

Cluster A propagates cluster B's signature but includes the remote cluster ID:

<pre>
. acbd18db4cc2f85cedef654fccc4a4d8+3+Abbbbb-abcdef@12345678 0:3:foo.txt
</pre>

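A minimal sketch of this rewriting step (the helper name is mine, and it assumes signatures and timestamps are hex strings, as in the examples above):

```python
import re

def add_remote_hint(manifest_text, cluster_id):
    """Tag every +A<sig>@<timestamp> hint in a manifest with the ID of the
    remote cluster that issued it, e.g. +Aabcdef@12345678 from cluster
    bbbbb becomes +Abbbbb-abcdef@12345678. Illustrative sketch only."""
    return re.sub(r'\+A([0-9a-f]+@[0-9a-f]+)',
                  r'+A%s-\1' % cluster_id,
                  manifest_text)
```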
Any keepstore service on cluster A will be able to fetch the block from cluster B:
* Look up bbbbb in the remote cluster list in the local discovery doc
* Look up bbbbb's keepproxy address in bbbbb's discovery doc
* Fetch <code>https://{keepproxy}/acbd18db4cc2f85cedef654fccc4a4d8+3+Aabcdef@12345678</code>

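The three lookup steps could be sketched like this. The discovery-doc keys (@remoteClusters@, @keepproxyAddress@) and the helper names are assumptions for illustration, not the real Arvados discovery document schema:

```python
def remote_block_url(locator, local_discovery, get_discovery_doc):
    """Sketch: turn a locator carrying a remote hint like
    hash+size+A<clusterid>-<sig>@<ts> into the URL to fetch from the
    remote cluster's keepproxy. Key names are hypothetical."""
    # Find the signature hint that carries a cluster-ID prefix.
    hint = next(h for h in locator.split('+')
                if h.startswith('A') and '-' in h)
    cluster_id, sig = hint[1:].split('-', 1)
    # 1. Look up the remote cluster in the local discovery doc.
    remote = local_discovery['remoteClusters'][cluster_id]
    # 2. Look up that cluster's keepproxy address in its discovery doc.
    keepproxy = get_discovery_doc(remote['host'])['keepproxyAddress']
    # 3. Rebuild the locator with the remote cluster's original signature.
    plain = locator.replace(hint, 'A' + sig)
    return 'https://%s/%s' % (keepproxy, plain)
```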
h2. Remote signature hint

Possible syntaxes:
* acbd18db4cc2f85cedef654fccc4a4d8+3+Abbbbb-bcdefa@12345678
* acbd18db4cc2f85cedef654fccc4a4d8+3+Rbbbbb-bcdefa@12345678

The chosen syntax must support having both local and remote signatures on a single locator. This can help a sophisticated (future) controller communicate securely to keepstore, on a per-block or per-collection basis, whether keepstore should skip contacting the remote cluster when returning remote data that also happens to be stored locally.
* acbd18db4cc2f85cedef654fccc4a4d8+3+Abbbbb-bcdefa@12345678+Aabcdef@12345678
* acbd18db4cc2f85cedef654fccc4a4d8+3+Rbbbbb-bcdefa@12345678+Aabcdef@12345678

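A parser for the second proposed syntax (@+R@ for remote, @+A@ for local) might look like the sketch below; the function name is illustrative and assumes 5-character cluster IDs and hex signatures/timestamps, matching the examples above:

```python
import re

def parse_hints(locator):
    """Sketch: split a block locator's hints into remote signatures
    (list of (cluster_id, sig@timestamp) tuples) and local signatures
    (list of sig@timestamp strings). Assumes the +R/+A syntax above."""
    remote, local = [], []
    for hint in locator.split('+')[1:]:
        m = re.match(r'R([a-z0-9]{5})-([0-9a-f]+@[0-9a-f]+)$', hint)
        if m:
            remote.append((m.group(1), m.group(2)))
            continue
        m = re.match(r'A([0-9a-f]+@[0-9a-f]+)$', hint)
        if m:
            local.append(m.group(1))
    return remote, local
```

With this shape, keepstore can prefer a local signature when one is present and fall back to the remote-fetch path otherwise.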
h2. Optimization: Data cache on cluster A

A keepstore service on cluster A, when proxying a GET request to cluster B, has some opportunities to conserve network resources:
# Before proxying, check whether the block exists on a local volume. If so:
## Request a content challenge from the remote cluster to ensure the remote cluster does in fact have the data. (This can be skipped if cluster A trusts cluster B to enforce data access permissions.)
## Return the local copy.
# When passing a proxied response through to the client, write the data to a local volume as well, so it can be returned more efficiently next time.

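The flow above could be sketched as follows; the volume/remote objects and the @has_block@ content-challenge call are hypothetical stand-ins, not real keepstore interfaces:

```python
def get_block(block_hash, local_volume, remote, trust_remote_permissions=False):
    """Sketch of proxy-with-cache: serve a remote block from a local
    volume when possible, otherwise proxy and cache the response."""
    data = local_volume.get(block_hash)
    if data is not None:
        # Step 1: block exists locally. Challenge the remote cluster to
        # prove it has the content, unless we trust it to enforce
        # permissions; then return the local copy.
        if trust_remote_permissions or remote.has_block(block_hash):
            return data
    # Step 2: proxy to the remote cluster and write through to a local
    # volume so the next request is served locally.
    data = remote.get_block(block_hash)
    local_volume.put(block_hash, data)
    return data
```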
h2. Optimization: Identical content exists on cluster A

When proxying a "get collection by UUID" request to cluster B, cluster A might notice that the PDH returned by cluster B matches a collection stored on cluster A. In this case, all data blocks are already stored locally: cluster A can replace cluster B's signatures with its own, and the client will end up reading the blocks from local volumes.

To avoid an information leak, a configuration setting can restrict this optimization to cases where the caller's token has permission to read the existing local collection.

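This optimization, including the permission check, could be sketched as below; @sign_locally@ and the record/index shapes are hypothetical stand-ins for the controller's real signing and lookup machinery:

```python
import re

def maybe_localize(remote_record, local_index, token_can_read, sign_locally):
    """Sketch: if cluster B's collection has the same PDH as a local
    collection and the caller may read that local collection, strip the
    remote signatures and re-sign the manifest locally."""
    pdh = remote_record['portable_data_hash']
    local = local_index.get(pdh)
    if local is not None and token_can_read(local):
        # Remove +A/+R signature hints, then apply local signatures.
        stripped = re.sub(r'\+[AR][^ +]*', '', remote_record['manifest_text'])
        remote_record['manifest_text'] = sign_locally(stripped)
    return remote_record
```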
h2. Implementation

* #13993 [API] Fetch remote-hosted collection by UUID
* #13994 [Keepstore] Fetch blocks from federated clusters