Peter Amstutz, 08/01/2018 08:23 PM

h1. Federated collections

* Fetch collection record by uuid
** Use the federated record retrieval strategy that has already been developed.
* Fetch collection record by PDH
** No location hint is available, so distribute the request to all federated clusters and return one of the results.
** Read-only: only the GET operation needs to be supported.
* The result can be cached by PDH.

The record will have a manifest with signed blocks. However, these blocks will be signed for the origin cluster.

The client needs to be able to fetch blocks from the remote cluster.

arvados-controller could add block hints, using an existing feature in the Python and Go SDKs:

* Blocks in a manifest can include a hint of the form "+K@zzzzz". The Python SDK will attempt to fetch the block from "https://keep.zzzzz.arvadosapi.com/".
** Must conform to a particular DNS naming scheme.
** Could be generalized by looking up the cluster in "remote_hosts" and using the "keep_services.accessible" API.
** Every block will be requested from the remote cluster every time because the client contacts the remote server directly, so there is limited opportunity for edge caching.

* A hint can also be the uuid of a "local gateway service". This instructs the client to use a specific service from the keep_services table (indicated by a "service_type" of "gateway:").
** Directs requests through a specific service.
** Does not encode which remote cluster to pull a block from.
** The gateway service could search for a block by sending a request to every federated cluster.
** The gateway service can cache blocks so they don't need to be re-fetched from the remote cluster.
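The two hint forms described above can be told apart by their shape. The following is a minimal sketch, not the SDK's actual code: the function name @classify_hints@ and the @gateway_services@ dict are hypothetical, and the uuid pattern assumes that keep_services uuids look like "zzzzz-bi6l4-" followed by 15 characters.

```python
import re

# Assumption: cluster ids are 5 lowercase alphanumeric characters.
CLUSTER_HINT = re.compile(r"^K@([a-z0-9]{5})$")
# Assumption: keep_services uuids have the form zzzzz-bi6l4-<15 characters>.
SERVICE_HINT = re.compile(r"^K@([a-z0-9]{5}-bi6l4-[a-z0-9]{15})$")


def classify_hints(locator, gateway_services):
    """Return the base URLs suggested by the +K@ hints of a block locator.

    gateway_services is a hypothetical dict mapping a keep_services uuid
    to the base URL of that gateway service.
    """
    roots = []
    # A locator is "<md5>+<size>" followed by "+"-separated hints.
    for hint in locator.split("+")[2:]:
        m = CLUSTER_HINT.match(hint)
        if m:
            # Cluster hint: relies on the DNS naming scheme from the text.
            roots.append("https://keep.%s.arvadosapi.com/" % m.group(1))
            continue
        m = SERVICE_HINT.match(hint)
        if m and m.group(1) in gateway_services:
            # Gateway hint: use the named local service.
            roots.append(gateway_services[m.group(1)])
    return roots
```

Other hints, such as the "+A" signature, are ignored here, so a locator without any "+K@" hint yields an empty list.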

Both "hint" schemes are slightly inelegant because they require repeating the "+K@" hint for every block in the manifest.

We probably want an architecture that makes block caching possible, even if the first-pass implementation doesn't support it. That implies a gateway / proxy service rather than contacting the remote cluster directly. Architecturally, this is also more in line with the arvados-controller design of acting as an intermediary, as opposed to adding federation features to the client.

Proposal:

arvados-controller decorates blocks with "+K@zzzzz" hints, but the client implementation changes: instead of contacting the remote host, the client contacts the local gateway service and requests the block with the cluster hint and the block signature (as returned by the remote cluster).

The local gateway service requests the block from the appropriate cluster and returns the result.

A simple caching strategy would be to copy the block to local keep storage and maintain a mapping from the remote signature(s) to a local signature. If a request comes in for a block that has recently been fetched, the gateway can issue a HEAD request to the remote cluster to verify the signature, and then remember that signature.

Fetching collection flow:

# Running on cluster aaaaa
# The client sends a request for a collection, by PDH, to arvados-controller.
# arvados-controller searches the local database and comes up empty.
# arvados-controller sends a request for the collection by PDH (with a salted token) out to federated clusters bbbbb and ccccc.
# ccccc returns a result.
# arvados-controller decorates the returned record with "+K@ccccc" block hints.
# arvados-controller returns the record to the client.
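The collection flow above can be sketched as follows. This is an illustration under stated assumptions rather than the arvados-controller implementation: @fetch_by_pdh@, @add_cluster_hint@, @local_lookup@ and the @remotes@ dict of per-cluster callables are all hypothetical names, and token salting is assumed to happen inside each remote callable.

```python
import re
from concurrent.futures import ThreadPoolExecutor, as_completed

# A block locator token: 32 hex digits, "+<size>", then optional hints.
LOCATOR = re.compile(r"^[0-9a-f]{32}\+\d+(\+\S+)?$")


def add_cluster_hint(manifest_text, cluster_id):
    """Append "+K@<cluster_id>" to every block locator in a manifest."""
    out = []
    for line in manifest_text.split("\n"):
        tokens = [
            tok + "+K@" + cluster_id if LOCATOR.match(tok) else tok
            for tok in line.split(" ")
        ]
        out.append(" ".join(tokens))
    return "\n".join(out)


def fetch_by_pdh(pdh, local_lookup, remotes):
    """Look up a collection by PDH locally, then on the federated clusters.

    local_lookup: hypothetical callable(pdh) -> record dict or None
    remotes: hypothetical dict of cluster id -> callable(pdh) -> record or None
    """
    record = local_lookup(pdh)
    if record is not None:
        return record
    with ThreadPoolExecutor(max_workers=max(1, len(remotes))) as pool:
        futures = {pool.submit(fn, pdh): cid for cid, fn in remotes.items()}
        # Take the first cluster that answers with a record.
        for fut in as_completed(futures):
            try:
                record = fut.result()
            except Exception:
                continue  # treat an unreachable cluster as "not found"
            if record is not None:
                decorated = dict(record)
                decorated["manifest_text"] = add_cluster_hint(
                    record["manifest_text"], futures[fut]
                )
                return decorated
    return None
```

For simplicity the sketch waits for the outstanding requests to finish before returning; a real implementation would more likely cancel them once one cluster has answered.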

Fetching block flow:

# The client wishes to read a file.
# The client has a signed block locator with a "+K@ccccc" hint.
# The client sends a request to the "gateway" Keep service.
# The gateway Keep service contacts keepproxy on cluster ccccc and requests the block.
# keepproxy on ccccc returns the block content to the gateway.
# The gateway returns the block content to the client.

Fetching block, with caching:

# The client wishes to read a file.
# The client has a signed block locator with a "+K@ccccc" hint.
# The client sends a request to the "gateway" Keep service.
# The gateway service looks up the block in memory / the local database.
## If found, check whether the block signature is cached.
## If the block signature isn't cached, send a HEAD request to ccccc.
## If the signature checks out, fetch the block from the aaaaa local keepstore and return it.
## Otherwise, fail (because the HEAD request must have failed).
# If the block was not found, the gateway Keep service contacts keepproxy on cluster ccccc and requests the block.
# keepproxy on ccccc returns the block content to the gateway.
# The gateway saves the block to aaaaa local keep and records a mapping of remote block+signature to local block+signature (this could be held in memory or in a local database such as sqlite).
# The gateway returns the block content to the client.
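The caching flow above might look roughly like this. It is only a sketch with in-memory state: the class @CachingGateway@ and its four injected callables are hypothetical stand-ins for the requests to the remote keepproxy and to the local keepstore, and the md5 check on fetched content is an added assumption, not part of the flow as written.

```python
import hashlib


def parse_locator(locator):
    """Split a hinted locator into (block hash, cluster id, remote locator)."""
    parts = locator.split("+")
    cluster_id = None
    kept = []
    for i, part in enumerate(parts):
        if i >= 2 and part.startswith("K@"):
            cluster_id = part[2:]  # routing hint, not forwarded to the remote
        else:
            kept.append(part)
    if cluster_id is None:
        raise ValueError("locator has no +K@ cluster hint")
    return parts[0], cluster_id, "+".join(kept)


class CachingGateway:
    def __init__(self, remote_get, remote_head, local_put, local_get):
        # Hypothetical callables supplied by the caller:
        self.remote_get = remote_get    # (cluster_id, locator) -> bytes
        self.remote_head = remote_head  # (cluster_id, locator) -> bool
        self.local_put = local_put      # (bytes) -> locally signed locator
        self.local_get = local_get      # (local locator) -> bytes
        self.local_locator = {}         # block hash -> local signed locator
        self.verified = set()           # remote locators already checked

    def get(self, locator):
        block_hash, cluster_id, remote_locator = parse_locator(locator)
        if block_hash in self.local_locator:
            # Cache hit: make sure this remote signature has been verified.
            if remote_locator not in self.verified:
                if not self.remote_head(cluster_id, remote_locator):
                    raise PermissionError("remote signature check failed")
                self.verified.add(remote_locator)
            return self.local_get(self.local_locator[block_hash])
        # Cache miss: fetch from the remote cluster and keep a local copy.
        data = self.remote_get(cluster_id, remote_locator)
        if hashlib.md5(data).hexdigest() != block_hash:
            raise IOError("block content does not match its hash")
        self.local_locator[block_hash] = self.local_put(data)
        self.verified.add(remote_locator)
        return data
```

Keeping the verified signatures separate from the block mapping means a second client presenting a different signature for the same block triggers one HEAD request to the remote cluster, but no second transfer of the block itself.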