Keep service hints » History » Version 3

Tom Clegg, 03/24/2015 02:33 PM

h1. Keep service hints

h3. Objective

Clients can use the Keep API to retrieve data that is stored on remote servers. As with local data, permission tokens and data hashes are provided by the API server in a manifest.

h3. Background

When a client reads a data block referenced by a manifest, it requests a list of "available keep services" from the API server and (if there is more than one "disk" service on the list) uses the rendezvous hashing algorithm to select one.
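The selection step can be illustrated with a minimal sketch of generic rendezvous (highest-random-weight) hashing. The weight function (MD5 over the block hash concatenated with the service UUID) and the function names @rendezvous_order@ and @select_service@ are assumptions made for illustration; the exact weighting used by the Arvados SDKs is not specified on this page.

```python
import hashlib


def rendezvous_order(block_hash, service_uuids):
    """Return service UUIDs sorted from most to least preferred for a block.

    Generic rendezvous hashing: every (block, service) pair gets a
    pseudo-random weight, and services are ranked by that weight.
    The MD5-based weight below is an assumption, not the documented one.
    """
    def weight(uuid):
        return hashlib.md5((block_hash + uuid).encode("ascii")).hexdigest()

    return sorted(service_uuids, key=weight, reverse=True)


def select_service(block_hash, service_uuids):
    """Pick the top-ranked service, or None if no services are available."""
    ranked = rendezvous_order(block_hash, service_uuids)
    return ranked[0] if ranked else None
```

Because the ranking depends only on the block hash and the service UUIDs, every client that sees the same service list picks the same service for a given block, regardless of the order in which the list was returned.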

h3. Alternatives

Client libraries could communicate directly with non-Keep services.
* It would be impossible to use Arvados permission controls.
* An N×M matrix of code would have to be maintained to support N backing services from M SDK languages.
* The API server would have to maintain the mapping of hashes to remote data objects (and permissions for this map).

Each keepstore server could know how to communicate with each non-Keep service in use.
* Simpler client code.
* Artificial link between keep disk services and gateway services (they couldn't be independently scaled or shut down for maintenance).
* External clients couldn't be given direct access to the third-party gateway services without also giving them direct access to the disk services.
* Either the keepstore servers would have to keep their hash-to-remote-object mappings synchronized, or the map of hash to remote service would be distributed across various servers. Either approach introduces an unsuitable level of complexity: unlike in a native keepstore system, the underlying data is expected to change over time.
* When encountering an error (notably 404), client code would make many redundant attempts to read from various gateway services, based on the mistaken assumption that the various services have different sets of available data blocks.

h3. High level design

Tools (TBD) can create manifests with @+Kuuid@ hints, referencing data in remote storage services by indicating the UUID of a storage gateway capable of accessing it. _In future work, Arvados can manage these hints actively: for example, data manager could tag blocks with S3 bucket names, and API server could load-balance S3 gateways by selecting one of several available gateway UUIDs for a given block._

Each client library knows that when it sees @+Kuuid@ it should connect to the keep service with the given UUID (instead of using the usual rendezvous hashing algorithm to select a service).

h2. Specifics

A block locator provided by the API server in a manifest might have a hint of the form @+Kuuid@ where @uuid@ is the UUID of a keep service. In order to retrieve the block data, the client should look up the keep service with the given UUID, and perform an HTTP @GET@ request at the appropriate host and port.

* Given @acbd18db4cc2f85cedef654fccc4a4d8+3+K1h9kt-bi6l4-20fty0xbp8l9wwe@,
** Retrieve @https://1h9kt.arvadosapi.com/arvados/v1/keep_services/1h9kt-bi6l4-20fty0xbp8l9wwe@ to determine scheme, host, port
** Retrieve data from @{scheme}://{host}:{port}/acbd18db4cc2f85cedef654fccc4a4d8+3+K1h9kt-bi6l4-20fty0xbp8l9wwe@
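The two steps above can be sketched as pure URL-building functions, with no network access. This is a minimal illustration under stated assumptions: the UUID pattern is inferred from the example locator, the @service@ dictionary with @scheme@, @host@ and @port@ keys is a hypothetical stand-in for whatever the keep service lookup returns, the port is assumed to be joined to the host with a colon, and @gateway.example@ in the usage note is a made-up host.

```python
import re

# Assumed UUID shape, inferred from the example locator: 5-5-15 characters
# drawn from [0-9a-z].
UUID_HINT = re.compile(r"^K([0-9a-z]{5}-[0-9a-z]{5}-[0-9a-z]{15})$")


def service_uuid_hint(locator):
    """Return the keep service UUID from the first +Kuuid hint, or None."""
    # The first "+"-separated field is the block hash; the rest are
    # the size and any hints.
    for part in locator.split("+")[1:]:
        match = UUID_HINT.match(part)
        if match:
            return match.group(1)
    return None


def keep_service_lookup_url(api_host, uuid):
    """Build the API URL used to look up a keep service by UUID."""
    return "https://%s/arvados/v1/keep_services/%s" % (api_host, uuid)


def block_url(service, locator):
    """Build the data URL from a (hypothetical) service description."""
    return "%s://%s:%s/%s" % (
        service["scheme"], service["host"], service["port"], locator)
```

A client would call @service_uuid_hint(locator)@ first; if it returns a UUID, the client fetches @keep_service_lookup_url("1h9kt.arvadosapi.com", uuid)@ and passes the resulting scheme, host and port to @block_url@. If it returns @None@, the usual rendezvous selection applies.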

As before, if a hint of the form @+K{prefix}@ is given (where @{prefix}@ is a string of five characters in @[0-9a-z]@), the client should perform a @GET@ request at @https://keep.{prefix}.arvadosapi.com/{locator}@.

* Given @acbd18db4cc2f85cedef654fccc4a4d8+3+K1h9kt@,
** Retrieve data from @https://keep.1h9kt.arvadosapi.com/acbd18db4cc2f85cedef654fccc4a4d8+3+K1h9kt@
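The prefix form can be told apart from the UUID form by length and shape alone. The sketch below shows one way to do that; the function names @prefix_hint@ and @prefix_block_url@ are hypothetical, and the pattern simply encodes the "five characters in @[0-9a-z]@" rule stated above.

```python
import re

# A prefix hint is "K" followed by exactly five characters in [0-9a-z].
# A full UUID hint contains hyphens and is longer, so it does not match.
PREFIX_HINT = re.compile(r"^K([0-9a-z]{5})$")


def prefix_hint(locator):
    """Return the five-character prefix from a +K{prefix} hint, or None."""
    for part in locator.split("+")[1:]:
        match = PREFIX_HINT.match(part)
        if match:
            return match.group(1)
    return None


def prefix_block_url(locator):
    """Build the data URL for a locator carrying a +K{prefix} hint.

    Returns None when the locator has no prefix hint, in which case the
    caller falls back to the UUID form or to the usual service selection.
    """
    prefix = prefix_hint(locator)
    if prefix is None:
        return None
    return "https://keep.%s.arvadosapi.com/%s" % (prefix, locator)
```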

h2. Future work

Data manager can update manifests to reflect additional locations where data blocks can be retrieved: for example, @+Kuuid1+Kuuid2@ to signify that multiple remote gateways can retrieve the data, or @+K+Kuuid1@ to signify that the data is available locally _and_ via a remote gateway.
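As a rough illustration of how a client might read such multi-hint locators, the sketch below collects every @+K@ hint in order and treats a bare @+K@ as "available locally". Since this is future work, the interpretation is an assumption: the function name @read_sources@, the @"local"@ marker, and the choice to default to local when no @+K@ hint is present are all hypothetical.

```python
def read_sources(locator):
    """Return the places a block might be read from, in hint order.

    Each entry is either the string "local" (for a bare +K hint) or the
    value that follows "K" in the hint (a gateway UUID or prefix).
    A locator with no +K hints is assumed to be local only.
    """
    sources = []
    for part in locator.split("+")[1:]:
        if part == "K":
            sources.append("local")
        elif part.startswith("K"):
            sources.append(part[1:])
    return sources or ["local"]
```

A client could then try each source in turn, stopping at the first successful read.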