Keep S3 gateway¶
- Status: DRAFT. Numerous open questions.
- See Keep service hints for more background.
Overview¶
Objective¶
The Keep S3 gateway is a Keep-compatible interface to Amazon S3. It allows programs like Workbench, arv-get, and arv-mount to read data that is stored in S3, without adding any S3-specific code.
Currently, it does not address writing to S3. It is useful in situations where some data is already stored in S3 -- and should continue to be stored only in S3, rather than being copied into local Keep storage -- and that data is to be used by Arvados programs: for example, running a Crunch job that uses publicly available S3-hosted datasets as input.
High level design¶
Remotely stored data is made available from a given Arvados installation by a gateway server process, similar to keepstore, but running with options like -volume=s3:/mapping-store-path:/s3-credentials-path instead of -volumes=/tmp/1,/tmp/2.
Operations to support:
- Given an S3 bucket and optional prefix/object, read the data from S3, update the {locator, S3 segment} map, and return signed block locators to the client.
- Given an S3 bucket and optional prefix/object, create a collection that references the S3 data and return the collection UUID. (This is more suitable for larger datasets because the data transfer can be done asynchronously after the collection UUID has been returned to the client.)
- Given a locator, read the data from S3 and return it to the client.
Specifics¶
Detailed design¶
API for writing¶
POST /manifest_text
- read objects from S3 and add/update map entries. Respond with a manifest that references the indexed data (see the example exchange below).
- If the request body is of the form {"S3Path":"s3://abucket/aprefix/anobject"} -- read segments (up to 64 MiB each) from the specified object and construct a manifest with a single file.
- If the request body is of the form {"S3Path":"s3://abucket/aprefix/"} or {"S3Path":"s3://abucket/"} -- read all objects (with the given prefix, if any) from the bucket and construct a manifest with one file per object read.
- It is easy to make a request that takes a long time and generates lots of network traffic. At a minimum, the worker must exit if the client closes the connection.
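For comparison with the POST /collection example below, a request might look like this (the host name and S3 path are the same placeholders used elsewhere on this page):

    POST /manifest_text HTTP/1.1
    Host: zzzzz.arvadosapi.com
    Content-type: application/json

    {"S3Path":"s3://abucket/aprefix/anobject"}

The response body would be a signed manifest along these lines -- the hash, size, signature, and +K gateway hint values here are made up, and whether the manifest is returned as plain text or wrapped in JSON is not settled by this design:

    . 37b51d194a7513e45b56f6524f2d51f2+67108864+Kzzzzz-bi6l4-0123456789abcde+Aabcdef0123456789abcdef0123456789abcdef01@565f7801 0:67108864:anobject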
POST /collection
- read objects from S3, add/update map entries, and add the objects as files in a new collection. Respond with the UUID of the new collection.
- Request body is a JSON-encoded hash with an "S3Path" key, as with POST /manifest_text.
- If the request body hash has a "collection" key, its value must be a hash, and it will be passed to arvados.v1.collections.create when creating the new collection. (This sets the parent project, name, and description of the new collection, for example.) This "collection" hash must not contain a "manifest_text" key.
Example request:
    POST /collection HTTP/1.1
    Host: zzzzz.arvadosapi.com
    Content-type: application/json

    {
     "S3Path":"s3://abucket/aprefix/anobject",
     "collection":{
      "owner_uuid":"zzzzz-j7d0g-12345abcde12345",
      "name":"Something stored in S3"
     }
    }
Response:
    HTTP/1.1 200 OK
    Content-type: application/json

    {
     "uuid":"zzzzz-4zz18-abcde12345abcde"
    }
DELETE /collection/{uuid}
- stop working on a job that was started with POST /collection.
- Returns 404 if a collection with the specified uuid does not exist (according to the API server, when using the token provided in the DELETE request).
- Returns 200 if a collection with the specified uuid does exist and this server is not doing any work on it (regardless of whether work was being done before the DELETE request).
- Returns 5xx if an error prevented the server from asserting either of the above cases (e.g., work could not be cancelled, or there was an error looking up the collection).
API for reading¶
GET /{locator}
- Look up the given locator in the map. Retrieve the data segment from S3, and return it to the client. Return an error if the hash of the retrieved data does not match the locator.
- Verify the signature portion of the locator (+Ahash@timestamp) before doing anything else. (A sketch of this read path follows the list.)
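To make the ordering of those steps concrete, here is a minimal sketch of the read path in Go. Everything in it is illustrative: the keepgw package, the segment type, and the verifySignature, lookupSegments, and fetchS3Segment helpers are assumptions for this sketch, not existing Arvados code.

    // Illustrative read path for GET /{locator}. All names are
    // placeholders, not part of the existing keepstore/keepproxy code.
    package keepgw

    import (
        "crypto/md5"
        "fmt"
        "net/http"
        "strings"
    )

    // segment identifies one remote object segment, recorded in the local
    // map as "offset:length:remote_object_path" (see the next section).
    type segment struct {
        Offset, Length int64
        Path           string // e.g. "abucket/aprefix/anobject"
    }

    func getBlockHandler(w http.ResponseWriter, r *http.Request) {
        locator := strings.TrimPrefix(r.URL.Path, "/")
        hash := strings.SplitN(locator, "+", 2)[0]

        // Verify the +A...@... signature before doing anything else.
        if !verifySignature(locator) {
            http.Error(w, "invalid or expired signature", http.StatusForbidden)
            return
        }

        // Try each remote object known to match this hash; skip any that
        // have disappeared or changed since they were indexed.
        for _, seg := range lookupSegments(hash) {
            data, err := fetchS3Segment(seg)
            if err != nil {
                continue
            }
            if fmt.Sprintf("%x", md5.Sum(data)) != hash {
                continue
            }
            w.Write(data)
            return
        }
        http.Error(w, "no matching remote object", http.StatusNotFound)
    }

    // Placeholders for the signature check, local map lookup, and S3 GET.
    func verifySignature(locator string) bool        { return false }
    func lookupSegments(hash string) []segment       { return nil }
    func fetchS3Segment(seg segment) ([]byte, error) { return nil, nil }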
Mapping locators to S3 objects¶
The {hash, remote object} mapping can be stored in the local filesystem.
- A given hash can map to more than one remote object. It's worth remembering all such remote objects: if one disappears or changes, a different one should be attempted next. Suggestion: for each hash, keep a text file with one line per remote data object matching the hash.
- When remote objects are bigger than 64 MiB, the mapping will actually be {hash, remote object segment}. This should be easy to manage if remote object references are always stored as "offset:length:remote_object_path" (see the example below).
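As a concrete illustration only (the hash, offsets, and object paths are made up), the map file for hash 37b51d194a7513e45b56f6524f2d51f2 might contain one candidate remote segment per line:

    0:67108864:abucket/aprefix/anobject
    0:67108864:otherbucket/mirror-of-aprefix/anobject

If a read finds that the first object has disappeared or no longer matches the hash, the gateway moves on to the next line.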
Code Location¶
The source code for the server command will be in /services/keepgw-s3.
- keepstore logs & answers client queries, verifies hashes, answers index/status queries, reads/writes data blocks on disk, enforces per-disk mutexes.
- keepproxy logs & answers client queries, verifies hashes, connects to other keep services.
- keepgw logs & answers client queries, verifies hashes, answers index/status queries, reads/writes a local {hash, remote object} index, connects to remote services.
- Refactor the keepstore command to consist of just the "unix volume" code; move everything else into packages like keep_server and hash_checking_reader. Create a new keepgw-s3 command.
- Extend the keepstore command to use backing-store modules like -volume=unix:/foo and -volume=s3:bucketid.
- Extend the keepproxy command to use backing-store modules like S3 as an alternative to keep disk services.
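Common to all three options above is some notion of a backing-store module that both the existing unix-volume code and a new S3 backend could implement. The sketch below is illustrative only -- the interface name and method set are assumptions for this page, not the actual keepstore Volume API:

    // Illustrative backing-store interface for -volume=unix:... and
    // -volume=s3:... modules; not the existing keepstore code.
    package keepgw

    import "io"

    type BlockStore interface {
        // Get returns the data for a block hash, or an error if no
        // stored (or mapped remote) copy still matches that hash.
        Get(hash string) (io.ReadCloser, error)
        // Index writes one line per stored block whose hash begins
        // with prefix, for use by index queries.
        Index(w io.Writer, prefix string) error
        // Status reports backend health for status queries.
        Status() error
    }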
Open questions and risks¶
How does the gateway know its own UUID so it can write the appropriate +Kuuid locator hints when constructing a manifest?
How does a client know how much progress has been made on a "POST /collection" request? The worker could update the collection object each time an object is written, or each time a locator (64 MiB segment) is indexed, and this behavior could be toggled during the initial API call. But how does a caller know whether the work is finished?
- The collection "expires_at" attribute could be set to some non-null value, to indicate that the collection is ephemeral, until it is complete. This would help avoid accidental use of partially-written collections. It would also provide automatic clean-up of partially written collections, but still permit "resume" (assuming "resume" starts before the expires_at time arrives).
- How should the worker communicate the expected total collection size, number of blocks/files, or finish time? This could be written in the collection's properties hash under a key chosen by convention; a possible shape is sketched below. (In the pathological case where the client provides a conflicting key in {"collection":{"properties":{...}}}, progress information would be unavailable.)
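For example, the worker might record progress under a single conventional key. The key name and fields here are hypothetical, purely to show the idea:

    {
     "properties":{
      "keepgw_s3_progress":{
       "objects_total":120,
       "objects_done":37,
       "bytes_total":8589934592,
       "bytes_done":2617245696,
       "finished":false
      }
     }
    }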
"POST /collection"
request has been abandoned?
- The worker could delete the collection in case of error, but this would make "resume" impossible.
"POST /manifest_text"
request?
- Use HTTP chunked transfer encoding to return the manifest text one token at a time? (This could also help detect closed connections sooner.)
The DELETE /collection/{uuid} API cancels a worker thread, but at face value looks like it will delete a collection. (In a sense it can delete a collection, by cancelling work and leaving the partial collection with a non-null expires_at value, but if the job is finished, the effect is nothing at all like "delete collection".) Perhaps it should be renamed to something more like DELETE /queue/{uuid}? Perhaps it should have different responses for "cancelled as a result of this request" and "already cancelled or was never happening"?
{ "locator" : [{"S3Path":"bucket/prefix/object","credential_id":"local_credential_set_id"}, ...], ... }
local_credential_set_id
could be the hash (and filename of local cache) of a key pair.
Other than by having admin privileges, how can a client establish permission to use S3 credentials (which are already known by the gateway server) during a POST request?
Future work¶
A "resume" API would be useful.