Keep S3 gateway

Overview

Objective

The Keep S3 gateway is a Keep-compatible interface to Amazon S3. It allows programs like Workbench, arv-get, and arv-mount to read data that is stored in S3, without adding any S3-specific code.

Currently, it does not address writing to S3. It is useful in situations where some data is already stored in S3 -- and should continue to be stored only in S3, rather than making a local copy -- and that data is to be used by Arvados programs: for example, running a Crunch job using publicly available S3-hosted datasets as input.

High level design

Within a given Arvados installation, access to remotely stored data is provided by a gateway server process, similar to keepstore but running with options like -volume=s3:/mapping-store-path:/s3-credentials-path instead of -volumes=/tmp/1,/tmp/2.

Operations to support:

  • Given an S3 bucket and optional prefix/object, read the data from S3, update the {locator, S3 segment} map, and return signed block locators to the client.
  • Given an S3 bucket and optional prefix/object, create a collection that references the S3 data and return the collection UUID. (This is more suitable for larger datasets because the data transfer can be done asynchronously after the collection UUID has been returned to the client.)
  • Given a locator, read the data from S3 and return it to the client.

Specifics

Detailed design

API for writing

POST /manifest_text - read objects from S3 and add/update map entries. Respond with a manifest that references the indexed data.
  • If the request body is of the form {"S3Path":"s3://abucket/aprefix/anobject"} -- read segments (up to 64MiB each) from the specified object and construct a manifest with a single file.
  • If the request body is of the form {"S3Path":"s3://abucket/aprefix/"} or {"S3Path":"s3://abucket/"} -- read all objects (with the given prefix, if any) from the bucket and construct a manifest with one file per object read.
  • It is easy to make a request that takes a long time and generates lots of network traffic. At a minimum, the worker must exit if the client closes the connection.
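
The S3Path handling and 64 MiB segmentation described above could be sketched roughly as follows. This is only an illustration, not the gateway's implementation: the function names (parse_s3_path, segment_ranges, http_range_header) are hypothetical, and treating a trailing slash or an empty key as "list the bucket" is an assumption drawn from the two request forms shown above.

```python
SEGMENT_SIZE = 64 * 1024 * 1024  # 64 MiB, the maximum segment size named above


def parse_s3_path(s3_path):
    """Split "s3://bucket/key" into (bucket, key, is_listing).

    Assumption: an empty key or a key ending in "/" means "read every
    object under this prefix"; anything else names a single object.
    """
    scheme = "s3://"
    if not s3_path.startswith(scheme):
        raise ValueError("S3Path must start with s3://")
    bucket, _, key = s3_path[len(scheme):].partition("/")
    if not bucket:
        raise ValueError("S3Path has no bucket name")
    is_listing = key == "" or key.endswith("/")
    return bucket, key, is_listing


def segment_ranges(object_size, segment_size=SEGMENT_SIZE):
    """Return (offset, length) pairs that cover an object of object_size bytes."""
    if object_size < 0:
        raise ValueError("object_size must be non-negative")
    ranges = []
    offset = 0
    while offset < object_size:
        length = min(segment_size, object_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


def http_range_header(offset, length):
    """Format an HTTP Range header value for one segment (the end byte is inclusive)."""
    return "bytes=%d-%d" % (offset, offset + length - 1)
```

Each (offset, length) pair would correspond to one ranged read from S3 and one block locator in the resulting manifest.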
POST /collection - read objects from S3, add/update map entries, and add the objects as files in a new collection. Respond with the UUID of the new collection.
  • Request body is a JSON-encoded hash with an "S3Path" key, as with POST /manifest_text
  • If request body hash has a "collection" key, its value must be a hash, and it will be passed to arvados.v1.collections.create when creating the new collection. (This sets the parent project, name, and description of the new collection, for example.) This "collection" hash must not contain a "manifest_text" key.

Example request:

POST /collection HTTP/1.1
Host: zzzzz.arvadosapi.com
Content-type: application/json

{
 "S3Path":"s3://abucket/aprefix/anobject",
 "collection":{
  "owner_uuid":"zzzzz-j7d0g-12345abcde12345",
  "name":"Something stored in S3" 
 }
}

Response:

HTTP/1.1 200 OK
Content-type: application/json

{
 "uuid":"zzzzz-4zz18-abcde12345abcde" 
}
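
The request-body rules for POST /collection could be checked along these lines before any S3 traffic starts. This is a minimal sketch under stated assumptions: parse_collection_request is a hypothetical helper name, and raising ValueError for every rejected body is an illustrative choice (the design above does not say which HTTP status a bad request should produce).

```python
import json


def parse_collection_request(body):
    """Validate a POST /collection body and return (s3_path, collection_attrs).

    Raises ValueError if the body breaks one of the rules described above.
    """
    try:
        request = json.loads(body)
    except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
        raise ValueError("request body is not valid JSON: %s" % err)
    if not isinstance(request, dict):
        raise ValueError("request body must be a JSON-encoded hash")

    s3_path = request.get("S3Path")
    if not isinstance(s3_path, str) or not s3_path.startswith("s3://"):
        raise ValueError('"S3Path" must be a string starting with s3://')

    # The "collection" key is optional; when present it must be a hash.
    attrs = request.get("collection", {})
    if not isinstance(attrs, dict):
        raise ValueError('"collection" must be a hash')
    # The gateway builds the manifest itself, so the client must not supply one.
    if "manifest_text" in attrs:
        raise ValueError('"collection" must not contain "manifest_text"')
    return s3_path, attrs
```

The returned attrs hash is what would be passed on to arvados.v1.collections.create, with the manifest added by the gateway.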

DELETE /collection/{uuid} - stop working on a job that was started with POST /collection.
  • Returns 404 if a collection with the specified uuid does not exist (according to the API server, when using the token provided in the DELETE request).
  • Returns 200 if a collection with the specified uuid does exist and this server is not doing any work on it (regardless of whether work was being done before the DELETE request).
  • Returns 5xx if an error prevented the server from asserting either of the above cases (e.g., work could not be cancelled, or there was an error looking up the collection).
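
The three DELETE outcomes can be read as a small decision procedure. The sketch below is illustrative only: lookup and cancel are hypothetical callables standing in for the API server query and the worker cancellation, and 500 is an assumed choice among the 5xx codes.

```python
def delete_collection_status(lookup, cancel, uuid):
    """Return the HTTP status for DELETE /collection/{uuid}.

    lookup(uuid) -> bool: does the collection exist, as seen with the
                          caller's token? (hypothetical callable)
    cancel(uuid) -> None: stop any work on it; must be a no-op when no
                          work is in progress. (hypothetical callable)
    Either callable may raise to signal that it could not do its job.
    """
    try:
        if not lookup(uuid):
            return 404
        cancel(uuid)
    except Exception:
        # Neither "does not exist" nor "no work in progress" could be asserted.
        return 500
    return 200
```

Because cancel is treated as idempotent, repeating the same DELETE request yields 200 both times, which matches the "regardless of whether work was being done" wording above.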

API for reading

GET /{locator} - Look up the given locator in the map. Retrieve the data segment from S3, and return it to the client. Return an error if the hash of the retrieved data does not match the locator.
  • Verify the signature portion of the locator (+Ahash@timestamp) before doing anything else.
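
The hash check on the read path might look like the sketch below. It assumes the locator begins with a 32-digit hexadecimal MD5 digest followed by optional "+" hints such as size and signature; locator_hash and verify_block are hypothetical names. Signature verification, which must happen first, is deliberately left out because its details are not specified here.

```python
import hashlib
import re

MD5_HEX = re.compile(r"[0-9a-f]{32}")


def locator_hash(locator):
    """Return the digest part of a locator, i.e. everything before the first "+"."""
    digest = locator.split("+", 1)[0]
    if not MD5_HEX.fullmatch(digest):
        raise ValueError("locator does not start with an MD5 hex digest")
    return digest


def verify_block(locator, data):
    """Return data unchanged if its MD5 matches the locator; raise otherwise."""
    expected = locator_hash(locator)
    actual = hashlib.md5(data).hexdigest()
    if actual != expected:
        raise ValueError("hash mismatch: expected %s, got %s" % (expected, actual))
    return data
```

A mismatch here is also the signal that a remote object has changed, so the caller could move on to the next remote object recorded for the same hash before giving up.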

Mapping locators to S3 objects

The {hash, remote object} mapping can be stored in the local filesystem.
  • A given hash can map to more than one remote object. It's worth remembering all such remote objects: if one disappears or changes, a different one should be attempted next. Suggestion: For each hash, we have a text file with one line per remote data object matching the hash.
  • When remote objects are bigger than 64 MiB, the mapping will actually be {hash, remote object segment}. This should be easy to manage if remote object references are always stored as "offset:length:remote_object_path".
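
As a rough illustration of the "offset:length:remote_object_path" format suggested above, the following sketch reads and writes one map file, with one line per remote segment. The names RemoteSegment, parse_segment_line, format_segment_line, and parse_map_file are hypothetical. Splitting on only the first two colons matters because the path itself may contain colons (for example "s3://...").

```python
from collections import namedtuple

RemoteSegment = namedtuple("RemoteSegment", ["offset", "length", "path"])


def parse_segment_line(line):
    """Parse one "offset:length:remote_object_path" line.

    Only the first two colons are separators, so colons inside the path
    are preserved. Raises ValueError on a malformed line.
    """
    offset, length, path = line.rstrip("\n").split(":", 2)
    return RemoteSegment(int(offset), int(length), path)


def format_segment_line(segment):
    """Inverse of parse_segment_line (without the trailing newline)."""
    return "%d:%d:%s" % (segment.offset, segment.length, segment.path)


def parse_map_file(text):
    """Parse the per-hash text file: one candidate remote segment per line."""
    return [parse_segment_line(line) for line in text.splitlines() if line.strip()]
```

The order of lines in the file can double as the order in which candidates are tried when one of them has disappeared or changed.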

Code Location

The source code for the server command will be in /services/keepgw-s3.

Likely, some parts of keepproxy and keepstore should be refactored to share code more effectively.
  • keepstore logs & answers client queries, verifies hashes, answers index/status queries, reads/writes data blocks on disk, enforces per-disk mutexes.
  • keepproxy logs & answers client queries, verifies hashes, connects to other keep services.
  • keepgw logs & answers client queries, verifies hashes, answers index/status queries, reads/writes a local {hash, remote object} index, connects to remote services.
Possibilities:
  • Refactor the keepstore command to consist of just the "unix volume" code; move everything else into packages like keep_server and hash_checking_reader. Create a new keepgw-s3 command.
  • Extend the keepstore command to use backing-store modules like -volume=unix:/foo and -volume=s3:bucketid.
  • Extend the keepproxy command to use backing-store modules like S3 as an alternative to keep disk services.

Open questions and risks

How does the gateway know its own UUID so it can write the appropriate +Kuuid locator hints when constructing a manifest?

How does a client know how much progress has been made on a "POST /collection" request? The worker could update the collection object each time an object is written, or each time a locator (64 MiB segment) is indexed, and this behavior could be toggled during the initial API call. But how does a caller know whether the work is finished?
  • The collection "expires_at" attribute could be set to some non-null value, to indicate that the collection is ephemeral, until it is complete. This would help avoid accidental use of partially-written collections. It would also provide automatic clean-up of partially written collections, but still permit "resume" (assuming "resume" starts before the expires_at time arrives).
  • How should the worker communicate the expected total collection size, number of blocks/files, or finish time? This could be written in the collection's properties hash under a key chosen by convention. (In the pathological case where the client provides a conflicting key in {"collection":{"properties":{...}}}, progress information would be unavailable.)
How does a client know when a "POST /collection" request has been abandoned?
  • The worker could delete the collection in case of error, but this would make "resume" impossible.
How should the server indicate to the client that progress is being made during a "POST /manifest_text" request?
  • Use HTTP chunked transfer encoding to return the manifest text one token at a time? (This could also help detect closed connections sooner.)

The DELETE /collection/{uuid} API cancels a worker thread, but at face value looks like it will delete a collection. (In a sense it can delete a collection by cancelling work and leaving the partial collection with a non-null expires_at value, but if the job is finished, the effect is nothing at all like "delete collection".) Perhaps it should be renamed to something more like DELETE /queue/{uuid}? Perhaps it should have different responses for "cancelled as a result of this request" and "already cancelled or was never happening"?

Should there be an API for providing credentials in a POST request? The choice(s) of credentials to use for each data segment could be stored in the map:
  • {
     "locator" : [{"S3Path":"bucket/prefix/object","credential_id":"local_credential_set_id"}, ...],
     ...
    }
    
  • local_credential_set_id could be the hash (and filename of local cache) of a key pair.

Other than by having admin privileges, how can a client establish permission to use S3 credentials (which are already known by the gateway server) during a POST request?

Future work

A "resume" API would be useful.