Keep Proxy Specification » History » Version 14

Ward Vandewege, 02/02/2015 09:17 PM

h1. Keep Proxy Specification

_Archived for informational purposes.  The proposal described here is now implemented in arvados/services/keep/src/

h2. Problem

We need to be able to automatically upload huge (larger than 1 TiB) datasets into Arvados.  The currently proposed solution is to upload the data to a staging area and then put the data into Keep.  On further consideration, this solution is inadequate for a number of reasons:
* A staging area big enough to accommodate large uploads must be set aside.
* When no uploads are in progress, this empty space sits idle, costing money.
* Amazon has a 1 TiB limit on EBS volumes, which means we can't accept datasets larger than 1 TiB unless we create a partition that spans multiple volumes.
* Multiple users uploading to the same staging partition can end up in a starvation deadlock if the volume fills up.
* Some of these problems could be addressed by allocating and deallocating volumes on the fly, but this adds significant complexity.
* Once data is uploaded, it still needs to be copied into Keep, which adds wait time between when the data is uploaded and when it is actually ready to use.

h2. Solution

Provide a Keep client that sends blocks to a reverse Keep proxy, which forwards the blocks to the appropriate internal Keep servers.
* Doesn't require staging except in the RAM of the Keep proxy.
* No dataset size limits except Keep's overall capacity.
* Fewer contention problems (although many uploaders could overwhelm the proxy node...)
* Data is available immediately once the upload is completed.
* This is the right thing to do in the long term anyway.  We shouldn't waste our time with messy hacks.

h2. Approach

# Develop a subset of the Arvados Go SDK that supports accessing the API server and can write to Keep servers (reading from Keep is out of scope).
** Read files in 64 MiB blocks and calculate hashes
** Pack small files into a single block
** Put 64 MiB blocks to the Keep server over HTTPS
** Create the manifest (should be in normalized form)
** Write the manifest to Keep
** Use the Google API client to talk to the API server to create the collection and metadata links
# Develop an uploader program in Go to recursively upload a directory structure
** Take the API server, API token, and directory path on the command line (plus additional metadata links to set on the collection after it is completed)
** Should be a self-contained static x64 ELF binary with minimal dependencies that will run on any modern x64 Linux.
** Use the Go Keep client library to upload blocks, create the manifest, upload the manifest to the API server, and add metadata links.
** Should checkpoint during upload so that an upload can be canceled and resumed.
# Reverse Keep Proxy
** Publicly accessible head node providing write access into Keep (read access is out of scope for this task)
** List proxy contact info in the discovery document
** Check the API token to ensure the client has permission to write
** Accept blocks from the client and forward them to the internal Keep cluster.  Extend the existing Keep Go server by writing a new volume backend that writes to the appropriate internal Keep servers instead of to disk.
** Log the block hash and user uuid for each block to the API server
** Writing to internal Keep servers and the API server will use the Arvados Go SDK
# API server
** API call allowing normal users to create special user accounts that use a combination of limited permissions and scopes to restrict them to uploading tasks.  Scopes alone are not powerful enough because a scope cannot restrict the uploader to only creating links about collections known to the uploader.
** Restricted to a few tasks, such as creating collections and creating metadata links about those collections.
** The restricted account is owned by the Arvados user, so the user can see and change everything the uploader account owns.
** The uploader account can be deactivated when the user is done with it.
** (This task can probably be separated from tasks 1-3 but is necessary to support delegation)