Keep Proxy Specification » History » Version 8

Peter Amstutz, 04/28/2014 12:43 PM

h1. Reverse Keep Proxy

h2. Problem

We need to be able to automatically upload huge (>1 TiB) datasets into Arvados.  The currently proposed solution is to upload the data to a staging area and then copy it into Keep.  On further consideration, this solution is inadequate for a number of reasons:

* Must set aside a staging area big enough to accommodate large uploads.
* When uploads are not occurring, this empty space just sits around, costing money.
* Amazon has a 1 TiB limit on EBS volumes, which means we can't accept datasets larger than 1 TiB unless we create volume-spanning partitions.
* Multiple users uploading to the same staging partition can end up in a starvation deadlock if the volume fills up.
* Some of these problems could be addressed by allocating/deallocating volumes on the fly, but this adds significant complexity.
* Once data is uploaded, it still needs to be copied into Keep, which adds wait time between when the upload finishes and when the data is actually ready to use.

h2. Solution

Provide a Keep client that sends blocks to a reverse Keep proxy, which forwards the blocks to appropriate internal Keep servers.

* Doesn't require staging except in RAM of the Keep proxy.
* No dataset limits except Keep's overall capacity
* Fewer contention problems (although many uploaders could overwhelm the proxy node...)
* Data is available immediately once upload is completed
* This is the right thing to do in the long term anyway.  We shouldn't waste our time with messy hacks.

h2. Approach

# Develop a subset of the Arvados Go SDK that supports accessing the API server and writing to Keep servers (reading from Keep is out of scope).
** Read files in 64 MiB blocks and calculate hashes
** Pack small files into a single block
** Put 64 MiB blocks to Keep server over HTTPS
** Create the manifest (should be in normalized form)
** Write manifest to Keep
** Use the Google API client to talk to the API server to create the collection and metadata links
# Develop uploader program in Go to recursively upload a directory structure
** Take API server, API token, directory path on the command line (+ additional metadata links to set on the collection after it is completed)
** Should be self-contained static x64 ELF binary with minimal dependencies that will run on any modern x64 Linux.
** Use Go Keep client library to upload blocks, create manifest, upload manifest to API server, add metadata links.
** Should checkpoint during upload so that upload can be canceled and resumed.
# Reverse Keep Proxy
** Publicly accessible head node providing write access into Keep (read access is out of scope for this task)
** List proxy node in discovery document
** Check API token to ensure client has permission to write
** Accept blocks from client, forward them to internal Keep cluster.  Extend existing Keep Go server by writing a new volume backend that writes to the appropriate internal Keep servers instead of to the disk.
** The hash and user account associated with each uploaded block are logged to the API server
** Writing to internal Keep servers and API server will use Arvados Go SDK
# API server
** API call allowing normal users to create special user accounts that use a combination of limited permissions and scopes to restrict them to uploading tasks.  Scopes alone are not powerful enough because a scope cannot restrict the uploader to only creating links about collections known to the uploader.
** Restricted to a few tasks, such as creating collections and creating metadata links about those collections.
** The restricted account is owned by the Arvados user, so the user can see and change everything the uploader account owns.
** Can deactivate uploader account when done with it.