Keep Proxy Specification » History » Version 1

Peter Amstutz, 04/28/2014 10:16 AM

h1. Reverse Keep Proxy

h2. Problem

We need to be able to automatically upload huge (1 TiB or larger) datasets into Arvados.  The currently proposed solution is to upload the data to a staging area and then put it into Keep.  On further consideration, this solution is inadequate for a number of reasons:
* We must set aside a staging area big enough to accommodate large uploads.
* When uploads are not occurring, this empty space sits idle, costing money.
* Amazon has a 1 TiB limit on EBS volumes, which means we can't accept datasets larger than 1 TiB unless we create a volume-spanning partition.
* Multiple users uploading to the same staging partition can end up in a starvation deadlock if the volume fills up.
* Some of these problems could be addressed by allocating and deallocating volumes on the fly, but this adds significant complexity.
* Once data is uploaded, it still needs to be copied into Keep, which adds wait time between when the data is uploaded and when it is actually ready to use.

h2. Solution

Provide a Keep client that sends blocks to a reverse Keep proxy, which forwards the blocks to the appropriate internal Keep servers.
* Doesn't require staging, except in the RAM of the Keep proxy.
* No limit on dataset size except Keep's overall capacity.
* Fewer contention problems (although many simultaneous uploaders could overwhelm the proxy node).
* Data is available immediately once the upload completes.
* This is the right thing to do in the long term anyway.  We shouldn't waste our time with messy hacks.
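
The data flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Keep proxy: the class names, the @replicas@ parameter, the server-selection rule, and the @md5+size@ locator format used here are all hypothetical stand-ins, and the internal Keep servers are simulated with in-memory dictionaries. The point it shows is that each block exists only in the proxy's RAM while it is being forwarded, so no staging volume is needed.

```python
import hashlib
import io


class InMemoryKeepServer:
    """Hypothetical stand-in for an internal Keep server."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def put(self, locator, data):
        self.blocks[locator] = data


class ReverseKeepProxy:
    """Accepts blocks from outside clients and forwards them inward."""

    def __init__(self, servers, replicas=2):
        self.servers = servers
        self.replicas = replicas  # assumed replication count

    def _pick_servers(self, locator):
        # Assumed placement rule: rank servers by a hash of locator + name.
        ranked = sorted(
            self.servers,
            key=lambda s: hashlib.md5((locator + s.name).encode()).hexdigest(),
        )
        return ranked[: self.replicas]

    def put_block(self, data):
        # The block is held only in RAM here; nothing is staged on disk.
        # Assumed locator format: MD5 hex digest plus the block size.
        locator = "%s+%d" % (hashlib.md5(data).hexdigest(), len(data))
        for server in self._pick_servers(locator):
            server.put(locator, data)
        return locator


def upload(proxy, stream, block_size=64 * 1024 * 1024):
    """Client side: split a stream into blocks and send each to the proxy."""
    locators = []
    while True:
        block = stream.read(block_size)
        if not block:
            break
        locators.append(proxy.put_block(block))
    return locators


if __name__ == "__main__":
    servers = [InMemoryKeepServer("keep%d" % i) for i in range(4)]
    proxy = ReverseKeepProxy(servers)
    # A tiny block size is used here only to force several blocks.
    for locator in upload(proxy, io.BytesIO(b"example dataset"), block_size=4):
        print(locator)
```

Because the client reads and sends one block at a time, memory use on both sides is bounded by the block size rather than by the dataset size, which is what removes the dataset size limit. A real proxy would also need authentication, error handling, and some form of back-pressure for the many-uploaders case noted above.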

h2. Approach

Uploading functionality already largely exists.