Keep Proxy Specification (Version 1)

Peter Amstutz, 04/28/2014 10:16 AM


Reverse Keep Proxy

Problem

We need to be able to automatically upload huge (over 1 TiB) datasets into Arvados. The currently proposed solution is to upload the data to a staging area and then put the data into Keep. On further consideration, this solution is inadequate for a number of reasons:
  • We must set aside a staging area big enough to accommodate large uploads.
  • When uploads are not occurring, this empty space just sits around, costing money.
  • Amazon has a 1 TiB limit on EBS volumes, which means we can't accept datasets larger than 1 TiB unless we create volume-spanning partitions.
  • Multiple users uploading to the same staging partition can end up in a starvation deadlock if the volume fills up.
  • Some of these problems could be addressed by allocating/deallocating volumes on the fly, but this adds significant complexity.
  • Once data is uploaded, it still needs to be copied into Keep, which adds wait time between the end of the upload and the point when the data is actually ready to use.

Solution

Provide a Keep client that sends blocks to a reverse Keep proxy, which forwards the blocks to appropriate internal Keep servers.
  • Requires no staging except in the RAM of the Keep proxy.
  • No dataset size limit except Keep's overall capacity.
  • Fewer contention problems (although many uploaders could overwhelm the proxy node...)
  • Data is available immediately once the upload is completed.
  • This is the right thing to do in the long term anyway. We shouldn't waste our time with messy hacks.
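
The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Keep client or proxy: the class and function names, the 64 MiB block size, the "md5+size" locator format, the replica count, and the server-selection rule are all assumptions introduced here, and the Keep servers are in-memory stand-ins rather than network services. The point it shows is that the client reads one block at a time, so nothing is staged on disk, and that the proxy forwards each block to internal servers as soon as it arrives.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Iterator, List

# Assumed block size for illustration; this page does not specify one.
BLOCK_SIZE = 64 * 1024 * 1024


def read_blocks(stream: BinaryIO, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield successive blocks so only one block is held in memory at a time.

    Assumes a buffered stream whose read(n) returns n bytes until EOF.
    """
    while True:
        block = stream.read(block_size)
        if not block:
            return
        yield block


class InMemoryKeepServer:
    """Hypothetical stand-in for an internal Keep server."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}

    def put(self, locator: str, data: bytes) -> None:
        self.blocks[locator] = data


class ReverseKeepProxy:
    """Hypothetical proxy: accepts a block and forwards it to internal servers."""

    def __init__(self, servers: List[InMemoryKeepServer], replicas: int = 2) -> None:
        self.servers = servers
        self.replicas = replicas

    def put(self, data: bytes) -> str:
        digest = hashlib.md5(data).hexdigest()
        # Assumed locator format: content hash plus block size.
        locator = f"{digest}+{len(data)}"
        # Assumed placement rule: rank servers by a hash of (block digest, index)
        # and forward the block to the first `replicas` servers in that order.
        ranked = sorted(
            range(len(self.servers)),
            key=lambda i: hashlib.md5(f"{digest}{i}".encode()).hexdigest(),
        )
        for i in ranked[: self.replicas]:
            self.servers[i].put(locator, data)
        return locator


def upload(stream: BinaryIO, proxy: ReverseKeepProxy,
           block_size: int = BLOCK_SIZE) -> List[str]:
    """Send a stream to the proxy block by block and return the block locators."""
    return [proxy.put(block) for block in read_blocks(stream, block_size)]


if __name__ == "__main__":
    servers = [InMemoryKeepServer() for _ in range(3)]
    proxy = ReverseKeepProxy(servers, replicas=2)
    # A tiny block size keeps the demonstration small.
    for locator in upload(io.BytesIO(b"abcdefghij"), proxy, block_size=4):
        print(locator)
```

In this sketch the proxy holds at most one block per in-flight request, which matches the "staging only in RAM" property listed above. It does nothing to limit concurrent uploaders, so the concern about overwhelming the proxy node still applies.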

Approach

Uploading functionality already largely exists