Feature #13126

[keep] Investigate using signed URLs to delegate access to cloud buckets

Added by Peter Amstutz 10 months ago. Updated 10 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

Currently, keepstore is the gateway to the backend object store, so all data has to flow through the keepstore nodes. This is a bottleneck, which ops usually address by using more expensive keepstore nodes (to get more bandwidth) or by adding more keepstore nodes.

Some object storage systems, such as S3, have the concept of "signed URLs". A signed URL is similar to an Arvados signing token: a secret that gives time-limited access to read a specific object.

Investigate the performance/scaling behavior of the following alternate flow:

  1. client requests a block from keepstore
  2. keepstore receives and validates the request as normal
  3. keepstore requests a signed URL from backend object store for the block
  4. keepstore returns a 302 redirect to the signed URL to the client
  5. client receives redirect and makes a new request to fetch the block content from the signed URL
  6. client checks block md5sum and proceeds as normal, or tries another keepstore if there is an error
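
The flow above can be sketched in-process, without any network calls. This is only an illustration under stated assumptions: every name here (BUCKET_SECRET, BUCKET_URL, OBJECTS, sign_url, object_store_get, keepstore_get, client_get) is hypothetical, and the HMAC scheme is a toy stand-in for a real object store's signed URLs, not the algorithm S3 actually uses.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical stand-ins: a secret shared by the signer and the object
# store, a bucket base URL, and a dict playing the role of the bucket.
BUCKET_SECRET = b"object-store-secret"
BUCKET_URL = "https://bucket.example.com"
OBJECTS = {}  # block MD5 (hex) -> block bytes


def _signature(block_hash, expires):
    # Toy signature over method, object name and expiry (NOT S3's scheme).
    msg = f"GET\n{block_hash}\n{expires}".encode()
    return hmac.new(BUCKET_SECRET, msg, hashlib.sha256).hexdigest()


def sign_url(block_hash, ttl_seconds, now=None):
    """Step 3: build a time-limited URL granting read access to one object."""
    start = now if now is not None else time.time()
    expires = int(start + ttl_seconds)
    query = urlencode({"expires": expires, "sig": _signature(block_hash, expires)})
    return f"{BUCKET_URL}/{block_hash}?{query}"


def object_store_get(url, now=None):
    """The object store checks the signature and expiry before serving data."""
    parsed = urlparse(url)
    block_hash = parsed.path.lstrip("/")
    query = parse_qs(parsed.query)
    expires = int(query["expires"][0])
    if not hmac.compare_digest(_signature(block_hash, expires), query["sig"][0]):
        return 403, b""
    if (now if now is not None else time.time()) > expires:
        return 403, b""
    if block_hash not in OBJECTS:
        return 404, b""
    return 200, OBJECTS[block_hash]


def keepstore_get(block_hash, token_ok):
    """Steps 2-4: validate the request, then redirect instead of proxying data."""
    if not token_ok:
        return 403, None
    return 302, sign_url(block_hash, ttl_seconds=60)


def client_get(block_hash, token_ok=True):
    """Steps 1, 5 and 6: request the block, follow the redirect, verify the MD5."""
    status, location = keepstore_get(block_hash, token_ok)
    if status != 302:
        raise PermissionError(f"keepstore returned {status}")
    status, data = object_store_get(location)
    if status != 200:
        raise OSError(f"object store returned {status}")
    if hashlib.md5(data).hexdigest() != block_hash:
        raise ValueError("MD5 mismatch; try another keepstore")
    return data
```

In this sketch keepstore handles only the small request and redirect response; the block bytes travel between the client and the object store directly, which is the property the investigation would need to measure against the added round trip.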

The benefit of this approach is that the data transfer load is moved off keepstore, and compute nodes communicate directly with the object store. This should scale better. However, there is also a potential latency penalty from adding the extra "request signed URL and redirect" operation.

On AWS, signed URLs can also be used for PUT operations. AWS permits signed URLs that assert that only data that hashes to a specific MD5 will be accepted. However, keepstore needs to verify the block and return an Arvados signing token, and it is not clear how that would work with S3 signed URLs.
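
One detail relevant to the PUT case: the HTTP Content-MD5 header carries the base64 encoding of the raw 16-byte MD5 digest (RFC 1864), not the hex string. A minimal sketch of the conversion follows, assuming the block is identified by its hex MD5 digest; the function name content_md5_for_block is hypothetical.

```python
import base64
import hashlib


def content_md5_for_block(block_hash_hex):
    """Convert a hex MD5 digest into the base64 form used by Content-MD5."""
    raw_digest = bytes.fromhex(block_hash_hex)
    if len(raw_digest) != 16:
        raise ValueError("an MD5 digest must be exactly 16 bytes")
    return base64.b64encode(raw_digest).decode("ascii")


# Usage: derive the header value for a block before uploading it.
block = b"example block data"
header_value = content_md5_for_block(hashlib.md5(block).hexdigest())
```

This covers only the integrity assertion on the upload; it does not address the open question of how keepstore would then verify the block and issue an Arvados signing token.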

Reference:

https://docs.aws.amazon.com/sdk-for-go/v1/developer-guide/s3-example-presigned-urls.html

History

#1 Updated by Peter Amstutz 10 months ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz 10 months ago

  • Subject changed from Investigate using signed URLs to delegate access to cloud buckets to [keep] Investigate using signed URLs to delegate access to cloud buckets
  • Description updated (diff)
  • Status changed from In Progress to New

#3 Updated by Peter Amstutz 10 months ago

  • Description updated (diff)

#4 Updated by Tom Morris 10 months ago

Rather than starting with an answer, I'd like to see us start with a question or problem statement. In my mind, the goal is to remove all bottlenecks in accessing the storage layer. All cloud vendors provide highly scalable storage fabrics with reliable transport, integrity checksums, and permission mechanisms. To the extent that we can, we should leverage those capabilities rather than duplicate them.
