Objects as pseudo-blocks in Keep
Idea for accessing external objects via Keep (specifically S3)
The thought we've bounced around for a while has been to read the contents of an object, split it into 64 MiB blocks, and record each block hash in a database along with a reference to the object and offset.
Here is a different approach to this idea. (Tom floated a version of this at one of our engineering meetings, but I think I didn't like it / we didn't fully explore it at the time.)
Block id
For an s3 object of 1234 bytes located at s3://bucket/key
ffffffffffffffffffffffffffffffff+512+B(base64 encode of s3://bucket/key)+C256
From my research, some values such as the ETag can be an MD5 of the content in certain circumstances, but this isn't true in general. So for these pseudo-blocks, I propose deriving the hash from the (size, +B, +C) hints.
For S3 specifically, if the bucket supports versioning and we use ?versionId= on all URLs, blocks can be treated as immutable.
In this example:
- It is 512 bytes long.
- The hint +B means data should be fetched from an s3:// URL. In this case it is base64 encoded (this is necessary to match our locator syntax).
- The hint +C means start reading at byte offset 256.
So this describes the range of bytes from 256 up to (but not including) 768.
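As a rough illustration of the locator format described above, here is a minimal Python sketch that builds and parses such a locator. Several things here are assumptions, not part of the proposal: the function names are hypothetical, the all-f hash is only the placeholder used in the example, and the URL is encoded with URL-safe base64 with the padding stripped so that the encoded value cannot contain "+", "/" or "=" characters that would clash with locator syntax.

```python
import base64
import re

# Placeholder hash, as in the example locator; the real derivation is still open.
PLACEHOLDER_HASH = "f" * 32

# hash + size + B<encoded URL> + C<offset>
LOCATOR_RE = re.compile(r"^([0-9a-f]{32})\+(\d+)\+B([A-Za-z0-9_-]+)\+C(\d+)$")


def encode_url(url):
    """Encode a URL for use in a +B hint (assumption: URL-safe base64, no padding)."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii").rstrip("=")


def decode_url(token):
    """Reverse encode_url(), restoring the stripped '=' padding first."""
    padded = token + "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


def build_locator(url, size, offset, block_hash=PLACEHOLDER_HASH):
    """Format a pseudo-block locator for `size` bytes of `url` starting at `offset`."""
    return f"{block_hash}+{size}+B{encode_url(url)}+C{offset}"


def parse_locator(locator):
    """Split a pseudo-block locator into its parts and compute the byte range."""
    m = LOCATOR_RE.match(locator)
    if m is None:
        raise ValueError(f"not an external pseudo-block locator: {locator!r}")
    block_hash, size, token, offset = m.groups()
    size, offset = int(size), int(offset)
    return {
        "hash": block_hash,
        "size": size,
        "url": decode_url(token),
        "offset": offset,
        "end": offset + size,  # exclusive end of the byte range
    }


if __name__ == "__main__":
    loc = build_locator("s3://bucket/key", size=512, offset=256)
    print(loc)
    print(parse_locator(loc))
```

Round-tripping the example through build_locator() and parse_locator() gives back the s3:// URL together with the byte range starting at 256 and ending before 768.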
Block stream
Large files can be split, e.g.
ffffffffffffffffffffffffffffffff+67108864+B(base64 encode of s3://bucket/key)+C0
ffffffffffffffffffffffffffffffff+67108864+B(base64 encode of s3://bucket/key)+C67108864
ffffffffffffffffffffffffffffffff+67108864+B(base64 encode of s3://bucket/key)+C134217728
However, this repeats the +B portion a bunch of times, so we could allow the manifest to describe oversized blocks:
ffffffffffffffffffffffffffffffff+1000000000+B(base64 encode of s3://bucket/key)+C0
Implementation-wise, this would be split into the previous example of 64 MiB chunks at runtime when the manifest is loaded (and re-compressed when the manifest is saved). The block cache would need to use the full locator (with +B and +C) or have some other means of distinguishing regular Keep blocks from these external reference pseudo-blocks.
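The split-on-load and re-compress-on-save steps could look roughly like the following Python sketch. It only deals with (size, offset) pairs for a single external object; the function names are hypothetical, and how the hash of each chunk would be derived is left out.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB = 67108864 bytes


def split_range(size, offset=0, block_size=BLOCK_SIZE):
    """Split [offset, offset + size) into (chunk_size, chunk_offset) pairs.

    Every chunk is block_size bytes except possibly the last one.
    """
    chunks = []
    pos = 0
    while pos < size:
        n = min(block_size, size - pos)
        chunks.append((n, offset + pos))
        pos += n
    return chunks


def merge_ranges(ranges):
    """Coalesce contiguous (size, offset) pairs back into oversized ranges.

    Assumes all pairs refer to the same external object (same +B hint).
    """
    merged = []
    for size, offset in ranges:
        if merged and merged[-1][1] + merged[-1][0] == offset:
            prev_size, prev_offset = merged[-1]
            merged[-1] = (prev_size + size, prev_offset)
        else:
            merged.append((size, offset))
    return merged


if __name__ == "__main__":
    chunks = split_range(1000000000)
    print(len(chunks), chunks[0], chunks[-1])
    print(merge_ranges(chunks))
```

For the 1000000000-byte example this yields 15 chunks: 14 full 64 MiB chunks followed by a final chunk of 60475904 bytes at offset 939524096. Merging them again gives back the single oversized range.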
Keepstore support
Add support for locators of this type to Keepstore. Keepstore already needs to be able to interact with S3 buckets.
Keepstore would need to be able to read the buckets. This could be done either with a blanket policy (allow keepstore/compute nodes to read specific buckets) and/or by adding a feature to store AWS credentials in Arvados in a way such that Keepstore, having the user's API token, is able to fetch them and use them (such as on the API token record).
This interacts awkwardly with Arvados sharing: without additional features, sharing a collection with someone doesn't mean they can actually read it, because they may not have access to the underlying bucket.
Instead of using hints, we could store blocks whose content is +size +B +C; when such a block is read from Keepstore, it notices that the block matches this format and reads from S3 instead.
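To make the read path more concrete, here is a small Python sketch of how the URL and the (offset, size) pair from a pseudo-block might be translated into the parameters of a ranged S3 GET. This is only an illustration: Keepstore itself is written in Go, the function name is hypothetical, and the parameter names (Bucket, Key, Range, VersionId) are borrowed from boto3's get_object call purely as a familiar vocabulary.

```python
from urllib.parse import parse_qs, urlsplit


def s3_get_params(url, offset, size):
    """Translate an s3:// URL plus a byte range into ranged-GET parameters."""
    parts = urlsplit(url)
    if parts.scheme != "s3" or not parts.netloc or len(parts.path) < 2:
        raise ValueError(f"expected s3://bucket/key, got {url!r}")
    if size <= 0 or offset < 0:
        raise ValueError("size must be positive and offset non-negative")
    params = {
        "Bucket": parts.netloc,
        "Key": parts.path[1:],  # drop the leading "/"
        # HTTP byte ranges are inclusive at both ends, hence the -1.
        "Range": f"bytes={offset}-{offset + size - 1}",
    }
    version = parse_qs(parts.query).get("versionId")
    if version:
        # Pin the read to one object version so the block stays immutable.
        params["VersionId"] = version[0]
    return params


if __name__ == "__main__":
    print(s3_get_params("s3://bucket/key?versionId=abc", offset=256, size=512))
```

Note that the Range header for the example block is "bytes=256-767", since the last byte position in an HTTP range is inclusive.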
SDK support
This approach limits the amount of S3-specific code directly in the client -- the goal should be to avoid having to import boto3.
The Collection class gets a new "import_from_s3()" method (or maybe an overload of the "copy" method) which takes the s3:// URL. This contacts the Keepstore server, provides the s3:// URL, and gets back the appropriately formatted block locator. Keepstore should check that the object exists and the user can access it, and get the current versionId.
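As a sketch of what the end result of such an import might look like, the following Python builds a one-file manifest stream that references an S3 object through a single oversized pseudo-block. Everything here is hypothetical: in the proposal the locator would come back from Keepstore (along with the size and versionId), whereas this stand-in function takes them as arguments, uses the placeholder all-f hash, and assumes URL-safe unpadded base64 for the +B hint. The file is placed in the top-level "." stream under the last component of the key.

```python
import base64


def manifest_for_s3_object(url, size, version_id=None):
    """Return manifest text with one file backed by an external pseudo-block.

    Hypothetical stand-in for what import_from_s3() might produce after
    Keepstore has confirmed the object's size and current versionId.
    """
    if version_id is not None:
        url = f"{url}?versionId={version_id}"
    token = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii").rstrip("=")
    locator = f"{'f' * 32}+{size}+B{token}+C0"
    # File name: last path component of the key; spaces are escaped as \040
    # because manifest tokens are space-separated.
    filename = url.split("?", 1)[0].rsplit("/", 1)[-1].replace(" ", "\\040")
    # Stream name, block locator, then position:size:filename within the stream.
    return f". {locator} 0:{size}:{filename}\n"


if __name__ == "__main__":
    print(manifest_for_s3_object("s3://bucket/dir/sample.fastq", 1000000000, "v1"), end="")
```

The client never talks to S3 here; it only needs the object's size and the locator, which is the point of keeping boto3 out of the SDK.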
Advantages
- This strategy is similar to how we approach federation, which reduces the number of dramatic changes in the architecture
- If locators of this type are supported by Keepstore, then Go and Python SDKs require relatively few changes (they continue to fetch blocks from Keepstore).
- Does not require downloading and indexing files
- Can still get a unique PDH for the collection
- Can mix S3 objects and regular Keep objects, so Arvados becomes generally useful for organizing data in buckets (although changes in Keep don't propagate down to the bucket; moving data once it has been written is crappy in S3 anyway, so you don't do it).
Disadvantages
- Can't verify file contents.
- Requires working with AWS access control, whether by granting blanket read access ahead of time to certain specific buckets, storing credentials, or some other mechanism we haven't designed yet
- Sharing a collection with another person requires granting permission in both Arvados and AWS.
- The fact that a given manifest contains references to S3 objects is opaque to the user and could produce confusing errors
- Given an s3:// id for an object, can't efficiently find which collections use it (but this is a feature currently missing from Keep in general; keep-balance could do something here if needed, or we could implement a block index in the future)
vs alternatives
vs copying
- slow, have to read and write all the data
- have to pay for duplicated storage
vs indexing external data as keep blocks
- slow, have to read data to index it
- have to store mapping of block ids to external references somewhere
- keepstore still needs to have permissions/credentials to read buckets
vs collection as passthrough to bucket
- components that work with manifests/Keep directly (e.g. Python SDK) have to be rewritten to use keep-web
- would probably end up relying on a 3rd party FUSE driver (e.g. s3fs or aws-mountpoint) in crunch-run
- no good way to assign a portable data hash or do collection versioning because bucket contents can change outside of our control
- copying, moving, renaming files either doesn't work or is expensive compared to regular Keep since S3 can't do those things efficiently
- keep-web still needs to have permissions/credentials to read buckets
- if this ends up being a list of objects (rather than an object prefix) it is effectively another version of the same proposal
How a Release highlight might be written
Keep collections can now reference data stored as external S3 objects. This feature allows workflows, Workbench, SDKs and command line tools to get data stored in external S3 buckets, avoiding the need for a separate copy step. With this feature, users are able to both run workflows and use the full set of data management features offered by Arvados to keep track of data stored in S3, including efficient move, copy, rename of files, organization of data into collections and projects, dataset-level versioning, and collection metadata; all without modifying the underlying S3 bucket.
Users can import objects from S3 using the Python SDK, Workbench, or as inputs to arvados-cwl-runner. Users provide a list of objects and/or prefixes and Arvados will construct a collection that contains references to those objects. From there, users will be able to work with data as standard Keep collections, and Keep will fetch data from external buckets as needed.
Updated by Peter Amstutz 7 months ago · 7 revisions