Objects as pseudo-blocks in Keep » History » Version 3
Peter Amstutz, 05/28/2024 10:18 PM
h1. Objects as pseudo-blocks in Keep

Idea for accessing external objects via Keep (specifically S3)

The thought we've bounced around for a while has been to read the contents of an object, split it into 64 MiB blocks, and record each block hash in a database along with a reference to the object and offset.

Here is a different approach to this idea. (Tom floated a version of this at one of our engineering meetings, but we didn't fully explore it at the time.)

h3. Block id

For an s3 object of 1234 bytes located at s3://bucket/key:

<pre>
ffffffffffffffffffffffffffffffff+512+B(base64 encode of s3://bucket/key)+C256
</pre>

By my research, some values such as the ETag can be an MD5 in certain circumstances, but this isn't true in general. So for these pseudo-blocks, I propose deriving the block's identity from the (size, @+B@, @+C@) hints instead of a real content hash.

For S3 specifically, if the bucket supports versioning and we use @?versionId=@ on all URLs, blocks can be treated as immutable.

In this example:

* It is 512 bytes long.
* The hint @+B@ means the data should be fetched from an s3:// URL. In this case the URL is base64 encoded (this is necessary to match our locator syntax).
* The hint @+C@ means read from offset 256 bytes.

So this describes the range of bytes from 256 to 768.

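As a rough illustration, a locator of this shape could be assembled as follows. This is a sketch, not a settled design: the function name and the choice of URL-safe base64 are assumptions.

```python
import base64


def make_pseudo_block_locator(url, size, offset):
    """Build a pseudo-block locator for a byte range of an external object.

    The hash field is a fixed placeholder because the real content hash is
    unknown; the block's identity comes from the size and the +B/+C hints.
    URL-safe base64 (alphabet A-Za-z0-9-_) avoids '+' inside the hint, so
    splitting a locator on '+' remains unambiguous.
    """
    encoded = base64.urlsafe_b64encode(url.encode()).decode()
    return "ffffffffffffffffffffffffffffffff+%d+B%s+C%d" % (size, encoded, offset)


# The 512-byte range starting at offset 256, as in the example above:
print(make_pseudo_block_locator("s3://bucket/key", 512, 256))
# → ffffffffffffffffffffffffffffffff+512+BczM6Ly9idWNrZXQva2V5+C256
```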
h3. Block stream

Large files can be split, e.g.

<pre>
ffffffffffffffffffffffffffffffff+67108864+B(base64 encode of s3://bucket/key)+C0
ffffffffffffffffffffffffffffffff+67108864+B(base64 encode of s3://bucket/key)+C67108864
ffffffffffffffffffffffffffffffff+67108864+B(base64 encode of s3://bucket/key)+C134217728
</pre>

However, this repeats the @+B@ portion a number of times, so we could allow the manifest to describe oversized blocks:

<pre>
ffffffffffffffffffffffffffffffff+1000000000+B(base64 encode of s3://bucket/key)+C0
</pre>

Implementation-wise, an oversized block would be split into the previous example of 64 MiB chunks at runtime when the manifest is loaded (and re-compressed when the manifest is saved). The block cache would need to use the full locator (with @+B@ and @+C@), or have some other means of distinguishing regular Keep blocks from these external-reference pseudo-blocks.

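A minimal sketch of that runtime splitting step, assuming a fixed 64 MiB chunk size; the function and parameter names are illustrative, not part of any existing SDK:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB


def split_oversized_locator(size, b_hint, start_offset=0):
    """Expand one oversized pseudo-block into a list of <= 64 MiB
    pseudo-block locators, as a manifest loader might do when the
    manifest is read.  b_hint is the already-encoded +B payload."""
    locators = []
    offset = start_offset
    remaining = size
    while remaining > 0:
        chunk = min(BLOCK_SIZE, remaining)
        locators.append(
            "ffffffffffffffffffffffffffffffff+%d+B%s+C%d" % (chunk, b_hint, offset))
        offset += chunk
        remaining -= chunk
    return locators


# A 1 GB (decimal) oversized block expands to 15 chunks: 14 full 64 MiB
# chunks plus a final 60,475,904-byte remainder.
locs = split_oversized_locator(1000000000, "QkFTRTY0")
print(len(locs), locs[0], locs[-1])
```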
h3. Keepstore support

Add support for locators of this type to Keepstore. Keepstore already needs to be able to interact with S3 buckets.

Keepstore would need to be able to read the buckets. This could be done either with a blanket policy (allow keepstore/compute nodes to read specific buckets) and/or by adding a feature to store AWS credentials in Arvados in such a way that Keepstore, given the user's API token, is able to fetch and use them (for example, on the API token record).

This interacts awkwardly with Arvados sharing: without additional features, sharing a collection doesn't mean the recipient can actually read it.

h3. SDK support

This approach limits the amount of S3-specific code directly in the client -- the goal should be to avoid having to import boto3.

The Collection class gets a new @import_from_s3()@ method (or maybe an overload of the @copy@ method) which takes the s3:// URL. This contacts the Keepstore server, provides the s3 URL, and gets back the appropriately formatted block locator. Keepstore should check that the object exists and the user can access it, and get the current versionId.

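To make the intended flow concrete, here is a sketch of what that import path could look like on the client side. Everything here is hypothetical: @import_from_s3@ does not exist yet, and @lookup_object@ stands in for the Keepstore round trip that validates access and mints the locator.

```python
def import_from_s3(manifest_stream, filename, s3_url, lookup_object):
    """Sketch of the proposed import flow.

    lookup_object(url) stands in for asking Keepstore to verify that the
    object exists, that the user can read it, and to return
    (size, pseudo_block_locator).  The client never talks to S3 itself,
    so it needs no boto3 import.
    """
    size, locator = lookup_object(s3_url)
    # Append one file entry referencing the pseudo-block, using the usual
    # manifest file-token form "position:size:name".
    return "%s %s 0:%d:%s\n" % (manifest_stream, locator, size, filename)


# Stub standing in for Keepstore's "check the object and mint a locator" step.
def fake_lookup(url):
    return (512, "ffffffffffffffffffffffffffffffff+512+BczM6Ly9idWNrZXQva2V5+C256")


print(import_from_s3(".", "data.bin", "s3://bucket/key?versionId=abc", fake_lookup))
```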
h3. Advantages

* This strategy is similar to how we approach federation, which reduces the number of dramatic changes to the architecture.
* If locators of this type are supported by Keepstore, then the Go and Python SDKs require relatively few changes (they continue to fetch blocks from Keepstore).
* Does not require downloading and indexing files.
* Can still get a unique PDH for the collection.
* Can mix S3 objects and regular Keep objects, so Arvados becomes generally useful for organizing data in buckets (changes in Keep don't propagate down to the bucket, but moving data once it has been written is awkward in S3 anyway, so in practice you don't do it).

h3. Disadvantages

* Can't verify file contents.
* Requires working with AWS access control, whether by granting blanket read access ahead of time to certain specific buckets, storing credentials, or some other mechanism we haven't designed yet.
* Sharing a collection with another person requires granting permission in both Arvados and AWS.
* The fact that a given manifest contains references to S3 objects is opaque to the user and could produce confusing errors.
* Given an s3:// id for an object, we can't efficiently find which collections use it (but this is a feature currently missing from Keep in general; keep-balance could do something here if needed, or we could implement a block index in the future).