Idea #21936
open
Minimum viable external data access feature
Start date:
Due date:
Story points: -
Description
- Manifest format extended to support a link to an external resource as a block "hint", plus a hint giving the offset within the external resource
- The S3 object version ID must be included in the locator if the bucket has versioning enabled (as ?versionId=)
- The block "size" needs to be the content size, not the size of the string that was hashed to get the md5 (see below)
- Keepstore gets an API which takes an external resource URL (s3://) and verifies that the object is accessible, fetches metadata, generates the md5, and returns a manifest stream fragment
- The block identifier is an md5sum based on the locator, etag, offset and length
- For versioned buckets, it should include the version in the locator, as ?versionId=, because we want to only retrieve that exact version
- For non-versioned buckets, the metadata will include the etag, so if the
- Python SDK method which takes an external resource URL and calls keepstore to get a manifest stream fragment
- Keepstore supports fetching blocks that have an external resource hint
- Need to compute the md5 based on locator, version, etag, offset and length of the real resource to ensure that the external object hasn't changed; check whether it matches the block from the collection, and return an error if it does not (could be 404, 409 Conflict or 410 Gone)
- Python and Go SDKs handle blocks with external resource hints, where the MD5 corresponds to a hash of the locator hint and not the content itself
- Cache management might need some attention
- arvados-cwl-runner supports s3 object inputs by using this API to create a collection with links to external resources
- Keep-balance ignores blocks with external links
- Keepstore and compute nodes have permission to read s3 buckets where resources are located via IAM instance roles
- Store credentials associated with S3 buckets in Arvados config.yml, which are used by keepstore when IAM instance roles are not available.
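
A minimal sketch of the block identifier scheme described above, to make the hashing and the change check concrete. This is not the Keepstore implementation. All function names, the exception class and the newline-joined canonical string are assumptions for illustration, since the description does not fix an encoding for the hashed string. The external resource hint syntax is also left out because it is not defined here; only the usual md5+size prefix of a block locator is shown.

```python
import hashlib


class ExternalObjectChanged(Exception):
    """Hypothetical error: the external object no longer matches the block hash."""


def versioned_locator(url, version_id=None):
    # Append ?versionId= only when the bucket has versioning enabled.
    return f"{url}?versionId={version_id}" if version_id else url


def external_block_hash(locator, etag, offset, length):
    # Assumed canonical form: the four fields joined by newlines.
    # The description only says the md5 is "based on" these values.
    canonical = "\n".join([locator, etag, str(offset), str(length)])
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def block_token(locator, etag, offset, length):
    # The "size" part is the content length, not the length of the hashed string.
    return f"{external_block_hash(locator, etag, offset, length)}+{length}"


def verify_block(expected_hash, locator, current_etag, offset, length):
    # Recompute the hash from the object's current metadata. A mismatch means
    # the object changed; a server could map this to 404, 409 or 410.
    actual = external_block_hash(locator, current_etag, offset, length)
    if actual != expected_hash:
        raise ExternalObjectChanged(locator)


if __name__ == "__main__":
    # Hypothetical bucket, key, version and etag values.
    loc = versioned_locator("s3://example-bucket/reads.fastq", "v1")
    print(block_token(loc, "etag-abc", 0, 1024))
```

Because the version is part of the locator and the etag is part of the hashed string, a change to either one produces a different md5, so a stale block reference fails the check instead of silently returning different content.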