Idea #21936

open

Minimum viable external data access feature

Added by Peter Amstutz 4 months ago. Updated 4 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: Keep
Target version:
Start date:
Due date:
Story points: -

Description

  • Manifest format is extended to support a link to an external resource as a block "hint", plus a second hint giving the offset within the external resource
    • The S3 object version ID must be included in the locator (as ?versionId=) if the bucket has versioning enabled
    • The block "size" needs to be the content size, not the size of the string that was hashed to get the md5 (see below)
  • Keepstore gets an API which takes an external resource URL (s3://) and verifies that the object is accessible, fetches metadata, generates the md5, and returns a manifest stream fragment
    • The block identifier is an md5sum based on the locator, etag, offset and length
    • For versioned buckets, the locator should include the version (as ?versionId=) because we want to retrieve only that exact version
    • For non-versioned buckets, the metadata will include the etag, so if the
  • Python SDK method which takes external resource URL, calls keepstore to get a manifest stream fragment
  • Keepstore supports fetching blocks that have an external resource hint
    • Keepstore needs to compute the md5 from the locator, version, etag, offset and length of the real resource to ensure that the external object hasn't changed, and check that it matches the block from the collection; if it does not match, return an error (could be 404, 409 Conflict or 410 Gone)
  • Python and Go SDK handle blocks with external resource hints, where the MD5 corresponds to a hash of the locator hint and not the content itself
    • Cache management might need some attention
  • arvados-cwl-runner supports S3 object inputs by using this API to create a collection with links to external resources
  • Keep-balance ignores blocks with external links
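
As a rough illustration of the block identifier described above, the following Python sketch hashes the object's identity (locator, etag, offset, length) instead of its content. The function names and the newline-joined serialization are assumptions made for this example; the ticket does not specify how the fields are combined before hashing.

```python
import hashlib
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def versioned_locator(url: str, version_id: Optional[str]) -> str:
    """Append ?versionId= to an s3:// URL when the bucket is versioned."""
    if version_id is None:
        return url
    return f"{url}?versionId={version_id}"


def locator_version(locator: str) -> Optional[str]:
    """Return the versionId carried by a locator, or None if there is none."""
    values = parse_qs(urlsplit(locator).query).get("versionId")
    return values[0] if values else None


def external_block_id(locator: str, etag: str, offset: int, length: int) -> str:
    """md5 over the object's identity, not over the object's bytes.

    Assumption: fields are joined with newlines. The real serialization
    is not defined in this ticket.
    """
    payload = "\n".join([locator, etag, str(offset), str(length)])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical bucket, key, version and etag values.
    loc = versioned_locator("s3://example-bucket/reads.fastq", "v1")
    print(loc, external_block_id(loc, "etag-abc", 0, 67108864))
```

Because the etag and version are part of the hashed string, any change to the external object yields a different identifier, which is what lets a reader detect that the object no longer matches the collection.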
Assumptions:
  • Keepstore and compute nodes have permission to read s3 buckets where resources are located via IAM instance roles
Possibly required, TBD:
  • Store credentials associated with S3 buckets in Arvados config.yml, which are used by keepstore when IAM instance roles are not available.
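
The fetch-time check that Keepstore performs on blocks with an external resource hint could look roughly like the sketch below. It is only an illustration: ObjectMetadata, identity_hash and check_external_block are hypothetical names, the metadata is assumed to have been fetched from S3 already (not shown), and the mapping of failures to 404 or 409 is one possible choice among the status codes the ticket lists.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectMetadata:
    """Current metadata of the external object (hypothetical shape)."""
    etag: str
    size: int


def identity_hash(locator: str, etag: str, offset: int, length: int) -> str:
    # Assumption: newline-joined fields; the ticket does not fix a format.
    payload = "\n".join([locator, etag, str(offset), str(length)])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def check_external_block(expected_md5: str, locator: str, offset: int,
                         length: int,
                         current: Optional[ObjectMetadata]) -> int:
    """Return an HTTP-style status for a block with an external hint."""
    if current is None:
        return 404  # object is no longer accessible
    if offset + length > current.size:
        return 409  # requested range no longer fits the object
    if identity_hash(locator, current.etag, offset, length) != expected_md5:
        return 409  # etag (or version) changed since the manifest was made
    return 200      # safe to read the byte range from the external object


if __name__ == "__main__":
    loc = "s3://example-bucket/reads.fastq"  # hypothetical object
    md5 = identity_hash(loc, "etag-abc", 0, 1024)
    print(check_external_block(md5, loc, 0, 1024, ObjectMetadata("etag-abc", 4096)))
```

Only metadata is needed for this check, so the object's content does not have to be downloaded and hashed before a mismatch is reported.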

Related issues

Related to Arvados Epics - Idea #15960: Computing on external data (Status: New, Start date: 08/01/2024, Due date: 03/31/2025)
Actions #1

Updated by Peter Amstutz 4 months ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 4 months ago

  • Related to Idea #15960: Computing on external data added