Idea #21936

Updated by Peter Amstutz about 2 months ago

* Manifest format extended to support a link to an external resource as a block "hint", plus a hint giving the offset within the external resource 
 ** The block "size" needs to be the content size, not the size of the string that was hashed to get the md5 (see below) 
 * Keepstore gets an API which takes an external resource URL (s3://) and verifies that the object is accessible, fetches metadata, generates the md5, and returns a manifest stream fragment 
 ** The block identifier is an md5sum based on the locator, version, etag, offset and length 
 ** For versioned buckets, it should include the version in the locator 
 ** For non-versioned buckets, the metadata will include the etag, so if the should  
 * Python SDK method which takes an external resource URL and calls keepstore to get a manifest stream fragment 
 * Keepstore supports fetching blocks that have an external resource hint 
 ** Need to compute the md5 based on the locator, version, etag, offset and length of the real resource to ensure that the external object hasn't changed, check whether it matches the block from the collection, and return an error (could be 404, 409 Conflict or 410 Gone) if it does not 
 * Python and Go SDK handle blocks with external resource hints, where the MD5 corresponds to a hash of the locator hint and not the content itself 
 ** Cache management might need some attention 
 * arvados-cwl-runner supports s3 object inputs by using this API to create a collection with links to external resources 
 * Keep-balance ignores blocks with external links 
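
 A minimal sketch of how the block identifier and the fetch-time check described above might work. Everything here is an assumption for illustration: the @ExternalRange@ type, the function names, the newline-joined field encoding, and the choice of 409 as the mismatch status are all hypothetical and not part of any existing Keepstore or SDK API.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExternalRange:
    """A byte range of an external object (hypothetical type and field names)."""
    url: str                 # external resource URL, e.g. "s3://bucket/key"
    version: Optional[str]   # object version for versioned buckets, else None
    etag: Optional[str]      # etag taken from the object metadata
    offset: int              # start of the range within the object
    length: int              # number of content bytes in the range


def block_id(r: ExternalRange) -> str:
    """md5 over locator, version, etag, offset and length.

    The field order and the newline separator are assumptions; the proposal
    only says which fields feed the hash.
    """
    fields = [r.url, r.version or "", r.etag or "", str(r.offset), str(r.length)]
    return hashlib.md5("\n".join(fields).encode("utf-8")).hexdigest()


def block_locator(r: ExternalRange) -> str:
    """Identifier plus size, where size is the content size of the range,
    not the size of the string that was hashed."""
    return f"{block_id(r)}+{r.length}"


def verify(expected_locator: str, current: ExternalRange) -> int:
    """Recompute the locator from the object's current metadata.

    Returns 200 if it still matches the locator stored in the collection,
    otherwise 409 (the proposal also mentions 404 or 410 as options).
    """
    return 200 if block_locator(current) == expected_locator else 409
```

 Because the etag is part of the hashed fields, an object that is overwritten in a non-versioned bucket yields a different identifier, so @verify@ reports a mismatch instead of silently returning changed data.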

 Assumptions: 
 * Keepstore and compute nodes have permission to read s3 buckets where resources are located via IAM instance roles 

 Possibly required, TBD: 
 * Store credentials associated with S3 buckets in Arvados config.yml, which are used by keepstore when IAM instance roles are not available. 
