Idea #15960

Computing on external data

Added by Peter Amstutz over 4 years ago. Updated 12 days ago.

Status: New
Priority: Normal
Assigned To: -
Target version: -
Start date: 08/01/2024
Due date: 12/31/2024 (Due in about 5 months)
Story points: -
Release:
Release relationship: Auto

Description

Right now, automatic HTTP download in cwl-runner effectively fulfills this function for users, although it copies the downloaded data into the local keepstore. Users would probably like it to be expanded to also support copying from s3:// URLs.

However, the big idea for this epic is on-demand retrieval from external storage: the data stays in the external system and is fetched only when it is needed.

Previous designs involved reading all the data to generate content hashes.

The current design is outlined in https://dev.arvados.org/issues/21936 and involves storing locators to external data in the manifest. The block identifiers are based on hashing the locator (and other metadata) instead of the content.
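
As a rough illustration of the difference from content hashing, the sketch below derives a block identifier from a description of where the data lives, so the data never has to be read to name it. This is not the scheme specified in #21936: the function name `external_block_id`, the fields it hashes (URL, byte range, and a version tag such as an ETag), and the `digest+size` output shape are all assumptions made for the example.

```python
import hashlib


def external_block_id(url: str, offset: int, length: int, version_tag: str = "") -> str:
    """Hypothetical helper: name a block by its external location, not its content.

    The hashed fields (URL, byte range, version tag) are illustrative
    assumptions, not the metadata chosen in issue #21936.
    """
    # Build a stable text description of the external byte range.
    descriptor = "\n".join([url, str(offset), str(length), version_tag])
    # Hash the description only; the external data itself is never read.
    digest = hashlib.md5(descriptor.encode("utf-8")).hexdigest()
    # Append the size, loosely mirroring a "hash+size" locator shape.
    return f"{digest}+{length}"


if __name__ == "__main__":
    # Same location and metadata always yield the same identifier.
    print(external_block_id("s3://example-bucket/sample.bam", 0, 1 << 26, "etag-v1"))
    # A changed version tag yields a different identifier for the same range.
    print(external_block_id("s3://example-bucket/sample.bam", 0, 1 << 26, "etag-v2"))
```

One consequence of this kind of scheme is that the identifier is only as trustworthy as the metadata: if the external object changes without its version tag changing, the identifier would still match even though the bytes differ, which is a trade-off content hashes do not have.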


Related issues

Related to Arvados - Feature #8570: [Crunch2] Impure access to object store (New)
Related to Arvados - Feature #8569: [Crunch2] Impure mount from host fs (New)
Related to Arvados - Idea #17348: Example workflow template which streams data from S3 in first step, does some computation steps, and uploads results back to S3. (New)
Related to Arvados - Idea #21936: Minimum viable external data access feature (New)