Idea #15960

open

Computing on external data

Added by Peter Amstutz almost 5 years ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Target version: -
Start date: 08/01/2024
Due date: 03/31/2025 (Due in about 3 months)
Story points: -
Release:
Release relationship: Auto

Description

Right now, the automatic HTTP download feature in cwl-runner effectively fulfills this function for users, although it copies the data into the local keepstore. Users would probably like it to be expanded to also support copying from s3:// URLs.

However, the big idea for this epic is on-demand retrieval from external storage: data is fetched from the external system only when it is actually needed.
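As a rough illustration of what on-demand retrieval means in practice, the sketch below reads byte ranges lazily and caches them by block. It is not Arvados code: the class `OnDemandFile`, the `fetch_range` callback, and the 1 MiB default block size are all hypothetical names and values chosen for the example. In a real system `fetch_range` might issue an HTTP Range request or an S3 ranged GET.

```python
from typing import Callable, Dict


class OnDemandFile:
    """Hypothetical lazy reader: nothing is fetched until read() needs it."""

    def __init__(self, size: int,
                 fetch_range: Callable[[int, int], bytes],
                 block_size: int = 1 << 20):
        self.size = size
        # fetch_range(start, length) -> bytes; assumed to talk to the
        # external system (HTTP, S3, ...). Supplied by the caller.
        self.fetch_range = fetch_range
        self.block_size = block_size
        self._cache: Dict[int, bytes] = {}

    def _block(self, index: int) -> bytes:
        # Fetch a block the first time it is touched, then reuse it.
        if index not in self._cache:
            start = index * self.block_size
            length = min(self.block_size, self.size - start)
            self._cache[index] = self.fetch_range(start, length)
        return self._cache[index]

    def read(self, offset: int, length: int) -> bytes:
        end = min(offset + length, self.size)
        out = bytearray()
        pos = offset
        while pos < end:
            index, within = divmod(pos, self.block_size)
            chunk = self._block(index)[within:within + (end - pos)]
            if not chunk:
                break  # short read from the external system; stop early
            out += chunk
            pos += len(chunk)
        return bytes(out)
```

Only the blocks that overlap a requested range are ever transferred, which is the contrast with copying the whole object into local storage up front.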

Previous designs involved reading all the data to generate content hashes.

The current design is outlined in https://dev.arvados.org/issues/21936 and involves storing locators to external data in the manifest. The block identifiers are based on hashing the locator (and other metadata) instead of the content.
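To make the difference from content hashing concrete, here is a minimal sketch of deriving a block identifier from a locator plus metadata. The actual identifier format is defined in the linked issue and is not reproduced here; the function name `external_block_id`, the chosen fields (`url`, `offset`, `size`, `etag`), the use of MD5, and the `+size` suffix are all assumptions made for illustration.

```python
import hashlib
import json


def external_block_id(url: str, offset: int, size: int, etag: str = "") -> str:
    """Hypothetical: identify a block by where it lives, not by its bytes."""
    # Serialize the descriptor deterministically so the same locator and
    # metadata always produce the same identifier.
    descriptor = json.dumps(
        {"url": url, "offset": offset, "size": size, "etag": etag},
        sort_keys=True,
    )
    digest = hashlib.md5(descriptor.encode("utf-8")).hexdigest()
    # Assumed layout: hex digest followed by the block size.
    return f"{digest}+{size}"
```

The identifier can be computed without reading a single byte of the external data, which is what avoids the full read that earlier designs needed. Including a version marker such as an ETag (an assumption here) would make the identifier change when the remote object changes.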


Related issues 4 (4 open, 0 closed)

Related to Arvados - Feature #8570: [Crunch2] Impure access to object store (New)
Related to Arvados - Feature #8569: [Crunch2] Impure mount from host fs (New)
Related to Arvados - Idea #17348: Example workflow template which streams data from S3 in first step, does some computation steps, and uploads results back to S3. (New)
Related to Arvados - Idea #21936: Minimum viable external data access feature (New)