Idea #10388

open

Request collections that don't (yet) exist via fuse interface

Added by Joshua Randall over 7 years ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version:
Start date:
Due date:
Story points: -
Release:
Release relationship: Auto

Description

The fuse interface already supports several ways to access collections besides the low-level portable data hash (pdh). Accessing a collection by uuid rather than by pdh requires the fuse client to contact the API server for the collection's current pdh, and to keep up to date on changes to that pdh over time.
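To make the uuid-to-pdh bookkeeping concrete, here is a minimal sketch of a cache that re-checks a uuid's pdh after a fixed interval. It is an illustration only, not how the fuse client is implemented: `PdhCache`, `fetch_pdh`, `max_age`, and `clock` are hypothetical names, and `fetch_pdh` stands in for whatever call the client makes to the API server.

```python
import time
from typing import Callable, Dict, Tuple


class PdhCache:
    """Cache uuid -> pdh lookups and re-check them after max_age seconds."""

    def __init__(self, fetch_pdh: Callable[[str], str], max_age: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # fetch_pdh is a hypothetical stand-in for the API server lookup.
        self._fetch_pdh = fetch_pdh
        self._max_age = max_age
        self._clock = clock
        # uuid -> (pdh, time the pdh was fetched)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, uuid: str) -> str:
        """Return the cached pdh, refetching it once the entry is too old."""
        now = self._clock()
        cached = self._entries.get(uuid)
        if cached is not None and now - cached[1] < self._max_age:
            return cached[0]
        pdh = self._fetch_pdh(uuid)
        self._entries[uuid] = (pdh, now)
        return pdh
```

Polling on a timer is only one way to stay current. A client could equally subscribe to change notifications if the server offers them. The sketch just shows why access by uuid costs more than access by pdh: a pdh never changes, while a uuid has to be re-resolved.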

As a user, it would be helpful to have a mechanism within the fuse mount by which I can request a collection that does not yet exist but that can be generated by some entity in the system (pipeline instance, job, or possibly even pipeline template + configuration).

This mechanism could be useful in a number of contexts. A particularly useful one would be "just-in-time" transcoding between formats, or performing relatively simple operations on existing data.

For example, I could imagine storing variant data in Keep in a compact format such as BCF, or even in a structured relational system such as lightning db. However, users may want to access this data in VCF format. Rather than having to manually create a pipeline to convert the data from its stored format to VCF, it would be useful to be able to access it by means of a fuse path such as:

/keep/pipeline_template/convert_variants/input=5a0e057e83846a5ea9a6d8eebe3c1508+875474:input.bcf/output_format=VCF/output.vcf.gz

My expectation would be that attempting to access the above file in the fuse mount would:
- create and run a pipeline_instance from the "convert_variants" pipeline_template, with the parameters "input" and "output_format" set as given in the path
- block any reads on the file until data becomes available (ideally, at some point in the future, a partially completed output collection could also be streamed as each block is committed, but that should probably be out of scope for the initial implementation)
- make the output collection eligible for garbage collection by default (i.e. mark the collection as intermediate or ephemeral, set replication_desired to 0, or don't save the output pdh to a collection at all)
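The first two expectations above can be sketched as follows. This is only an illustration of the proposed behaviour, not an existing Arvados API: `parse_request`, `TemplateRequest`, and `PendingOutput` are hypothetical names, and the sketch assumes that the first path component after the prefix is the template name, the last is the output file, and everything in between is a `key=value` parameter (so parameter values cannot contain `/`).

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional

# Prefix taken from the example path in this request.
PREFIX = "/keep/pipeline_template/"


@dataclass(frozen=True)
class TemplateRequest:
    template: str            # pipeline_template name, e.g. "convert_variants"
    params: Dict[str, str]   # parameters parsed from key=value components
    output_file: str         # file requested from the output collection


def parse_request(path: str) -> TemplateRequest:
    """Split a fuse path into template name, parameters, and output file."""
    if not path.startswith(PREFIX):
        raise ValueError(f"not a pipeline_template path: {path}")
    parts = path[len(PREFIX):].split("/")
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"malformed path: {path}")
    template, *param_parts, output_file = parts
    params: Dict[str, str] = {}
    for part in param_parts:
        key, sep, value = part.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got: {part}")
        params[key] = value
    return TemplateRequest(template, params, output_file)


class PendingOutput:
    """Output data that readers block on until the pipeline publishes it."""

    def __init__(self) -> None:
        self._ready = threading.Event()
        self._data: Optional[bytes] = None

    def publish(self, data: bytes) -> None:
        """Called once the pipeline has produced the output."""
        self._data = data
        self._ready.set()

    def read(self, timeout: Optional[float] = None) -> bytes:
        """Block until data is available, or raise TimeoutError."""
        if not self._ready.wait(timeout):
            raise TimeoutError("output not available yet")
        assert self._data is not None
        return self._data
```

For the example path, `parse_request` yields the template `convert_variants`, the parameters `input` and `output_format`, and the output file `output.vcf.gz`. A fuse read handler would then call `PendingOutput.read()`, which returns only after the job that runs the pipeline has called `publish()`. Streaming partial output and the garbage-collection default are not covered by this sketch.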

#1

Updated by Tom Morris over 6 years ago

  • Target version set to Arvados Future Sprints
#2

Updated by Ward Vandewege almost 3 years ago

  • Target version deleted (Arvados Future Sprints)
#3

Updated by Peter Amstutz about 1 year ago

  • Release set to 60
#4

Updated by Peter Amstutz about 2 months ago

  • Target version set to Future