Story #10388

Request collections that don't (yet) exist via fuse interface

Added by Joshua Randall over 2 years ago. Updated almost 2 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Start date:
10/27/2016
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

The fuse interface already supports a number of different ways to access collections aside from the low-level portable data hash. Accessing a collection by uuid rather than pdh requires the fuse client to contact the API server to get the current pdh for a collection, and to keep up to date on changes to that pdh over time.

As a user, it would be helpful to have a mechanism within the fuse mount by which I can request a collection that does not yet exist but that can be generated by some entity in the system (pipeline instance, job, or possibly even pipeline template + configuration).

This mechanism could be useful in a number of contexts, but one particularly useful one would be in "just-in-time" transcoding between formats, or performing relatively simple operations on existing data.

For example, I could imagine storing variant data in keep in a compact format such as BCF or even in a structured relational system such as lightning db. However, users may want to access this data in VCF format. Rather than having to manually create a pipeline to convert the data from its stored format to VCF, it would be useful to be able to access it by means of a fuse path such as:

/keep/pipeline_template/convert_variants/input=5a0e057e83846a5ea9a6d8eebe3c1508+875474:input.bcf/output_format=VCF/output.vcf.gz

My expectation would be that attempting to access the above file in the fuse mount would:
- create and run a pipeline_instance using the "convert_variants" pipeline_template as a template with the parameters "input" and "output_format" set as given
- block on any reads on the file until data becomes available (ideally at some point in the future streaming of a partially completed output collection would also be possible as each block is committed, but that should probably be out of scope for the initial implementation)
- make the default for the output collection to be garbage collected (i.e. mark the collection as intermediate or ephemeral, set replication_desired to 0, or don't even save the output pdh to a collection at all)

History

#1 Updated by Tom Morris almost 2 years ago

  • Target version set to Arvados Future Sprints

Also available in: Atom PDF