Sprint Review: the Pipeline Factory
Review of the May 28 engineering sprint on the Curoverse Arvados genomic analysis platform.
Dear Arvadans near and far:
Our May development sprint yielded lots of new features for you! This month, we focused on enhancing the Arvados pipeline factory. Pipelines are at the heart of the Arvados workflow, and we know how important it is for users to have an intuitive, natural process for managing both pipelines and the data collections that they run on. To that end, we're really excited about the new pipeline and collection tools we have to offer you:
Names and Projects. You can now give names to pipelines, collections, templates and specimens, and group them into projects. This way, you can choose your own organizational scheme for finding and managing data, and can easily keep track of large amounts of data, even hundreds of data sets and dozens of projects.
Deleting pipeline instances. It's important to have a way of managing pipeline instances that are no longer needed. Now, once you're done with a pipeline instance, you can delete it -- without harming the provenance graph for data generated by that instance. You still have full data provenance for all of your results.
Streamlined data upload. Uploading data to Keep has always been a little cumbersome: you had to upload it to a staging server first and then copy it into Keep. For uploading gigabytes and terabytes of data, we knew we would need something better, and now we have it: a Keep upload proxy which receives uploads directly from your workstation or local machine, and stores them on a Keep server (with multiple replicas, if desired).
One of the other complications with large-scale data upload is how to handle interrupted uploads. To that end, the arv-put command now keeps track of the state of an upload in progress, and if restarted after an interruption, it will resume from where it left off. You no longer need worry about having to start an 18-hour upload from scratch!
Improved collections view. We have added lots of information to the collections view page, including:
- Metadata and provenance
- Links from a job result to the job that produced it
- Permission and sharing information: who's allowed to read a collection
- Files in a collection displayed in a tree
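For the curious, here is a rough illustration of the resume-after-interruption idea mentioned above for arv-put. It is only a minimal sketch of checkpointed copying between local files, not arv-put's actual implementation: the function name `resumable_copy`, the JSON state file, and the `max_chunks` parameter (used to simulate an interruption) are all hypothetical.

```python
import json
import os


def resumable_copy(src, dst, state_path, chunk_size=4, max_chunks=None):
    """Copy src to dst in chunks, checkpointing progress in state_path.

    Returns True once the whole file has been copied, False if the run
    stopped early (max_chunks is a hypothetical knob that simulates an
    interrupted upload).
    """
    # Pick up the offset saved by a previous run, if there is one.
    offset = 0
    if os.path.exists(state_path) and os.path.exists(dst):
        with open(state_path) as f:
            offset = json.load(f)["offset"]

    # Reopen the destination for update when resuming, else start fresh.
    mode = "r+b" if offset else "wb"
    sent = 0
    with open(src, "rb") as fin, open(dst, mode) as fout:
        fin.seek(offset)
        fout.seek(offset)
        fout.truncate()  # drop any bytes written past the last checkpoint
        while max_chunks is None or sent < max_chunks:
            chunk = fin.read(chunk_size)
            if not chunk:
                # Nothing left to send: clear the checkpoint and finish.
                if os.path.exists(state_path):
                    os.remove(state_path)
                return True
            fout.write(chunk)
            fout.flush()
            offset += len(chunk)
            sent += 1
            # Record how far we got so a restart can continue from here.
            with open(state_path, "w") as f:
                json.dump({"offset": offset}, f)
    return False
```

Calling `resumable_copy` a second time with the same arguments after an interrupted run continues from the saved offset instead of starting over, which is the behavior that matters when an upload takes many hours.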
Real-time log updates. You can now watch the log output from a pipeline while the pipeline is still executing.
We're charging ahead with the next milestone and are incredibly excited about bringing you Arvados 1.0 later this year. In the meantime, please always feel free to leave us your feedback via e-mail or IRC!