Alexander Wait Zaranek
On June 20th, 2013, Arvados was recognized by the White House Office of Science and Technology Policy at its 'Open Science' Champions of Change event. After evaluating several hundred nominations, the OSTP had selected the Arvados Project as one of 12 posters to present at the Eisenhower Executive Office Building.
We’re psyched the OSTP honored the contribution to open science that is being made by the Arvados project. Genomics was featured prominently at the event, and the recognition that open source software will be a key foundation to open science efforts and personalized medicine was very cool.
The poster is available for download here.
Arvados was represented by three of us: Dr. Alexander Wait Zaranek, Ward Vandewege, and Jonathan Sheffi. Here are a few pictures of us at the White House:
We’re excited to be launching the Arvados Project website today. Arvados is born out of seven years of research and technology development to manage and analyze genomic data.
In 2006, we first started thinking about the genomic big data problem when we began planning the Personal Genome Project informatics system. The question we asked was simple: “How do we create a modern computing platform for managing and analyzing an exabyte of genomic data distributed in datacenters around the world?”
We quickly decided we needed to abandon the traditional high-performance computing architecture that combines network-attached storage with a compute cluster and a batch-queuing system like Sun Grid Engine. There were simply too many problems with scaling and using this kind of system, cost being one of the largest.
So we turned to innovations in the web industry (especially from Google) for approaches to large-scale distributed computing with big data sets. These strategies included leveraging very low-cost commodity hardware, virtualization, distributed object storage, MapReduce, in-memory databases, and other similar technologies. Then we focused on how to apply these technology strategies to the unique requirements of genomic data and biomedical computing.
After several years of development, the result is a system that we’re now using at the Personal Genome Project to power two clusters with more than 300 TB of storage and 500 cores.
The system uses a horizontally scaling hardware architecture with uniform nodes that combine storage and compute on each node, which means it’s easy to start small and grow to very large scale. It’s also designed to facilitate distributed computations across clusters running in different datacenters.
Now we’re taking all the software we’ve built for the PGP and releasing it as Arvados under the AGPLv3 license. Looking forward, we see Arvados as a platform that can run on top of cloud operating systems such as OpenStack, Amazon Web Services and others. It provides a common set of services and APIs for managing omic data and running pipelines that analyze that data.
In order to realize the promise of genomics and precision medicine, we need to take the core storage, data management, computation, data sharing, and distributed computing layer of the biomedical computing stack and make it more consistent across organizations, which is why we created the Arvados.org project. Working together we’ll utilize our limited resources as a community more effectively, lower operating costs, and enable sharing of data and applications across organizations.
We know that we don’t have all the answers, but we think we have a very good foundation to build on. In the same way that other open source projects have helped to establish common infrastructure in other industries, we’re hoping Arvados will become a platform the biomedical community can build on, extend, and adapt to a new world where we’re using petabytes and ultimately exabytes of genomic and other biomedical data to deliver better patient care.
Everyone on the project looks forward to your comments, critiques and contributions.
Adam Berrey (CFI) kicks off the Summit
The Fall 2013 Arvados Summit took place in Cambridge, MA on Tuesday, October 22nd, from 1pm to 5pm. Over 30 people attended, including bioinformaticians, developers, clinicians, commercial leaders, pipeline providers, and tool developers.
The Summit kicked off with project updates from Adam Berrey & Tom Clegg as well as breakout discussion sessions to address emerging needs for the platform. Participants also addressed the nascent Arvados Foundation as well as the new Lightning project, which will enable extremely fast variant queries through a high-performance in-memory compact genome database.
Sasha Zaranek (CFI) presenting Arvados Lightning
Additionally, the Arvados community welcomed pipeline developer presentations from not only academic labs, such as the Harvard School of Public Health and the Whitehead Institute, but also commercial developers, such as Real Time Genomics and Cypher Genomics. Discussion focused on how pipelines can run faster and more efficiently on the Arvados platform.
Francisco De La Vega (RTG)
Melissa Gymrek (Whitehead) presenting the combination of lobSTR & Arvados
Phillip Pham (Cypher) presenting on the benefits of Arvados to pipeline developers
Details of the Summit, including outputs of the breakout sessions and more photos, are at https://arvados.org/projects/arvados/wiki/Arvados_Summit_-_Fall_2013
Melissa Gymrek from the Erlich Lab at the Whitehead Institute spoke at the Fall 2013 Arvados Summit on the advantages of building tools like lobSTR on the Arvados platform. Watch!
Direct link: https://www.youtube.com/watch?v=PRd1OsFbVM0
Clinical Future today started an invitation-only alpha test of a Platform-as-a-Service (PaaS) based on an Arvados release. You can request an invitation here: http://bit.ly/CuroverseBetaApp
The Curoverse engineering team has been hard at work since October’s user summit, adding requested features and fixing bugs. We’re proud to announce that the latest version of Arvados has been deployed as a platform-as-a-service (PaaS). If you want to try the technology out without installing it yourself, you can apply for the beta.
Just a few of the changes that users can look forward to:
Better development tools.
We have made many improvements to the Python SDK for writing Arvados apps, and beefed up the documentation. Our online documentation of the Arvados API is now almost complete, and we have filled in the gaps in our online tutorial for getting started with Arvados. We are excited to be able to provide toolkits for samtools, bwa, picard and GATK2 pipelines, including example code to help you get started.
Cleaner user interface.
The front page of the Workbench includes information on jobs, pipelines and collections at your fingertips. The entire Workbench has been optimized to load more quickly and be more responsive to user input.
Work is nearly complete on packaging Arvados as a self-contained system that can be installed on a single machine—even a laptop!—so informaticians and IT directors can more easily try Arvados in a private environment. To that end, we have also expanded the documentation for installing and configuring Arvados.
Reviewing and managing Crunch jobs is much easier. Some of the features we have added to help informaticians keep track of their Crunch tasks include:
- Watch job output in real time with new API calls
- See number of busy and idle compute nodes
- Cancel jobs easily
The road ahead.
We're already hard at work preparing a formal 1.0 release for later this year. This is a great time to get your feedback and ideas in, so if you have thoughts about what would make Arvados an even better tool, we'd love to hear from you!
Today marks the end of a 3-week development sprint. Here's an overview of the most important, user-visible things that changed.
- Workbench has a new, much improved layout
- It is now possible to compare details and output of two or three pipelines in Workbench
- Workbench now allows viewing of the provenance report for an output. This reveals how the output was produced, where the input data came from, and how available/durable the source and intermediate datasets are.
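To make the idea of a provenance report concrete, here is a minimal sketch in Python. It is not Arvados code and does not use the Arvados API: the `Job` record, the file names, and the `trace_provenance` function are all hypothetical, invented for illustration. It assumes each dataset is produced by at most one job, and it walks backward from an output to find the jobs that produced it and the source inputs it came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical record of one pipeline step (not an Arvados type)."""
    script: str
    inputs: tuple
    output: str


def trace_provenance(output_id, jobs):
    """Return (steps, sources) for output_id.

    steps   -- jobs that contributed to the output, upstream jobs first
    sources -- datasets that no job produced, i.e. the original inputs
    """
    # Assumption: each dataset has at most one producing job.
    producer = {job.output: job for job in jobs}
    steps, sources, seen = [], [], set()

    def visit(data_id):
        if data_id in seen:
            return
        seen.add(data_id)
        job = producer.get(data_id)
        if job is None:
            sources.append(data_id)  # nothing produced it: a source dataset
            return
        for input_id in job.inputs:
            visit(input_id)
        steps.append(job)  # appended after its inputs, so upstream comes first

    visit(output_id)
    return steps, sources


# Illustrative data only; these names are made up.
JOBS = [
    Job("align", ("reads.fastq", "reference.fa"), "aligned.bam"),
    Job("call-variants", ("aligned.bam", "reference.fa"), "variants.vcf"),
]

steps, sources = trace_provenance("variants.vcf", JOBS)
print([job.script for job in steps])  # ['align', 'call-variants']
print(sources)                        # ['reads.fastq', 'reference.fa']
```

A real report would also record how available and durable each dataset is; this sketch covers only the "how was it produced" and "where did the input come from" parts.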
We also made a lot of changes under the hood. Many bugs were fixed, the documentation at http://doc.arvados.org was updated, and we now use Jenkins to run most of our automated tests on every commit.
All in all, this sprint consisted of 180 commits by 4 committers - you can consult the detailed commit log here:
On March 11th, Curoverse is hosting an Arvados demo and office hours for all Boston-area bioinformaticians! Free pizza and beer, and free Curoverse beta accounts for all!
RSVP here so we know how much to order: http://arvados-2014-03.eventbrite.com/
Photo credit: http://www.flickr.com/photos/tamasrepus/9205438254/
Things have been moving fast here at Curoverse in the last few months! We've been working hard to get to Arvados 1.0 and have lots of updates!
We follow an Agile software development model. Our work is organized into three-week sprints, each of which is planned with the goal of having a releasable product when the sprint ends. (Of course, during the development phase it doesn't always work out quite like that -- but we're trying!)
We'll post a summary of development activity here, with a link to the full release notes, at the end of each sprint. Today marks the end of our most recent sprint, which focused on development tools and resource management. Some highlights:
Collections are identified in the Workbench interface via their user-defined tags if available. No more having to copy down long UUID strings! Additionally, all public PGP data and human/trait metadata can now be accessed through Workbench in our public Arvados instance.
We have begun work on our second-generation Keep server. A new version has been written in the Go programming language, which is designed for writing robust, highly concurrent server code like Keep. We are moving forward with performance profiling and adding important features for permission management.
Much work this sprint went into better administrative tools for Arvados. The Arvados administrator now has access to a rich set of tools for user management, permissions and logging. We have also begun implementing the Data Manager to help administrators monitor disk usage for each user and each site.
The full release notes for this sprint can be found on our wiki: Sprint 2014-04-16. Check it out and, as always, give us a shout if you have any questions or suggestions!