Welcome to the Arvados Project
We’re excited to be launching the Arvados Project website today. Arvados is born out of seven years of research and technology development to manage and analyze genomic data.
We first started thinking about the genomic big data problem in 2006, when we began planning the Personal Genome Project informatics system. The question we asked was simple: “How do we create a modern computing platform for managing and analyzing an exabyte of genomic data distributed in datacenters around the world?”
We quickly decided we needed to abandon the traditional high-performance computing architecture that combines network-attached storage with a compute cluster and a batch-queuing system like Sun Grid Engine. There were simply too many problems with scaling and using this kind of system, cost being one of the largest.
So we turned to innovations in the web industry (especially from Google) for approaches to large-scale distributed computing with big data sets. These strategies included leveraging very low-cost commodity hardware, virtualization, distributed object storage, MapReduce, in-memory databases, and other similar technologies. Then we focused on how to apply these technology strategies to the unique requirements of genomic data and biomedical computing.
After several years of development, the result is a system that we’re using now at the Personal Genome Project to power two clusters with more than 300 TB of storage and 500 cores.
The system uses a horizontally scaling hardware architecture with uniform nodes that combine storage and compute on each node, which means it’s easy to start small and grow to very large scale. It’s also designed to facilitate distributed computations across clusters running in different datacenters.
Now we’re taking all the software we’ve built for the PGP and releasing it as Arvados under the AGPLv3 license. Looking forward, we see Arvados as a platform that can run on top of cloud operating systems such as OpenStack, Amazon Web Services and others. It provides a common set of services and APIs for managing omic data and running pipelines that analyze that data.
In order to realize the promise of genomics and precision medicine, we need to take the core storage, data management, computation, data sharing, and distributed computing layer of the biomedical computing stack and make it more consistent across organizations. That is why we created the Arvados.org project. Working together, we’ll use our limited resources as a community more effectively, lower operating costs, and enable sharing of data and applications across organizations.
We know that we don’t have all the answers, but we think we have a very good foundation to build on. In the same way that other open source projects have helped to establish common infrastructure in other industries, we’re hoping Arvados will become a platform the biomedical community can build on, extend, and adapt to a new world where we’re using petabytes and ultimately exabytes of genomic and other biomedical data to deliver better patient care.
Everyone on the project looks forward to your comments, critiques and contributions.