Federation and Data Sharing
One of the core challenges for genomic research and precision medicine is data sharing. In a world where each organization runs its own private clouds and leverages public clouds, it's inevitable that we all face the challenge of how to share data across data centers and clouds.
Today, the state of the art is to physically move the data, usually by shipping disks or, in some cases, through high-bandwidth network connections. This approach has a variety of problems:
- Network traffic can be expensive and slow
- It is difficult or impossible to verify the security of the data once it has been shipped to another data center
- Disk drives are fragile, and often arrive unusable
- Shipping drives is impractical at large scale
Several alternatives are being explored to solve the data sharing problem, but each brings problems of its own:
- Centralized Resource - One approach is to put all data in a central location such as a public cloud provider. This sounds good on paper, but many organizations will choose to keep their data on premises, so it cannot meet every need. The industry is also unlikely to standardize on a single provider.
- Advanced Networks - There are a number of new and existing technologies that are designed to increase network performance and optimize large file transfers. These approaches can help, but they do not come close to a complete solution to the problems outlined above.
We believe that a better alternative is to federate private and public cloud instances, and move the applications between clouds instead of transferring the data. Arvados is designed to make that possible.
One of the core design goals of Arvados is the ability to federate Arvados clusters running in different data centers. When two clusters are federated, it's possible to replicate selected portions of the metadata database that describe the data residing on the different clusters.
Once metadata has been replicated, it becomes straightforward to define a collection of files that reside on multiple servers. The content addressing in Keep includes a mechanism for referencing data stored remotely, so it's possible to keep the content addresses globally unique and to identify the clusters where the files are stored.
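The idea of a content address that stays globally unique while also naming the cluster that stores the data can be sketched in a few lines of Python. This is a minimal illustration, not Arvados's actual locator format: it assumes an address built from the MD5 digest and byte length of a block, and the `+K@<cluster>` hint syntax, the cluster IDs, and all function names here are hypothetical.

```python
import hashlib
from typing import NamedTuple, Optional


class Locator(NamedTuple):
    digest: str             # MD5 hex digest of the block contents
    size: int               # block length in bytes
    cluster: Optional[str]  # hypothetical hint: cluster that stores the block


def make_locator(data: bytes, cluster: Optional[str] = None) -> str:
    """Build a content address, optionally tagged with a cluster hint."""
    base = f"{hashlib.md5(data).hexdigest()}+{len(data)}"
    # The "+K@" hint syntax is an assumption made for this sketch.
    return f"{base}+K@{cluster}" if cluster else base


def parse_locator(text: str) -> Locator:
    """Split a locator into its digest, size, and optional cluster hint."""
    digest, size, *hints = text.split("+")
    cluster = None
    for hint in hints:
        if hint.startswith("K@"):
            cluster = hint[2:]
    return Locator(digest, int(size), cluster)


def same_content(a: str, b: str) -> bool:
    """Locators name the same data if digest and size match, wherever it is stored."""
    pa, pb = parse_locator(a), parse_locator(b)
    return (pa.digest, pa.size) == (pb.digest, pb.size)
```

Because the digest and size depend only on the bytes, `make_locator(block, "aaaaa")` and `make_locator(block, "bbbbb")` describe the same content held on two hypothetical clusters, and `same_content` treats them as equal. The hint only tells a client where to fetch the block from.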
When Arvados runs a pipeline that accesses data stored across multiple clusters, the platform can automatically run the jobs and tasks on the clusters where the relevant portions of data reside, and optionally collect the output data (which is often much smaller) in one location. For the informatician, this is just as easy and reliable as running the analysis on a single cluster.
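The scheduling idea described here, sending the computation to the data and gathering only the outputs, can be sketched as follows. This is an illustration under stated assumptions rather than the Arvados scheduler: the mapping from file names to cluster IDs, the `task` callback, and the names `plan_tasks` and `run_federated` are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def plan_tasks(file_locations: Dict[str, str]) -> Dict[str, List[str]]:
    """Group input files by the cluster that stores them."""
    plan: Dict[str, List[str]] = defaultdict(list)
    for name, cluster in sorted(file_locations.items()):
        plan[cluster].append(name)
    return dict(plan)


def run_federated(
    file_locations: Dict[str, str],
    task: Callable[[str, List[str]], str],
) -> List[str]:
    """Run one task per cluster on its local files and collect the outputs.

    `task` stands in for remote job submission; in this sketch it is an
    ordinary local function that receives a cluster ID and a file list.
    """
    plan = plan_tasks(file_locations)
    return [task(cluster, files) for cluster, files in sorted(plan.items())]


# Hypothetical inputs: three files spread over two clusters.
inputs = {"a.bam": "aaaaa", "b.bam": "bbbbb", "c.bam": "aaaaa"}
summaries = run_federated(inputs, lambda cluster, files: f"{cluster}:{len(files)}")
```

Only the per-cluster summaries travel back to the caller, which mirrors the point that output data is often much smaller than the inputs it was computed from.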
We have demonstrated that this federation model can work using the two clusters that we run for the Personal Genome Project. More work is needed to complete the implementation: tracking compute usage at each cluster for chargeback, creating a centralized brokerage service to manage federation, and developing security and governance policies that can be checked automatically against shared transactions.
We believe federation can unlock the potential of exabyte-scale datasets distributed across clusters around the world. It can help researchers ask some of the most exciting research questions, accelerate collaboration, and diagnose rare and difficult medical cases.