Anonymous, 04/12/2013 04:56 PM

h1. Federation and Data Sharing

One of the core challenges for genomic research and precision medicine is data sharing. When each organization runs its own private cloud and also uses public clouds, everyone eventually faces the problem of sharing data across data centers and clouds.

Today, the state of the art is to physically move the data, usually by shipping disks or, in some cases, over high-bandwidth network connections. This approach has a variety of problems:

* Network traffic can be expensive and slow.
* It is difficult or impossible to verify the security of the data once it has been shipped to another data center.
* Disk drives are fragile and often arrive unusable.
* Shipping drives is impractical at large scale.

Several alternatives are being explored to answer the data sharing question, but each brings its own problems:

* *Centralized Resource* - One approach is to put all data in a central location, such as a public cloud provider. This sounds good on paper, but many organizations will choose to keep their data on premises, so it cannot address every need. The industry is also unlikely to standardize on a single provider.
* *Advanced Networks* - A number of new and existing technologies are designed to increase network performance and optimize large file transfers. These approaches can help, but they do not come close to a complete solution to the problems outlined above.

We believe a better alternative is to federate private and public cloud instances and move the applications between clouds instead of transferring the data. Arvados is designed to make that possible.

h2. Federated Clouds

One of the core design goals of Arvados is the ability to federate Arvados clusters running in different data centers. When two clusters are federated, it is possible to replicate selected portions of the metadata database that describe the data residing on each cluster.

Once metadata has been replicated, it becomes straightforward to define a collection of files that reside on multiple servers. The content addressing in Keep includes a mechanism for referencing data stored remotely, so content addresses remain globally unique while also identifying the clusters where the files are stored.
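
The idea of a content address that stays globally unique while also carrying a location hint can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the use of SHA-256, and the @+K@cluster@ hint syntax are hypothetical and should not be read as Keep's actual locator format.

```python
import hashlib


def make_locator(data: bytes, cluster_id: str) -> str:
    """Build a hypothetical locator: content hash, size, and a cluster hint."""
    digest = hashlib.sha256(data).hexdigest()
    # The "+K@<cluster>" suffix is an illustrative hint syntax, not a documented one.
    return f"{digest}+{len(data)}+K@{cluster_id}"


def parse_locator(locator: str) -> dict:
    """Split a locator into its content-derived part and its location hints."""
    digest, size, *hints = locator.split("+")
    clusters = [h[2:] for h in hints if h.startswith("K@")]
    return {"digest": digest, "size": int(size), "clusters": clusters}


def content_key(locator: str) -> str:
    """Return the part of the locator that depends only on the data itself."""
    parsed = parse_locator(locator)
    return f"{parsed['digest']}+{parsed['size']}"
```

Under these assumptions, the same bytes stored on two clusters yield the same @content_key@, and only the hint differs, so a collection can list blocks from several clusters without any risk of address collisions.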

When Arvados runs a pipeline that accesses data stored across multiple clusters, the platform can automatically run the jobs and tasks on the clusters where the relevant portions of the data reside and, optionally, collect the output data (which is often much smaller) in one location. For the informatician, this is just as easy and reliable as running the analysis on a single cluster.
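
The scheduling idea above, running work where the data lives and gathering only the outputs, can be sketched roughly as follows. This is not Arvados code: @plan_tasks@, @run_federated@, and the @run_on_cluster@ callback are hypothetical names, and a real scheduler would also handle failures, permissions, and blocks replicated on more than one cluster.

```python
from collections import defaultdict


def plan_tasks(blocks):
    """Group data blocks by the cluster that stores them.

    blocks is an iterable of (block_key, cluster_id) pairs.
    """
    plan = defaultdict(list)
    for key, cluster in blocks:
        plan[cluster].append(key)
    return dict(plan)


def run_federated(blocks, run_on_cluster):
    """Run one task per cluster, next to its data, and collect the outputs.

    run_on_cluster(cluster_id, keys) is a caller-supplied stand-in for
    dispatching a task to a remote cluster; it returns that task's output.
    """
    plan = plan_tasks(blocks)
    return {cluster: run_on_cluster(cluster, keys) for cluster, keys in plan.items()}
```

Only the per-cluster outputs travel back to the caller in this sketch, which mirrors the point that results are often much smaller than the inputs.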

We have demonstrated that this federation model can work, using the two clusters that we run for the Personal Genome Project. More work is needed to complete the implementation, including tracking compute usage at different clusters for chargeback, creating a centralized brokerage service to manage federation, and developing security and governance policies that can be automatically checked against shared transactions.

We believe federation can unlock the potential of exabyte-scale datasets distributed across clusters around the world. It can be used to ask some of the most exciting research questions, accelerate collaboration, and diagnose rare and difficult medical cases.