h1. Introduction to Arvados

h2. Overview

Arvados is a platform for storing, organizing, processing, and sharing genomic and other biomedical big data. The platform is designed to make it easier for bioinformaticians to develop analyses, for developers to create web applications that work with genomic data, and for IT administrators to manage large-scale compute and storage resources for genomic data. The platform is designed to run on top of "cloud operating systems" such as Amazon Web Services and OpenStack. Currently, there are implementations that work on AWS and on Xen+Ubuntu.

h2. Project

The core technology has been under development at Harvard Medical School for many years. We are now in the process of refactoring the original code, refactoring the APIs, and developing significant new capabilities.

h2. Why Arvados

A set of relatively low-level compute and data management functions is consistent across the wide range of analysis pipelines and applications being built for genomic data. Unfortunately, every organization working with these data has been forced to build its own custom systems for these low-level functions. At the same time, proprietary platforms are emerging that seek to solve the same problems. Arvados was created to provide a common solution that could be used across a wide range of applications and that would be free and open source.

h2. Benefits

The Arvados platform seeks to solve a set of common problems that face informaticians and IT organizations.

Benefits to informaticians:

* Make authoring analyses and constructing pipelines in any language as efficient as possible
* Provide an environment that can run open source and commercial tools (e.g. Galaxy, GATK)
* Enable deep provenance and reproducibility across all pipelines
* Provide a way to flexibly organize data and ensure data integrity
* Make queries of variant and other compact genome data very high-performance
* Create a simple way to run distributed batch processing jobs
* Enable the secure sharing of data sets from small to very large
* Provide a set of common APIs that enable application and pipeline portability across systems
* Offer a reference environment for the implementation of standards
* Standardize file format translation

Benefits to IT organizations:

* Low total cost of ownership
* Eliminate unnecessary data duplication
* Ability to create private, on-premises clouds
* Self-service provisioning of resources
* Ability to utilize low-cost, off-the-shelf hardware
* Easy-to-manage, horizontally scaling architecture
* Straightforward browser-based administration
* Provide facilities for hybrid (public and private) clouds
* Ensure full compliance with security and regulatory standards
* Support data sets from tens of terabytes to exabytes

h2. Functional Capabilities

Functionally, Arvados has two major sets of capabilities: (a) data management and (b) compute management.

h3. Data Management

The data management services are designed to handle the challenges associated with storing and organizing large omic data sets. The heart of these services is the Data Manager, which brokers data storage. The data management system is designed to handle the following needs (a sketch of what this might look like from client code follows the list):

* Store files (e.g. BAM, FASTQ, VCF) in a reliable way
* Store metadata about files for a wide variety of organizational schemas
* Create collections (sets of files) that can be used in analyses
* Ensure files are not unnecessarily duplicated
* Maintain provenance on files (be able to identify their origin)
* Secure access to files
* Translate files between formats
* Make it easy to access files
* Leverage large arrays of distributed commodity drives for reliable storage
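
As a concrete illustration, here is a minimal sketch of what storing and retrieving a collection might look like through the planned Python SDK. The SDK is still on the roadmap, so the module name @arvados@ and every call shown here are assumptions for illustration, not a committed API.

<pre><code class="python">
# A minimal sketch, assuming a hypothetical Python SDK; all names here
# are illustrative, not a committed API.
import arvados

# Write two files into a new collection.  The locator that comes back is
# a content address (a hash of the data), which is what lets the system
# detect duplicates and verify integrity on every later read.
writer = arvados.CollectionWriter()
writer.write_file('reads.fastq')
writer.write_file('alignments.bam')
locator = writer.finish()

# Any authorized client can later fetch the same data by its locator.
reader = arvados.CollectionReader(locator)
for filename in reader.filenames():
    print(filename)
</code></pre>

Because collections are addressed by a hash of their contents, two users who store the same file end up referencing the same underlying blocks, which is how unnecessary duplication is avoided.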

h3. Compute Management

The compute management services are designed to handle the challenges associated with creating and running pipelines as large-scale distributed processing jobs. Their goals are listed below; a sketch of a pipeline description follows the list.

* Enable a common way to represent pipelines (JSON)
* Support the use of any pipeline creation tool
* Keep all pipeline code in a Git repository
* Run pipelines as distributed computations using MapReduce
* Easily retrieve and store data from pipelines in the data management system
* Store a record of every pipeline that is run
* Eliminate the need to re-run pipeline jobs that have already been run
* Make it possible to easily and reliably re-run and verify any past pipeline
* Create a straightforward way to author web applications that use underlying data and pipelines
* Enable easy sharing of pipelines and applications between systems
* Be able to run distributed computations across clusters in different data centers to access very large data sets
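
To make the JSON representation concrete, here is a minimal sketch of a two-step pipeline description, built and printed from Python. The field names (@components@, @script@, @script_version@) and the overall shape are assumptions for illustration; the actual schema is still being defined.

<pre><code class="python">
# A minimal sketch of a pipeline description; the schema shown here is
# an assumption for illustration, not the final format.
import json

pipeline = {
    "name": "exome-alignment-and-calling",
    "components": {
        "align": {
            "script": "bwa-align.py",      # pipeline code lives in Git
            "script_version": "main",      # a Git branch or commit
            "inputs": {"reads": "locator-of-fastq-collection"},
        },
        "call-variants": {
            "script": "gatk-call.py",
            "script_version": "main",
            # Consume the output of the previous step.
            "inputs": {"alignments": {"output_of": "align"}},
        },
    },
}

print(json.dumps(pipeline, indent=2))
</code></pre>

Because a description like this names an exact Git version of each script and exact content addresses for each input, the system can both skip jobs whose results already exist and reliably re-run any past pipeline.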

The compute management system also includes a sub-component for tertiary analysis. This component has not been built yet, but we envision that it will provide an in-memory database for very high-performance queries of a compact representation of a genome that includes variants and other data relevant to tertiary analysis. (Learn more about the [[compact genome]].)

h2. APIs and SDKs

Arvados is designed so that all of the data management and compute management services can be accessed through a consistent set of APIs and interfaces. Most of the functionality is exposed as REST APIs; some components use native interfaces (e.g. Git). Since most informaticians are unfamiliar with REST syntax, there is a roadmap to develop SDKs for specific languages (Python, Perl, Ruby, R, and Java). There is also a command line interface to the system.
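
For a flavor of the REST interface, here is a minimal sketch of listing collections over HTTP from Python. The host name, the @/arvados/v1/collections@ path, the token scheme, and the filter syntax are all assumptions for illustration; consult the API reference once it is published.

<pre><code class="python">
# A minimal sketch of a REST call; the endpoint, auth header, and filter
# syntax are assumptions for illustration, not a documented contract.
import requests

API_HOST = "https://arvados.example.com"   # hypothetical server
TOKEN = "your-api-token"                   # hypothetical credential

resp = requests.get(
    API_HOST + "/arvados/v1/collections",
    headers={"Authorization": "Bearer " + TOKEN},
    # Ask the server for collections whose names start with "exome".
    params={"filters": '[["name", "like", "exome%"]]'},
)
resp.raise_for_status()

for item in resp.json().get("items", []):
    print(item.get("uuid"), item.get("name"))
</code></pre>

The planned language SDKs would wrap calls like this one, so that informaticians never have to construct URLs or filter expressions by hand.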

h2. Dashboard

In addition to the APIs, we are building a browser-based UI that presents key information in the system and provides visual tools for the common activities that are well suited to a graphical interface.

h2. Related Articles

[[Technical Architecture]]
[[Key Components]]