Introduction to Arvados » History » Version 17

Ward Vandewege, 08/29/2018 06:48 PM
First pass at modernizing this document

h1. Introduction to Arvados

h2. Overview

Arvados is a platform for storing, organizing, processing, and sharing genomic and other big data. The platform is designed to make it easier for data scientists to develop analyses, for developers to create genomic web applications, and for IT administrators to manage large-scale compute and storage resources for genomic data. It can run in the cloud or on your own hardware.

h2. Why Arvados

A set of relatively low-level compute and data management functions is common to a wide range of analysis workflows and applications being built for genomic data. Unfortunately, every organization working with these data has been forced to build its own custom systems for these low-level functions. At the same time, proprietary platforms are emerging that seek to solve the same problems. Arvados was created to provide a free and open source solution that is common across this wide range of applications.

h2. Benefits

The Arvados platform seeks to solve a set of common problems faced by informaticians and IT organizations:

Benefits to informaticians:
* Make authoring analyses and running workflows in "Common Workflow Language":https://commonwl.org as efficient as possible
* Provide an environment that can run open source and commercial tools (e.g. Galaxy, GATK)
* Enable deep provenance and reproducibility across all workflows by running all workflow components inside Docker containers
* Provide a way to flexibly organize data and ensure data integrity
* Support very high-performance queries of variant and other compact genome data
* Create a simple way to run distributed batch processing jobs
* Enable the secure sharing of data sets from small to very large
* Provide a set of common APIs that enable application and workflow portability across Arvados installations
* Offer a reference environment for implementation of standards
* Standardize file format translation

Benefits to IT organizations:
* Low total cost of ownership
* Eliminate unnecessary data duplication
* Ability to create private, on-premises clouds
* Self-service provisioning of resources
* Ability to utilize low-cost, off-the-shelf hardware
* Easy-to-manage horizontally scaling architecture
* Straightforward browser-based administration
* Provide facilities for hybrid (public and private) clouds
* Ensure full compliance with security and regulatory standards
* Support data sets from tens of terabytes to exabytes

h2. Functional Capabilities

Functionally, Arvados has two major sets of capabilities: (a) data management and (b) compute management.

h3. Data Management

The data management services are designed to handle the challenges associated with storing and organizing large omic data sets. The heart of these services is [[Keep]], the Arvados storage layer. The data management system is designed to handle the following needs:

* Store files (e.g. BAM, FASTQ, VCF) reliably
* Store metadata about files for a wide variety of organizational schemas
* Create collections (sets of files) that can be used in analyses
* Ensure files are not unnecessarily duplicated
* Track provenance (sources and methods used to produce data)
* Control who can access which files
* Offer reliable distributed storage using inexpensive commodity disks
* Control storage redundancy based on importance of datasets
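
The deduplication goal in the list above can be illustrated with content addressing, where a block is stored under a digest of its bytes so that identical data is written only once. The sketch below is an illustration only: @BlockStore@, @put@, and @get@ are hypothetical names, the in-memory dictionary stands in for real disks, and the MD5-plus-size key format is an assumption made for this example rather than a description of the Keep API.

```python
import hashlib


class BlockStore:
    """Hypothetical in-memory store that keys each block by its content digest."""

    def __init__(self):
        self._blocks = {}  # locator -> bytes

    def put(self, data: bytes) -> str:
        # Assumed locator format for this sketch: "<md5 hex digest>+<size in bytes>".
        locator = f"{hashlib.md5(data).hexdigest()}+{len(data)}"
        # Identical content maps to the same locator, so it is stored only once.
        self._blocks.setdefault(locator, data)
        return locator

    def get(self, locator: str) -> bytes:
        return self._blocks[locator]

    def __len__(self) -> int:
        return len(self._blocks)


store = BlockStore()
first = store.put(b"ACGTACGT")
second = store.put(b"ACGTACGT")  # same bytes, so no new block is stored
third = store.put(b"TTGACCA")
```

Storing the same bytes twice returns the same locator and leaves the number of stored blocks unchanged, which is the property the "not unnecessarily duplicated" requirement asks for.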

h3. Container Orchestration

Arvados contains a robust container orchestration system named [[Crunch]], which is designed to handle the challenges associated with creating and running workflows as large-scale distributed processing jobs.

* Enable a common way to represent workflows in "Common Workflow Language":https://commonwl.org
* Scale compute resources dynamically in supported cloud environments (AWS, Azure, GCP)
* Support the use of any workflow creation tool
* Optionally, store all workflows in a revision control system (git repository)
* Easily and reliably retrieve workflow outputs
* Store a record of every workflow that is run
* Eliminate the need to re-run workflow components that have already been run
* Easily and reliably re-run and verify historical workflows
* Easily share results, workflows, and applications between systems
* [future] Create a straightforward way to author web applications that use underlying data and workflows
* [future] Run distributed computations across clusters in different data centers to make use of very large data sets
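
Avoiding re-runs of components that have already been run can be pictured as a cache keyed by everything that determines a step's result. The following is a minimal sketch, assuming a step is fully described by its container image, command, and input identifiers. The names @fingerprint@, @run_once@, and @fake_runner@ are hypothetical and exist only for illustration; they are not part of Crunch.

```python
import hashlib
import json

_results = {}  # fingerprint -> output of a previously run step


def fingerprint(image: str, command: list, inputs: dict) -> str:
    """Derive a stable key from everything assumed to determine the output."""
    request = json.dumps(
        {"image": image, "command": command, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(request.encode("utf-8")).hexdigest()


def run_once(image, command, inputs, runner):
    """Run a step only if an identical request has not been run before."""
    key = fingerprint(image, command, inputs)
    if key not in _results:
        _results[key] = runner(image, command, inputs)
    return _results[key]


calls = []


def fake_runner(image, command, inputs):
    # Stand-in for real container execution; records each actual run.
    calls.append(command)
    return "output-of-" + "-".join(command)


a = run_once("aligner:1.0", ["align", "sample1"], {"reads": "abc123"}, fake_runner)
b = run_once("aligner:1.0", ["align", "sample1"], {"reads": "abc123"}, fake_runner)
c = run_once("aligner:1.0", ["align", "sample2"], {"reads": "def456"}, fake_runner)
```

The second call repeats the first request exactly, so the runner is invoked only twice across the three calls. Pinning the container image in the key is what ties this kind of reuse to the reproducibility goal: a different image produces a different fingerprint and therefore a fresh run.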

The compute management system also includes a sub-component for doing tertiary analysis. This component provides an in-memory database for very high-performance queries of a compact representation of a genome that includes variants and other relevant data needed for tertiary analysis. This component, named Lightning, is in the prototype stage.

h2. Virtual Machines

Arvados works best in an environment where informaticians receive access to virtual machines (VMs) on a private or public cloud. This approach eliminates the need to manage separate physical servers for different projects, significantly increasing the utilization of underlying hardware resources. It also gives informaticians a great deal of freedom to choose the best operating systems and tools for their work. With virtual machines, each informatician or project team has full isolation, security, autonomy, and privacy for their work.

The Arvados platform provides shared common services that can be used from within a virtual machine. All of the Arvados services are accessible through APIs.

h2. APIs and SDKs

Arvados is designed so that all of the data management and compute management services can be accessed through a consistent set of APIs and interfaces. All functionality is represented in a set of REST APIs. Arvados provides SDKs for popular languages (Python, Go, Perl, Ruby, R, and Java) as well as a standalone tool for command line use.
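
Because the services are exposed over REST, a client in any language can build requests with ordinary HTTP tooling. The sketch below only constructs a request and does not send it. The host name and token are placeholders, @build_list_request@ is a hypothetical helper, and the @/arvados/v1/collections@ path and bearer-token header are assumptions about a typical installation that should be checked against the API reference for your cluster.

```python
import urllib.parse
import urllib.request

API_HOST = "arvados.example.org"  # placeholder host, not a real cluster
API_TOKEN = "example-token"       # placeholder credential


def build_list_request(resource: str, limit: int) -> urllib.request.Request:
    """Build (but do not send) a GET request that lists one resource type."""
    query = urllib.parse.urlencode({"limit": limit})
    # Assumed URL layout: https://<host>/arvados/v1/<resource>
    url = f"https://{API_HOST}/arvados/v1/{resource}?{query}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        method="GET",
    )


request = build_list_request("collections", limit=10)
```

Sending the request would be a matter of passing @request@ to @urllib.request.urlopen@ against a real cluster with a valid token.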

h2. Workbench

Arvados includes a browser-based UI, Workbench, which provides a convenient way to do common browsing and searching tasks. Workbench also serves as an application portal, providing a point of access to applications running on Arvados.

h2. History

The core technology was prototyped at Harvard Medical School (see [[history]]). Arvados is a modern rewrite of the original code, with refactored APIs and significant new capabilities.

h2. Related Articles

[[Technical Architecture]] showing key components