Technical Architecture » History » Version 33

Anonymous, 10/21/2013 12:03 PM

h1. Technical Architecture

The technical diagram above represents the basic architecture of Arvados.

At the base layer is a "cloud operating system." The platform currently runs on AWS (with EBS providing POSIX volumes and EC2 instances serving as VMs) and on an on-site installation built on commodity hardware with Xen VMs. The project roadmap includes OpenStack integration and potentially other cloud operating systems.

h2. Key Components

*[[Data Manager]]* - The Data Manager helps to orchestrate interactions with data storage. This includes managing rules about permissions, replication, archiving, etc.

*[[Keep|Content Addressable Distributed File System ("Keep")]]* - Arvados stores files in Keep. Keep is a distributed file system that has been optimized for biomedical data files and write-once, read-many (WORM) scenarios. Keep chunks files into 64 MB data blocks and distributes them across physical drives or virtual volumes. Keep writes the data blocks to an underlying file system (e.g. Linux ext4). Keep is also a content-addressable store (CAS). When a file is stored, a content address is created for each data block using a cryptographic digest of the contents of the block. Then a manifest is created that identifies all of the blocks that make up the file. Each manifest has its own unique content address. This ensures every file can be accurately verified every time it is retrieved from the system. Keep also supports the creation of collections, which include multiple files, as a flexible way to define data sets without re-organizing data on disk.
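
The content-addressing scheme described above can be illustrated with a minimal Python sketch. It is not Keep's implementation: the function names (@address@, @store@, @retrieve@), the in-memory dictionary standing in for the storage volumes, the newline-separated manifest format, and the choice of SHA-256 as the digest are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB blocks, as described above


def address(data: bytes) -> str:
    """Content address of a byte string (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(data).hexdigest()


def store(data: bytes, blocks: dict, block_size: int = BLOCK_SIZE):
    """Split data into blocks, store each under its content address,
    and return (manifest, manifest_address)."""
    locators = []
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        locator = address(chunk)
        blocks[locator] = chunk  # identical blocks are stored only once
        locators.append(locator)
    # Hypothetical manifest format: one block address per line.
    manifest = "\n".join(locators)
    return manifest, address(manifest.encode())


def retrieve(manifest: str, blocks: dict) -> bytes:
    """Reassemble a file, verifying every block against its address."""
    parts = []
    for locator in (manifest.split("\n") if manifest else []):
        chunk = blocks[locator]
        if address(chunk) != locator:
            raise ValueError("block %s failed verification" % locator)
        parts.append(chunk)
    return b"".join(parts)
```

Calling @store(data, blocks, block_size=4)@ on a ten-byte string would produce a three-line manifest in this sketch, and @retrieve@ would raise @ValueError@ if any stored block no longer matched its address.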

*[[Computation and Pipeline Processing|MapReduce Engine ("Crunch") & Pipeline Management]]* - Crunch manages distributed processing tasks across cores using the MapReduce model, which makes it much easier to write algorithms that analyze large data sets with distributed processing. It assigns processing tasks to cores that are physically close to the Keep nodes where data are stored. Crunch is designed to maintain data provenance and pipeline reproducibility. The system supports a flexible mechanism for defining and invoking pipelines that use common components such as GATK or custom components. It automatically tracks the exact data inputs and outputs of each job through Keep, and the code used for each job through the git repository.
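
The idea of placing tasks near their data can be sketched as follows. This is a simplified illustration, not Crunch's scheduler: "physically close" is reduced to "on the same node", and the function name @assign_tasks@ and all three input mappings are hypothetical.

```python
def assign_tasks(task_blocks, block_nodes, free_cores):
    """Assign each task to a node, preferring nodes that hold its input block.

    task_blocks: task name -> block address the task reads (hypothetical)
    block_nodes: block address -> list of nodes storing that block
    free_cores:  node name -> number of idle cores
    Returns task name -> node name, or None if no core is free.
    """
    free = dict(free_cores)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for task, block in task_blocks.items():
        # Nodes that store the block and still have an idle core.
        local = [n for n in block_nodes.get(block, []) if free.get(n, 0) > 0]
        # Fall back to any node with an idle core if no local one is free.
        candidates = local or [n for n, cores in free.items() if cores > 0]
        if not candidates:
            assignment[task] = None
            continue
        node = max(candidates, key=lambda n: free[n])  # least-loaded candidate
        free[node] -= 1
        assignment[task] = node
    return assignment
```

A real scheduler would also weigh network topology, queueing, and retries; the sketch only shows the preference for data-local cores with a fallback when none is idle.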

*In-Memory Compact Genome Database ("Lightning")* - Lightning uses a scale-out, open source in-memory database to store genomic data in a compact genome format. VCF files are not suitable for efficient lookups, so we are developing a format to represent variants and other key data for tertiary analysis. Putting this in a scale-out, in-memory database will make it possible to do very fast queries of these data. (This part of the project is in the design stage.)

*[[REST API Server]]* - This component provides OAuth2-authenticated REST APIs to Arvados subsystems (metadata database, jobs, etc.) with the notable exception of Keep (which requires direct access to avoid network performance bottlenecks) and VMs and git (which use the SSH protocol and public key authentication).
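
As a rough illustration of what an OAuth2-authenticated REST call looks like from a client, the sketch below builds (but does not send) an HTTP request using only the Python standard library. The host name, the @jobs@ resource path, the token value, and the use of a standard @Bearer@ authorization header are assumptions for illustration; the API Reference defines the actual endpoints and header format.

```python
import urllib.request


def build_request(base_url: str, resource: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for a REST resource.

    The "Bearer" scheme is the generic OAuth2 convention and is assumed
    here; the real server may expect a different header format.
    """
    url = "%s/%s" % (base_url.rstrip("/"), resource.lstrip("/"))
    headers = {
        "Authorization": "Bearer %s" % token,
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical host, resource, and token, for illustration only.
request = build_request("https://arvados.example.org/api", "jobs", "example-token")
```

Passing the resulting object to @urllib.request.urlopen@ would send it; the sketch stops short of that so it runs without a server.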

*[[Workbench]]* - Workbench is a set of visual tools for using the underlying Arvados services from a web browser. This is especially helpful for querying and browsing data, visualizing provenance, and monitoring jobs and pipelines. Workbench has a modular architecture designed for seamless integration with other Arvados applications.

*[[SDKs|Command Line Interface]]* - The CLI tools provide convenient command-line access to the Arvados API and other services in the Arvados platform.

*[[SDKs]]* - Arvados provides native language SDKs for Python, Perl, Ruby, R, and Java to make it easier to work with the REST APIs in common development environments. The SDKs also support the development of clients for Keep, Crunch and Lightning. (Some SDKs have not yet been implemented.)

*[[Documentation]]* - In addition to the contributors' wiki on the project site, the Arvados source tree includes a documentation project with four sections:

* "User Guide": - Introductory and tutorial materials for developers building analysis or web applications using Arvados.
* "API Reference": - REST API methods and resources, the MapReduce job execution environment, permission model, etc.
* "Admin Guide": - Instructions to system administrators for maintaining an Arvados installation.
* "Install Guide": - How to install and configure Arvados on the cloud management platform of your choice.