Technical Architecture » History » Version 13

Tom Clegg, 04/10/2013 05:36 PM

h1. Technical Architecture

The technical diagram above represents the basic architecture of Arvados.

At the base layer is a "cloud operating system." The platform is currently integrated with AWS, using POSIX volumes in AWS EBS and EC2 VMs. The system also runs on Xen and Debian/Ubuntu. The roadmap includes an [[OpenStack Integration]]. We expect that OpenStack will be the preferred cloud OS for private clouds.

h2. Key Components

*Data Manager* - The Data Manager helps to orchestrate interactions with data storage. This includes managing rules about permissions, replication, archiving, etc.

*Content Addressable Object File Store ("Keep")* - Arvados stores files in Keep. Keep is an object file store optimized for big files and write-once, read-many (WORM) scenarios. Keep splits files into 64 MB chunks and distributes them across physical drives or virtual volumes. Keep can store its chunks on any POSIX filesystem. Keep is also a content-addressable store (CAS). When a file is stored, each 64 MB chunk gets an MD5 hash. A "collection" (a text file containing data block hashes) is then used to represent the complete set of files. Each collection also has an MD5 hash, which becomes the canonical reference to the set of files.
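
The chunk-and-hash scheme described above can be sketched in a few lines of Python. This is an illustration only, not Keep's actual code: the function names (@split_into_chunks@, @store_file@, @make_collection@), the in-memory dict standing in for storage volumes, and the one-line-per-file manifest layout are all assumptions, since the real collection format is not specified here.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given in the text


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Yield successive chunk_size-byte pieces of data."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def store_file(data, block_store, chunk_size=CHUNK_SIZE):
    """Store each chunk under its MD5 hex digest; return the digests in order."""
    digests = []
    for chunk in split_into_chunks(data, chunk_size):
        digest = hashlib.md5(chunk).hexdigest()
        block_store[digest] = chunk  # identical chunks share one key
        digests.append(digest)
    return digests


def make_collection(files, block_store, chunk_size=CHUNK_SIZE):
    """Build a text manifest listing each file's block hashes.

    files: dict of file name -> bytes.
    Returns (collection_hash, manifest_text). The manifest layout
    (file name followed by its block hashes) is hypothetical.
    """
    lines = []
    for name in sorted(files):
        digests = store_file(files[name], block_store, chunk_size)
        lines.append(" ".join([name] + digests))
    manifest = "\n".join(lines) + "\n"
    collection_hash = hashlib.md5(manifest.encode("utf-8")).hexdigest()
    return collection_hash, manifest
```

Because the block store in this sketch is keyed by content hash, storing the same chunk twice does not create a second copy, and the same set of files always produces the same collection hash.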

*Pipeline Manager* - The Pipeline Manager orchestrates execution of pipelines. It finds jobs suitable for satisfying each step of a pipeline, queues new jobs as needed, tracks job progress, and keeps the metadata database up to date with pipeline progress.
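
One way to picture the step-to-job matching described above is the following sketch. It is not the Pipeline Manager's implementation: the @Job@ record, the rule of matching on script name plus parameters, and the @queue@ list are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    script: str
    params: tuple          # hashable (key, value) pairs
    state: str = "queued"  # "queued", "running", "complete" or "failed"


def satisfy_steps(steps, known_jobs, queue):
    """Map each pipeline step to a job, reusing a suitable job where one exists.

    steps: list of (script, params) tuples, one per pipeline step.
    known_jobs: existing Job records that may be reused (new jobs are added).
    queue: list that receives newly created jobs.
    """
    assignment = []
    for script, params in steps:
        # Hypothetical suitability rule: same script, same parameters,
        # and the job has not failed.
        match = next(
            (job for job in known_jobs
             if job.script == script
             and job.params == params
             and job.state != "failed"),
            None,
        )
        if match is None:
            match = Job(script, params)
            queue.append(match)
            known_jobs.append(match)
        assignment.append(match)
    return assignment


def pipeline_progress(assignment):
    """Return the fraction of steps whose job has completed."""
    if not assignment:
        return 0.0
    done = sum(1 for job in assignment if job.state == "complete")
    return done / len(assignment)
```

In this sketch a step whose job already exists costs nothing to satisfy, so only the missing steps end up in the queue.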

*MapReduce Engine* - The Job Manager executes the distributed processing of data across cores using the MapReduce system. The Job Manager is optimized for Map steps, and it moves processing to cores that are physically close to where Keep has stored the data. In private clouds where drives and CPUs are on the same node, this eliminates disk I/O constraints. (The Job Manager has been optimized for these problems. Another MapReduce engine such as Hadoop could also fulfill this purpose, although no work has been done to enable this.)
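
The locality preference described above can be illustrated with a small scheduling sketch. The data structures (@block_locations@, @free_slots@) and the fallback rule are assumptions made for the example, not a description of the Job Manager's actual scheduler.

```python
def assign_map_tasks(blocks, block_locations, free_slots):
    """Assign one map task per block, preferring a node that holds the block.

    blocks: block hashes to process.
    block_locations: dict of block hash -> list of nodes storing that block.
    free_slots: dict of node -> free cores (a copy is modified, not the input).
    Returns a list of (block, node, is_local) tuples.
    """
    slots = dict(free_slots)
    plan = []
    for block in blocks:
        local = [node for node in block_locations.get(block, [])
                 if slots.get(node, 0) > 0]
        if local:
            # Among nodes holding the data, pick the one with the most free cores.
            node, is_local = max(local, key=lambda n: slots[n]), True
        else:
            # No core next to the data is free: fall back to any node with capacity.
            remote = [node for node, free in slots.items() if free > 0]
            if not remote:
                raise RuntimeError("no free cores for block %s" % block)
            node, is_local = max(remote, key=lambda n: slots[n]), False
        slots[node] -= 1
        plan.append((block, node, is_local))
    return plan
```

The @is_local@ flag makes it easy to see how many tasks ran next to their data and how many had to read a block from another node.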

*In-Memory Compact Genome Database ("Lightning")* - Lightning uses a scale-out, open source in-memory database to store genomic data in a compact genome format. VCF files are not suitable for efficient look-ups, so we are developing a format to represent variants and other key data for tertiary analysis. Putting this in a scale-out, in-memory database will make it possible to run very fast queries on these data. (This part of the project is in the design stage.)

*API Service* - This component provides OAuth2-authenticated REST APIs to Arvados subsystems (metadata database, jobs, etc.), with the notable exceptions of Keep (which requires direct access to avoid network performance bottlenecks) and VMs and git (which use the SSH protocol and public key authentication).
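
A request to such an API might be built as follows. The host name, the @/api/jobs@ path, the token value, and the use of the @Bearer@ scheme in the @Authorization@ header are placeholders and assumptions for illustration; the API Reference is the place to check the actual endpoints and header format.

```python
import json
import urllib.request


def build_api_request(base_url, resource, token):
    """Build (but do not send) an authenticated GET request for a resource."""
    url = base_url.rstrip("/") + "/" + resource.lstrip("/")
    request = urllib.request.Request(url, method="GET")
    # Assumption: the OAuth2 access token is presented as a bearer token.
    request.add_header("Authorization", "Bearer " + token)
    request.add_header("Accept", "application/json")
    return request


def fetch_json(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical host, path, and token; nothing is sent over the network here.
    req = build_api_request("https://arvados.example.com", "/api/jobs", "my-token")
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the authentication header easy to inspect without contacting a server.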

*Workbench* - Workbench is a set of visual tools for using the underlying Arvados services from a browser. This is especially helpful for querying and browsing data, visualizing provenance, and monitoring jobs and pipelines. Workbench has a modular architecture designed for seamless integration with other Arvados applications.

*Command Line Tools* - The CLI tools provide convenient access to the Arvados API from the command line.

*SDKs* - Arvados provides native language SDKs for Python, Perl, Ruby, R, and Java to make it easier to work with the REST APIs in common development environments. (Some SDKs have not yet been implemented.)

*Documentation* - In addition to the contributors' wiki on the project site, the Arvados source tree includes a documentation project with four sections:

* "User Guide" - Introductory and tutorial materials for developers building analysis or web applications using Arvados.
* "API Reference" - Details of REST API methods and resources, the MapReduce job execution environment, permission model, etc.
* "Admin Guide" - Instructions to system administrators for maintaining an Arvados installation.
* "Install Guide" - How to install and configure Arvados on the cloud management platform of your choice.