
Revision 26 (Jonathan Sheffi, 05/01/2013 02:53 PM) → Revision 27/62 (Anonymous, 05/03/2013 05:04 PM)

h1. Technical Architecture 

 !arvados_technical_diagram_v13.png! 

 The technical diagram above represents the basic architecture of Arvados.  

 At the base layer is a "cloud operating system." Currently the platform is deployed on AWS (using EBS as POSIX volumes and EC2 instances as VMs), in addition to an on-site installation based on commodity hardware with Xen VMs. The project roadmap includes OpenStack integration and potentially support for other cloud operating systems.

 h2. Key Components 

 *[[Data Manager]]* - The Data Manager helps to orchestrate interactions with data storage. This includes managing rules about permissions, replication, archiving, etc. 

 *[[Keep|Content Addressable Distributed File System ("Keep")]]* - Arvados stores files in Keep, a distributed object store optimized for big biomedical files and write once, read many (WORM) scenarios. Keep chunks files into 64MB data blocks and distributes them across physical drives or virtual volumes, writing the blocks to any underlying POSIX file system (e.g. Linux ext4). Keep is also a content addressable store (CAS): when a file is stored, each 64MB block gets an MD5 hash, and a manifest is created that identifies all of the blocks which make up the files in a "collection." Each collection in turn has its own content address, an MD5 hash which becomes the canonical reference to that set of files. This creates a system where every file can be accurately verified every time it is retrieved from the system.
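The content-addressing scheme can be sketched in a few lines of Python. This is an illustrative toy, not the Arvados client library: the manifest layout here is simplified, and only the chunk-then-hash idea matches Keep.

```python
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # Keep uses 64MB data blocks

def chunk_hashes(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and return each block's MD5 hex digest."""
    return [
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

def collection_address(filename, data):
    """Build a toy manifest listing the block hashes, then return the
    manifest's own MD5 hash -- the collection's content address."""
    manifest = " ".join(chunk_hashes(data) + ["0:%d:%s" % (len(data), filename)])
    return hashlib.md5(manifest.encode()).hexdigest()
```

Because the address is derived from the content, retrieving the same bytes always yields the same hashes, which is what lets every file be verified on retrieval.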

 *[[Computation and Pipeline Processing|Pipeline Manager]]* - The Pipeline Manager orchestrates execution of pipelines. It finds jobs suitable for satisfying each step of a pipeline, queues new jobs as needed, tracks job progress, and keeps the metadata database up-to-date with pipeline progress. 
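The "find a suitable job or queue a new one" behavior can be sketched as follows. This is a hypothetical sketch: the job identifiers and the matching rule (same step name and inputs) are illustrative, not the actual Pipeline Manager logic.

```python
# Jobs already run, keyed by (step, inputs); a matching step reuses the job.
existing_jobs = {("align", "input1"): "job-001"}
queue = []  # newly queued jobs

def job_for(step, inputs):
    """Reuse an existing job that satisfies this step, else queue a new one."""
    key = (step, inputs)
    if key in existing_jobs:
        return existing_jobs[key]
    job_id = "job-%03d" % (len(existing_jobs) + len(queue) + 1)
    queue.append((job_id, step, inputs))
    return job_id

pipeline = [("align", "input1"), ("call-variants", "input1")]
jobs = [job_for(step, inputs) for step, inputs in pipeline]
```

Here the first step reuses "job-001" while the second step queues a new job, mirroring how the Pipeline Manager avoids recomputing work that has already been done.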

 *[[Computation and Pipeline Processing|MapReduce Engine ("Crunch")]]* - Crunch manages distributed processing tasks across cores using the MapReduce model. It assigns processing tasks to cores that are physically close to the Keep nodes where data is stored. In private clouds where disks and CPUs are on the same physical node, this eliminates network I/O bottlenecks. (The MapReduce Engine has been optimized for performance in such an on-site cloud environment. A different MapReduce engine such as Hadoop could also be used with Arvados, although no work has been done to enable this.)  
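The MapReduce model itself can be shown with a minimal single-process sketch. This is generic MapReduce, not the Crunch API; Crunch additionally distributes the map and reduce work across cores near the data.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to each record, group the intermediate (key, value)
    pairs by key, then apply reduce_fn to each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example: count base frequencies across short sequence reads.
reads = ["ACGT", "AACG"]
result = map_reduce(
    reads,
    map_fn=lambda read: [(base, 1) for base in read],
    reduce_fn=lambda base, counts: sum(counts),
)
# result == {"A": 3, "C": 2, "G": 2, "T": 1}
```

Because each map call and each reduce group is independent, the same program parallelizes naturally across many cores, which is what Crunch exploits.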

 *In-Memory Compact Genome Database ("Lightning")* - Lightning uses a scale-out, open source in-memory database to store genomic data in a compact genome format. VCF files are not suitable for efficient look-ups, so we are developing a format to represent variants and other key data for tertiary analysis. Putting these data in a scale-out, in-memory database will make it possible to do very fast queries. (This part of the project is in the design stage.)
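Since Lightning's format is still being designed, the following is purely illustrative of the motivation: looking up a variant in a flat VCF-style row list requires scanning, while a keyed in-memory structure answers the same query in constant time.

```python
# Toy VCF-style rows: (chromosome, position, reference allele, alternate allele).
vcf_rows = [
    ("chr1", 10177, "A", "AC"),
    ("chr1", 10352, "T", "TA"),
    ("chr2", 10616, "CCG", "C"),
]

def scan(chrom, pos):
    """Flat scan of VCF-style rows: O(n) per query."""
    for c, p, ref, alt in vcf_rows:
        if (c, p) == (chrom, pos):
            return ref, alt
    return None

# In-memory index keyed by (chromosome, position): O(1) per query.
index = {(c, p): (ref, alt) for c, p, ref, alt in vcf_rows}
```

Both return the same answer; the point is that an indexed, memory-resident representation makes per-variant queries cheap enough for interactive tertiary analysis.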

 *[[REST API Server|API Service]]* - This component provides OAuth2-authenticated REST APIs to Arvados subsystems (metadata database, jobs, etc.) with the notable exception of Keep (which requires direct access to avoid network performance bottlenecks) and VMs and git (which use the SSH protocol and public key authentication). 
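A call to the API service is an ordinary HTTPS request carrying a token. The host name, token, and resource path below are placeholders, and in practice the SDKs wrap this for you; this sketch only shows the shape of an authenticated request.

```python
import urllib.request

API_HOST = "arvados.example.com"  # placeholder host
API_TOKEN = "xxxxx"               # placeholder OAuth2 token

# Build an authenticated request against a collections-style resource.
request = urllib.request.Request(
    "https://%s/arvados/v1/collections" % API_HOST,
    headers={"Authorization": "Bearer %s" % API_TOKEN},
)
# urllib.request.urlopen(request) would perform the call; it is not
# executed here because the host and token are placeholders.
```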

 *[[Workbench]]* - Workbench is a set of visual tools for using the underlying Arvados services from a browser. This is especially helpful for querying and browsing data, visualizing provenance, and monitoring jobs and pipelines. Workbench has a modular architecture designed for seamless integration with other Arvados applications. 

 *[[SDKs|Command Line Interface]]* - The CLI tools provide convenient access to the Arvados API from the command line.  

 *[[SDKs]]* - Arvados provides native language SDKs for Python, Perl, Ruby, R, and Java to make it easier to work with the REST APIs in common development environments. There are also SDKs that support development of clients for Keep and the MapReduce Engine. (Some SDKs have not yet been implemented.) 

 *[[Documentation]]* - In addition to the contributors' wiki on the project site, the Arvados source tree includes a documentation project with four sections:  

 * "User Guide":http://doc.arvados.org/user/ - Introductory and tutorial materials for developers building analysis or web applications using Arvados.  

 * "API Reference":http://doc.arvados.org/api/ - Details of REST API methods and resources, the MapReduce job execution environment, permission model, etc. 

 * "Admin Guide":http://doc.arvados.org/admin/ - Instructions to system administrators for maintaining an Arvados installation. 

 * "Install Guide":http://doc.arvados.org/install/ - How to install and configure Arvados on the cloud management platform of your choice.