
h1. Technical Architecture 

 !Technical_Diagram_v8.png! 

 The technical diagram above represents the basic architecture of Arvados.  

At the base layer is a "cloud operating system." Currently the platform has been integrated with AWS, using POSIX volumes in AWS EBS and EC2 VMs. The system also runs on Xen and Debian/Ubuntu. The roadmap currently includes an [[OpenStack Integration]]. We expect that OpenStack will be the preferred cloud OS for private clouds.

 h2. Key Components 

*Data Manager* - The Data Manager helps to orchestrate interactions with data storage. This includes managing rules around permissions, replication, duplication, archiving, etc.

We expect the Data Manager will coordinate file format translations (not built yet) and provide an interface layer between the content addressable object file store and the metadata database ("Bob"), which records metadata about each file.
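
As a sketch of the kind of rule the Data Manager applies, here is a small, purely illustrative example of reconciling a block's replica count with policy. The function name and policy values are assumptions for illustration, not Arvados code:

<pre><code class="python">
# Hypothetical sketch of a Data Manager replication rule.
# The policy default and action names are illustrative assumptions.

def replication_action(block_hash, current_replicas, desired_replicas=2):
    """Decide how to reconcile a block's replica count with policy."""
    if current_replicas < desired_replicas:
        return ("replicate", desired_replicas - current_replicas)
    if current_replicas > desired_replicas:
        return ("trim", current_replicas - desired_replicas)
    return ("ok", 0)

# A block found on only one volume gets queued for re-replication:
print(replication_action("acbd18db4cc2f85cedef654fccc4a4d8", 1))
# -> ('replicate', 1)
</code></pre>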

*Content Addressable Object File Store ("Keep")* - Arvados stores files in Keep. Keep is an object file store that has been optimized for big files and write once read many (WORM) scenarios. Keep splits files into 64MB chunks and distributes them across physical drives or virtual volumes, storing the chunks on any POSIX filesystem. Keep is also a content addressable store (CAS): when a file is stored, each 64MB chunk gets an MD5 hash, and a "collection" (a text file containing the data block hashes) is used to represent the complete set of files. Each collection also has an MD5 hash, which becomes the canonical reference to that set of files. It's also possible to define higher-level collections which represent data sets for computations.
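
The chunk-and-hash scheme can be sketched in a few lines of Python. This illustrates the idea described above; it is not Keep's actual manifest format:

<pre><code class="python">
# Illustrative sketch of content addressing, not Keep's real format.
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # Keep's 64MB block size

def file_block_hashes(path):
    """Split a file into 64MB chunks and return each chunk's MD5 hash."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            hashes.append(hashlib.md5(chunk).hexdigest())
    return hashes

def make_collection(files):
    """Build a text 'collection' listing each file's block hashes, then
    hash the collection itself to get its canonical reference."""
    text = "".join("%s %s\n" % (name, " ".join(file_block_hashes(path)))
                   for name, path in sorted(files.items()))
    return hashlib.md5(text.encode()).hexdigest(), text
</code></pre>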

*Pipeline Manager* - The Pipeline Manager orchestrates execution of pipelines. It finds jobs suitable for satisfying each step of a pipeline, queues new jobs as needed, tracks job progress, and keeps the metadata database up-to-date with pipeline progress.
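
A minimal sketch of this queue-and-track cycle, assuming simplified pipeline and job-state structures (these are illustrative, not Arvados data types):

<pre><code class="python">
# Hypothetical sketch of the Pipeline Manager's control loop.

def advance_pipeline(pipeline, job_state, submit):
    """Queue runnable steps; return (completed, total) progress.

    job_state: dict mapping step name -> None / 'queued' / 'complete'
    submit: callable that enqueues one step for execution
    """
    for step in pipeline["steps"]:
        ready = all(job_state.get(d) == "complete" for d in step["depends_on"])
        if job_state.get(step["name"]) is None and ready:
            submit(step)                       # a new job is needed here
            job_state[step["name"]] = "queued"
    done = sum(job_state.get(s["name"]) == "complete"
               for s in pipeline["steps"])
    return done, len(pipeline["steps"])

# Two-step pipeline: align, then call variants.
pipeline = {"steps": [
    {"name": "align", "depends_on": []},
    {"name": "call-variants", "depends_on": ["align"]},
]}
state = {"align": "complete"}
print(advance_pipeline(pipeline, state, submit=lambda s: None))  # (1, 2)
</code></pre>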

*MapReduce Engine* - The MapReduce Engine takes invocations of pipeline steps, pulls the pipeline component code out of the GIT repositories and the data out of Keep, and then executes the distributed processing of the data across cores. It is optimized for Map steps, and it moves processing to cores that are physically close to where Keep has stored the data. In private clouds where drives and CPUs are on the same node, this eliminates disk I/O constraints. (The engine has been optimized for these problems, but conceivably another MapReduce engine such as Hadoop could be substituted, although no work has been done to enable this.)
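
The data-locality idea can be illustrated with a toy placement function, assuming a simple map of block locations (this is not the actual Arvados scheduler):

<pre><code class="python">
# Illustrative sketch of locality-aware task placement: prefer running
# a map task on a node that already holds the task's input block.

def place_task(block_hash, block_locations, idle_nodes):
    """Pick a node for a map task, preferring local copies of its data."""
    local = [n for n in block_locations.get(block_hash, []) if n in idle_nodes]
    if local:
        return local[0]            # compute next to the data: no network read
    return next(iter(idle_nodes))  # fall back to any idle node

locations = {"d41d8cd98f00b204e9800998ecf8427e": ["node3", "node7"]}
print(place_task("d41d8cd98f00b204e9800998ecf8427e",
                 locations, idle_nodes={"node1", "node3"}))
# -> node3
</code></pre>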

*In-Memory Compact Genome Database ("Lightning")* - Lightning uses a scale-out, open source in-memory database to store genomic data in a compact genome format. VCF files are not suitable for efficient look-ups, so we are developing a format that can represent variants and other key data for tertiary analysis. Putting this in a scale-out, in-memory database will make it possible to do very fast queries of these data. (This part of the project is in the design stage and has not been built.)
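
Since this component is still at the design stage, the following is purely illustrative: it shows why keyed, in-memory storage turns a variant lookup into a constant-time operation rather than a scan through a VCF file.

<pre><code class="python">
# Purely illustrative -- the compact format is not built yet.
# An in-memory store keyed by (chromosome, position) makes a variant
# lookup a hash access instead of a scan through a VCF file.

variants = {}  # (chrom, pos) -> list of (ref, alt, sample_id)

def add_variant(chrom, pos, ref, alt, sample_id):
    variants.setdefault((chrom, pos), []).append((ref, alt, sample_id))

add_variant("chr17", 41245466, "G", "A", "sample-001")

# Constant-time query: which samples carry a variant at this locus?
print(variants.get(("chr17", 41245466), []))
</code></pre>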

*API Service* - This component provides OAuth2-authenticated REST APIs for interfacing with the other system components. Most, but not all, interactions with Arvados subsystems (the metadata database, jobs, etc.) go through this service; the notable exceptions are Keep (which requires direct access to avoid network performance bottlenecks) and the VMs and GIT repositories (which use the SSH protocol and public key authentication).
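
A minimal sketch of what an authenticated call to the API service looks like; the host name, token, and resource path below are placeholders, and the exact routes may differ:

<pre><code class="python">
# Minimal sketch of an OAuth2-authenticated REST call. The host,
# token, and path are placeholder assumptions, not documented values.
import json
import urllib.request

API_HOST = "arvados.example.com"   # placeholder
TOKEN = "xxxxxxxxxxxxxxxx"         # token issued at login

req = urllib.request.Request(
    "https://%s/arvados/v1/collections" % API_HOST,
    headers={"Authorization": "Bearer %s" % TOKEN},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["items"])
</code></pre>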

*Workbench* - Workbench is a set of visual tools for using the underlying Arvados services from a browser. We expect this will be especially helpful for querying and browsing data, visualizing provenance, and monitoring the status of jobs and pipelines. Workbench has a modular architecture designed for seamless integration with other Arvados applications.

*Command Line Tools* - The CLI tools provide convenient access to the Arvados services from the command line; they are essentially a shell script SDK for interfacing with the REST APIs.

*SDKs* - Arvados provides native language SDKs for Python, Perl, Ruby, R, and Java to make working with the REST APIs easier in common development environments. (Some SDKs have not yet been implemented.)

*Documentation* - In addition to the contributors' wiki on the project site, the Arvados source tree includes a documentation project with four sections:

 * "User Guide":http://doc.arvados.org/user/ _User Guide_ - Introductory and tutorial materials All of the information for developers building anyone developing analysis or web applications using Arvados.  

 * "API Reference":http://doc.arvados.org/api/ _Administrator Guide_ - Details of REST API methods and resources, the MapReduce job execution environment, permission model, etc. 

 * "Admin Guide":http://doc.arvados.org/admin/ - Instructions on how to administer and Arvados cluster for system administrators for maintaining an Arvados installation. 

 administrators.  

 * "Install Guide":http://doc.arvados.org/install/ _Installation Guide_ - How to install and configure Arvados on the to run in different cloud management platform of your choice. 
 environments.