Ward Vandewege, 04/08/2013 04:51 PM


Technical Architecture

The technical diagram above represents the basic architecture of Arvados.

At the base layer is a "cloud operating system." Currently the platform has been integrated with AWS, using POSIX volumes on AWS EBS and EC2 VMs. The system also runs on Xen and Debian/Ubuntu. The roadmap currently includes OpenStack integration; we expect that OpenStack will be the preferred cloud OS for private clouds.

Key Components

Data Manager - The Data Manager orchestrates interactions with data storage, including managing rules around duplication, archiving, etc. We expect the Data Manager will coordinate file format translations (not built yet) and provide an interface layer between the content addressable object file store and the metadata database ("Bob"), which records metadata about each file.

Content Addressable Object File Store ("Keep") - Arvados stores files in Keep. Keep is an object file store that has been optimized for big files and write once read many (WORM) scenarios. Keep splits files into 64MB chunks and distributes them across physical drives or virtual volumes, storing the chunks on any POSIX filesystem. Keep is also a content addressable store (CAS). When a file is stored, each 64MB chunk gets an MD5 hash. Then a "collection" (a text file with pointers) is used to represent the complete file. Each collection also has an MD5 hash, which becomes the canonical reference to that file. It is also possible to define higher level collections which represent data sets for computations.
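The content-addressing scheme above can be sketched in a few lines of Python. This is illustrative only: the chunk size matches the description, but the collection layout shown here is a placeholder, not Keep's actual manifest format.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks, as described above

def store_file(data: bytes) -> str:
    """Split data into 64MB chunks, hash each chunk, and return the
    collection hash that canonically identifies the file.
    (Illustrative sketch -- not Keep's actual manifest format.)"""
    chunk_hashes = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunk_hashes.append(hashlib.md5(chunk).hexdigest())
    # The "collection" is a text file of pointers to the chunks.
    collection = "\n".join(chunk_hashes) + "\n"
    # The collection's own MD5 hash becomes the canonical file reference.
    return hashlib.md5(collection.encode()).hexdigest()
```

Because the reference is derived purely from content, storing the same bytes twice always yields the same canonical hash, which is what makes deduplication and provenance tracking straightforward.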

Job Manager - The Job Manager orchestrates execution of pipelines. It records the processing of each pipeline in Bob and tracks the progress of pipelines. The Job Manager talks to the MapReduce Engine.

MapReduce Engine ("Jobs") - Jobs takes pipeline invocations, pulls the pipeline component code out of the Git repositories and the data out of Keep, and then executes the distributed processing of the data across cores using the MapReduce system. Jobs is optimized for Map steps, and it moves processing to cores that are physically close to where Keep has stored the data. In private clouds, where drives and CPUs are on the same node, this eliminates disk I/O constraints. (Jobs has been optimized for these problems, but conceivably another MapReduce engine such as Hadoop could be substituted, although no work has been done to enable this.)
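The "move processing to the data" idea can be sketched as a simple locality-aware scheduler. This is a minimal illustration of the concept, not the actual Jobs scheduler: it assumes we know which nodes hold each Keep chunk and assigns each map task to the least-loaded node that already stores its chunk.

```python
def schedule_map_tasks(chunk_locations):
    """Assign each chunk's map task to a node that already holds that
    chunk, balancing load across nodes. chunk_locations maps a chunk
    hash to the list of nodes storing it. (Illustrative sketch only.)"""
    load = {}        # tasks assigned per node so far
    assignment = {}  # chunk hash -> chosen node
    for chunk, nodes in chunk_locations.items():
        # Prefer the least-loaded node among those storing the chunk,
        # so processing moves to the data instead of data to the cores.
        best = min(nodes, key=lambda n: load.get(n, 0))
        assignment[chunk] = best
        load[best] = load.get(best, 0) + 1
    return assignment
```

On a private cloud where each node has both drives and CPUs, an assignment like this means map tasks read their input from local disk rather than over the network.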

In-Memory Compact Genome Database ("Lightening") - Lightening will use a scale-out, open source in-memory database to store genomic data in a compact genome format. This part of the project has not been built. We envision that, instead of using VCF to represent variants, there will be a format that can represent variants and other key data for tertiary analysis. Putting this in a scale-out, in-memory database will make it possible to do very fast queries of these data.

API Engine ("Juicer") - Juicer is an API engine that manages REST APIs for interfacing with the other system components. Most, but not all, interfaces with Arvados go through Juicer; for example, interaction with Git repositories uses standard Git commands.
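A client interaction with Juicer would look like an ordinary authenticated REST call. The sketch below only shows the shape of such a request; the host name, resource path, and bearer-token header are assumptions for illustration, not the documented Arvados API.

```python
import urllib.request

# Hypothetical API host and path -- for illustration only.
API_BASE = "https://arvados.example.com/arvados/v1"

def api_request(resource, uuid=None, token="TOKEN"):
    """Build an authenticated REST request for a resource, e.g. a
    collection or a job, optionally addressed by its uuid.
    (Endpoint layout and auth scheme are illustrative assumptions.)"""
    url = f"{API_BASE}/{resource}"
    if uuid:
        url += f"/{uuid}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})
```

A GET on a resource built this way would retrieve its metadata from Bob via Juicer, while the underlying file content itself is fetched from Keep by collection hash.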

Dashboard ("Workbench") - Workbench is a set of visual tools for using the underlying Arvados services through a browser-based interface. We expect this will be especially helpful for querying and browsing data, visualizing provenance, and checking the status of pipeline processing, among other activities. Workbench has a modular architecture designed for plug-ins.

Command Line Interface - The CLI is essentially a shell script SDK for interfacing with the Arvados services through a command line.

SDKs - The project envisions creating native language SDKs for Python, Perl, Ruby, R, and Java to make working with the REST APIs easier in these language environments.

Documentation - In addition to this wiki for contributors, we envision an open source documentation project that will author and maintain three core guides:

  • User Guide - All of the information for anyone developing analysis or web applications using Arvados.
  • Administrator Guide - Instructions on how to administer an Arvados cluster, for system administrators.
  • Installation Guide - How to install and configure Arvados to run in different cloud environments.