Technical Architecture » History » Version 60

Nico César, 04/06/2016 02:52 PM

h1. Technical Architecture

h2. Architecture Diagram

!ArvadosTechnicalDiagramV16_website.png!

The technical diagram above represents the basic architecture of Arvados. Arvados runs in an elastic computing environment and does not depend on any particular stack: it can run on cloud services and on computing clusters. The platform is currently deployed on Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, as well as on bare metal. Stack-specific integration (for example, for AWS, GCP, or Azure) is kept to a minimum and is confined to the [[Node Manager]] component.

h2. Key Components

*[[Keep|Keep - Content-Addressable Distributed Storage System]]* - Arvados stores files in Keep, a distributed storage system optimized for managing and processing large collections of files (terabytes to petabytes). Keep splits files into 64 MB data blocks, distributes them across physical drives or virtual volumes, and writes them to an underlying file system (e.g. Linux ext4). Keep is also a content-addressable store (CAS). When a file is stored, a content address is created for each data block from a cryptographic digest of the block's contents. A manifest is then created that identifies all of the blocks that make up the file, and each manifest has its own unique content address. This allows every file to be verified each time it is retrieved from the system. Keep also supports collections, which group multiple files, as a flexible way to define data sets without reorganizing data on disk.
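
The block-and-manifest scheme described for Keep can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than Keep's actual implementation: the function names are hypothetical, MD5 stands in for the "cryptographic digest", the @digest+size@ address form is assumed, and the one-line manifest is deliberately simplified.

```python
import hashlib
from typing import List

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size named in the text


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_address(block: bytes) -> str:
    """Content address of a block: hex digest plus length (assumed format)."""
    return f"{hashlib.md5(block).hexdigest()}+{len(block)}"


def build_manifest(filename: str, data: bytes, block_size: int = BLOCK_SIZE) -> str:
    """Simplified manifest: the block addresses in order, then the file entry."""
    addresses = [block_address(b) for b in split_into_blocks(data, block_size)]
    return " ".join(addresses + [f"0:{len(data)}:{filename}"])


def manifest_address(manifest: str) -> str:
    """The manifest is itself content-addressed, like any other block."""
    return block_address(manifest.encode("utf-8"))


def verify_block(block: bytes, address: str) -> bool:
    """Re-derive the address on retrieval and compare it with the stored one."""
    return block_address(block) == address
```

With a tiny block size for demonstration, @build_manifest("f.txt", b"abcdefghij", 4)@ yields three block addresses followed by @0:10:f.txt@. Because an address is derived from the content, any corruption of a block changes its address and is caught by @verify_block@.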

*[[Computation and Pipeline Processing|Container Management Engine ("Crunch") & Pipeline Management]]* - Crunch manages distributed processing tasks across cores. Tasks are executed inside "Docker":http://docker.io containers. Crunch is designed to maintain data provenance and pipeline reproducibility. The system supports a flexible mechanism for defining and invoking pipelines that use common components such as GATK or custom components. It automatically tracks the data inputs and outputs through Keep, the code used for each job through the Git repository, the execution environment through the Docker container that runs the job, and the job parameters through the metadata database.

*[[REST API Server]]* - This component provides OAuth2-authenticated REST APIs to Arvados subsystems (metadata database, jobs, etc.). The notable exceptions are Keep, which requires direct access to avoid network performance bottlenecks, and VMs and Git, which use the SSH protocol and public key authentication.
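
As a rough illustration of what a token-authenticated call to the API server might look like, the sketch below only builds a request object and does not send it. The host name, the token, the @/arvados/v1/@ path prefix, and the @OAuth2@ authorization scheme are all assumptions made for this example; consult the official API documentation for the actual endpoints and header format.

```python
import urllib.request


def build_api_request(api_host: str, api_token: str, resource: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET request for a resource list.

    The URL layout and the Authorization scheme are assumptions for illustration.
    """
    url = f"https://{api_host}/arvados/v1/{resource}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"OAuth2 {api_token}",  # assumed header scheme
            "Accept": "application/json",
        },
        method="GET",
    )
```

A caller would pass the resulting object to @urllib.request.urlopen@ to perform the call, for example @build_api_request("arvados.example.com", token, "collections")@, where the host and resource name are placeholders.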

*[[Workbench]]* - Workbench is a set of visual tools for using the underlying Arvados services from a web browser. It is especially helpful for querying and browsing data, visualizing provenance, and monitoring jobs and pipelines. Workbench has a modular architecture designed for seamless integration with other Arvados applications.

*[[SDKs|Command Line Interface]]* - The CLI tools provide convenient command-line access to the Arvados API and to other services in the Arvados platform.

*[[SDKs]]* - Arvados provides native-language SDKs for Go, Python, Perl, Ruby, R, and Java to make it easier to work with the REST APIs in common development environments. The SDKs also support the development of clients for Keep, Crunch, and Lightning. (Some SDKs have not yet been implemented.)

*[[Data Manager]]* - Data Manager helps orchestrate interactions with data storage. This includes managing rules about permissions, replication, archiving, etc.

*[[Node Manager]]* - Node Manager manages compute resources in a cloud environment. It starts and stops compute nodes on demand. For a bare metal installation, the number of compute nodes tends to be static, which means Node Manager is not required. Node Manager currently supports AWS and GCP.
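
The idea of starting and stopping compute nodes on demand can be sketched as a small decision function. This is a hypothetical policy, not Node Manager's actual logic: the names, the one-node-per-queued-job rule, and the @min_nodes@ / @max_nodes@ limits are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    boot: int       # number of new nodes to start
    shut_down: int  # number of idle nodes to stop


def plan_scaling(queued_jobs: int, busy_nodes: int, idle_nodes: int,
                 min_nodes: int = 0, max_nodes: int = 10) -> ScalingDecision:
    """Decide how many nodes to boot or shut down for the current demand."""
    total = busy_nodes + idle_nodes
    # Assumption: each queued job needs one node of its own.
    wanted = busy_nodes + queued_jobs
    # Clamp the target to the configured limits.
    target = max(min_nodes, min(max_nodes, wanted))
    if target > total:
        return ScalingDecision(boot=target - total, shut_down=0)
    # Only idle nodes are eligible for shutdown; busy nodes are left alone.
    return ScalingDecision(boot=0, shut_down=min(idle_nodes, total - target))
```

On a bare metal cluster the node count is fixed, so a function like this would have nothing to decide, which matches the statement that Node Manager is not required there.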

*[[Keep Proxy]]* - Keep Proxy provides remote (non-LAN) authenticated read and write access to Keep. It allows remote Keep clients to upload one copy of a block, and takes care of storing the desired number of replicated copies in Keep. It is API-compatible with Keep.
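
The replication behaviour described for Keep Proxy (the client uploads one copy, the proxy stores several) can be modelled with in-memory stand-ins. The classes below are hypothetical stubs, not the real services: @VolumeStub@, @ProxyStub@, the MD5 address, and the hash-based choice of volumes are all assumptions for illustration.

```python
import hashlib
from typing import Dict, List


class VolumeStub:
    """In-memory stand-in for one storage volume."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = hashlib.md5(data).hexdigest()
        self.blocks[address] = data
        return address


class ProxyStub:
    """Accepts a single upload and fans it out to several volumes."""

    def __init__(self, volumes: List[VolumeStub]) -> None:
        self.volumes = volumes

    def put(self, data: bytes, replicas: int = 2) -> str:
        if not 1 <= replicas <= len(self.volumes):
            raise ValueError("replica count must be between 1 and the number of volumes")
        address = hashlib.md5(data).hexdigest()
        # Pick the starting volume from the hash so blocks spread out (assumed policy).
        start = int(address, 16) % len(self.volumes)
        for i in range(replicas):
            self.volumes[(start + i) % len(self.volumes)].put(data)
        return address

    def get(self, address: str) -> bytes:
        # Any volume holding the block can serve the read.
        for volume in self.volumes:
            if address in volume.blocks:
                return volume.blocks[address]
        raise KeyError(address)
```

The client-facing @put@ and @get@ mirror the volume's own interface, which is one way to read "API-compatible": a client written against a volume could talk to the proxy without changes.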

*"Documentation":http://doc.arvados.org* - This is the official documentation, which is also included in the Arvados source tree. Information for documentation developers is available on the [[Documentation project]] wiki page.