Version 6 - History - Technical Architecture - Arvados

The technical diagram above represents the basic architecture of Orvos.

At the base layer is a "cloud operating system." Currently the platform is integrated with AWS, using ext4 volumes on AWS EBS and EC2 VMs. The system also runs on Xen and Ubuntu. The roadmap includes an [[OpenStack Integration]]. We expect that OpenStack will be the preferred cloud OS for private clouds.

h2. Key Components

*Data Manager* - The Data Manager orchestrates interactions with data storage, including managing rules for duplication and archiving. We expect the Data Manager to coordinate file format translations (not built yet) and to provide an interface layer between the content-addressable object file store and the metadata database ("Bob"), which can record metadata about each file.

*Content Addressable Object File Store ("Keep")* - Orvos stores files in Keep. First, Keep is an object file store that has been optimized for big files and write-once-read-many (WORM) scenarios. Keep splits files into 64 MB chunks and distributes them across physical drives or virtual volumes. Currently Keep writes to the ext4 file system. Keep is also a content-addressable store (CAS). When a file is stored, each 64 MB chunk gets an MD5 hash. Then a "collection" (a text file with pointers) is used to represent the complete file. Each collection also has an MD5 hash, which becomes the canonical reference to that file. It is also possible to define higher-level collections that represent data sets for computations.
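
The chunk-and-collection scheme described above can be illustrated with a small Python sketch. This is not the Keep implementation or its API: the @ToyKeep@ class, its method names, the in-memory dictionary, and the newline-separated collection format are all assumptions made for illustration. Only the 64 MB chunk size and the use of MD5 for both chunks and collections come from the description.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size described above


class ToyKeep:
    """In-memory stand-in for a content-addressable store (hypothetical, not the Keep API)."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.blocks = {}  # MD5 hex digest -> bytes

    def put_block(self, data):
        """Store one block under the MD5 hash of its content and return that hash."""
        digest = hashlib.md5(data).hexdigest()
        self.blocks[digest] = data  # identical content always maps to the same key
        return digest

    def put_file(self, data):
        """Chunk the data, store each chunk, then store a collection of chunk hashes."""
        chunk_hashes = [
            self.put_block(data[i:i + self.chunk_size])
            for i in range(0, len(data), self.chunk_size)
        ]
        # Assumed collection format: one chunk hash per line.
        collection = "\n".join(chunk_hashes).encode("ascii")
        return self.put_block(collection)  # the collection hash references the file

    def get_file(self, collection_hash):
        """Resolve a collection hash back into the original bytes."""
        collection = self.blocks[collection_hash].decode("ascii")
        return b"".join(self.blocks[h] for h in collection.split("\n") if h)


# Tiny chunk size so the example is easy to follow.
keep = ToyKeep(chunk_size=4)
ref = keep.put_file(b"ACGTACGTAC")
assert keep.get_file(ref) == b"ACGTACGTAC"
```

Because blocks are keyed by their content hash, the two identical @ACGT@ chunks in the example are stored only once, and storing the same file twice yields the same collection hash.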

*Job Manager* - The Job Manager orchestrates execution of pipelines. It records the processing of each pipeline in Bob and tracks the progress of pipelines. The Job Manager talks to the MapReduce Engine.

*MapReduce Engine ("Jobs")* - Jobs takes invocations of pipelines, pulls the pipeline component code out of the Git repositories and the data out of Keep, and then executes the distributed processing of the data across cores using the MapReduce system. Jobs is optimized for Map steps, and it moves processing to cores that are physically close to where Keep has stored the data. In private clouds where drives and CPUs are on the same node, this eliminates disk I/O constraints. (Jobs has been optimized for these problems, but conceivably another MapReduce engine such as Hadoop could be substituted, although no work has been done to enable this.)
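
The idea of moving processing close to the data can be sketched as a simple greedy placement. This is an illustrative assumption, not how Jobs actually schedules work: the @assign_tasks@ function, its inputs, and the fallback to the least-loaded remote node are hypothetical.

```python
def assign_tasks(chunk_locations, free_cores):
    """Greedy placement: run each chunk's map task on a node that already holds the chunk.

    chunk_locations: dict of chunk hash -> list of node names storing that chunk
    free_cores: dict of node name -> number of idle cores
    Returns a dict of chunk hash -> chosen node (None when no core is free anywhere).
    """
    remaining = dict(free_cores)  # copy so the caller's dict is not modified
    placement = {}
    for chunk, nodes in chunk_locations.items():
        # Prefer nodes that hold the chunk and still have an idle core.
        local = [n for n in nodes if remaining.get(n, 0) > 0]
        if local:
            node = max(local, key=lambda n: remaining[n])
        else:
            # Fall back to any node with an idle core; the data must then be moved.
            candidates = [n for n, cores in remaining.items() if cores > 0]
            node = max(candidates, key=lambda n: remaining[n]) if candidates else None
        placement[chunk] = node
        if node is not None:
            remaining[node] -= 1
    return placement


placement = assign_tasks(
    {"c1": ["n1"], "c2": ["n1"], "c3": ["n2"]},
    {"n1": 1, "n2": 2},
)
assert placement == {"c1": "n1", "c2": "n2", "c3": "n2"}
```

In the example, @c1@ and @c3@ run where their data lives, while @c2@ is placed remotely because @n1@ has no idle core left. A real scheduler would likely also weigh network distance and queueing.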

*In-Memory Compact Genome Database ("Lightening")* - Lightening will use a scale-out, open source in-memory database to store genomic data in a compact genome format. This part of the project has not been built. We envision that, instead of using VCF to represent variants, there will be a format that can represent variants and other key data for tertiary analysis. Putting this in a scale-out, in-memory database will make it possible to run very fast queries on these data.

*API Engine ("Juicer")* - Juicer is an API engine that manages REST APIs for interfacing with the other system components. Not all interfaces will go through Juicer. For example, interaction with Git repositories will use standard Git commands.

*Dashboard ("Explorvos")* - Explorvos is a set of visual tools for using the underlying Orvos services through a browser-based interface. We expect this will be especially helpful for querying and browsing data, visualizing provenance, and checking the status of pipeline processing, among other activities. Explorvos has a modular architecture designed for plug-ins.

*Command Line Interface* - The CLI is essentially a shell script SDK for interfacing with the Orvos services through a command line.

*SDKs* - The project envisions creating native language SDKs for Python, Perl, Ruby, R, and Java to make working with the REST APIs easier in these language environments.

*Documentation* - In addition to this wiki for contributors, we envision an open source documentation project that will author and maintain three core guides:

* _User Guide_ - All of the information for anyone developing analysis or web applications using Orvos.

* _Administrator Guide_ - Instructions for system administrators on how to administer an Orvos cluster.

* _Installation Guide_ - How to install and configure Orvos to run in different cloud environments.
