Crunch » History » Version 5

Tom Clegg, 04/12/2013 05:40 PM

h1. Computation and Pipeline Processing
Arvados provides capabilities for defining pipelines, running distributed computations with MapReduce, and maintaining provenance and reproducibility.
h2. Design Goals
Notable design goals and features include:
* Easy definition of pipelines 
* Invocation of pipelines with different parameters
* Tracking the processing of jobs
* Recording and reproduction of pipelines 
* Distributing computations with MapReduce
* Integrating with [[Keep]] and git repositories to maintain provenance
h2. MapReduce Introduction
Arvados uses MapReduce to schedule and distribute computations across a pool of compute nodes.
(See the "MapReduce article at Wikipedia":https://en.wikipedia.org/wiki/MapReduce for an introduction to the MapReduce programming model. Hadoop is a popular open-source implementation of MapReduce.)
Arvados includes a MapReduce engine specifically designed to address the needs of large genomic datasets. In these datasets the computations tend to be embarrassingly parallel, so the focus is on map steps rather than reduce steps or large sorts.
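
The embarrassingly parallel, map-heavy pattern described above can be sketched in a few lines. This is an illustration of the programming model only, not the Arvados API: the chunking and the @map_task@ function are hypothetical stand-ins for real work such as aligning one chunk of reads.

```python
# Illustrative map-only computation: each pre-split input chunk becomes
# an independent task with no reduce step or global sort required.
from concurrent.futures import ThreadPoolExecutor

def map_task(chunk):
    # Stand-in for real per-chunk work (e.g. alignment of one read set).
    return sum(chunk)

chunks = [[1, 2], [3, 4], [5, 6]]   # hypothetical pre-split inputs
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(map_task, chunks))

print(results)  # per-task outputs, in input order: [3, 7, 11]
```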
Arvados is designed to make MapReduce easier to use, even for bioinformaticians who have not used it before.
h2. Pipelines
A pipeline is a set of related MapReduce jobs. The most obvious example consists of two jobs, where the first job's output is the second job's input.
A pipeline _template_ is a pattern that describes the relationships among the component jobs: for example, the template specifies that job A's output is job B's input. A pipeline template is analogous to a Makefile.
A pipeline _instance_ is the act or record of applying a pipeline template to a specific set of inputs. Generally, a pipeline instance refers to the UUIDs of jobs that have been run to satisfy the pipeline components.
Pipeline templates and instances are described in a simple JSON structure.
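
As a sketch of that structure (the field and component names here are illustrative, not the authoritative Arvados schema), a two-job template where job A's output feeds job B might look like:

```json
{
  "name": "example-pipeline",
  "components": {
    "jobA": {
      "script": "align.py",
      "script_parameters": {
        "input": "<input collection UUID>"
      }
    },
    "jobB": {
      "script": "call-variants.py",
      "script_parameters": {
        "input": {"output_of": "jobA"}
      }
    }
  }
}
```

A pipeline instance would record, alongside each component, the UUID of the job that was run to satisfy it.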
h2. MapReduce Jobs
Applications and users add jobs to the queue by creating Job resources via the Arvados REST API.
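
A minimal sketch of what such a request might contain follows. The endpoint path, script name, and field names are assumptions based on the description above, not the authoritative API schema; an authenticated client would POST this payload to the jobs endpoint.

```python
# Sketch of the JSON body an application might send to queue a job.
# Field names ("script", "script_version", "script_parameters") are
# assumptions for illustration; consult the Arvados API reference.
import json

job = {
    "job": {
        "script": "hash.py",                  # hypothetical crunch script
        "script_version": "master",           # git branch/commit to run
        "script_parameters": {"input": "<collection UUID>"},
    }
}

payload = json.dumps(job)
# An authenticated client would POST `payload` to /arvados/v1/jobs,
# adding the job to the queue for the dispatcher to pick up.
print(payload)
```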
The Arvados job dispatcher picks jobs from the queue, allocates nodes according to specified resource constraints, and invokes the Job Manager.
The Job Manager executes each job task, enforces task sequencing and resource constraints, checks process exit codes and other failure indicators, re-attempts failed tasks when needed, and keeps the Arvados system/metadata DB up-to-date with the job's progress.
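
The exit-code checking and re-attempt behavior described above can be sketched as a simple retry loop. This is an illustration, not Job Manager code: @run_task@ and the attempt limit are hypothetical.

```python
# Minimal sketch of retry-on-failure: run a task as a subprocess,
# check its exit code, and re-attempt up to a fixed limit.
import subprocess

MAX_ATTEMPTS = 3

def run_task(cmd):
    """Run cmd, retrying on nonzero exit; return the attempt number
    that succeeded, or raise after MAX_ATTEMPTS failures."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:   # success: record progress, move on
            return attempt
    raise RuntimeError("task failed after %d attempts" % MAX_ATTEMPTS)
```

A real job manager would additionally record each attempt's outcome in the system/metadata DB, as described above.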
For debugging and testing, the Job Manager can run as a stand-alone utility in a VM. In this environment, job tasks execute in the local VM and the job script source code need not reside in a git repository controlled by Arvados; as a result, any provenance information that is recorded is considerably less valuable.