Crunch

Anonymous, 04/11/2013 01:25 AM

Computation and Pipeline Processing

Arvados provides capabilities for defining pipelines, using MapReduce to run distributed computations, and maintaining provenance and reproducibility.

Design Goals

Notable design goals and features include:

  • Easy definition of pipelines
  • Invocation of pipelines with different parameters
  • Tracking the processing of jobs
  • Recording and reproduction of pipelines
  • Distributing computations with MapReduce
  • Integration with Keep and the Git repository to maintain provenance

MapReduce Introduction

Arvados is designed to make it easier for informaticians to use MapReduce to distribute computations across nodes.

From Wikipedia:

MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers. Writing a parallel-executable program has proven over the years to be a very challenging task, requiring various specialized skills. MapReduce provides regular programmers the ability to produce parallel distributed programs much more easily, by requiring them to write only the simpler Map() and Reduce() functions, which focus on the logic of the specific problem at hand, while the "MapReduce System" (also called "infrastructure", "framework") automatically takes care of marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, providing for redundancy and failures, and overall management of the whole process.

Hadoop is a popular open-source implementation of MapReduce. Arvados currently uses its own implementation of MapReduce, designed to address the needs of large genomic datasets. In these datasets the computations tend to be embarrassingly parallel, so the focus is on map steps rather than on reduce steps or big sorts.
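
As a rough illustration of the map-heavy pattern described above, the following Python sketch runs a map function over independent chunks in parallel and then applies a small reduce step. The function names (`count_gc`, `merge_counts`, `map_reduce`) and the sample sequences are hypothetical; this only illustrates the programming model and is not the Arvados implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_gc(sequence):
    """Map step: count G/C bases in one independent chunk of sequence."""
    return sum(1 for base in sequence if base in "GC")


def merge_counts(left, right):
    """Reduce step: combine two partial results."""
    return left + right


def map_reduce(chunks, map_fn, reduce_fn, initial):
    # Each chunk is processed independently, so the map step parallelizes
    # trivially; the reduce step is a cheap final combination.
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(map_fn, chunks))
    return reduce(reduce_fn, partials, initial)


chunks = ["GATTACA", "CCGG", "ATAT"]  # hypothetical sample data
total_gc = map_reduce(chunks, count_gc, merge_counts, 0)
```

Because no chunk depends on another, adding more workers (or nodes) shortens the map step without changing the result.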

Arvados is designed to make MapReduce easier to use for bioinformaticians who may not be familiar with it.

Pipelines

A pipeline is a set of related MapReduce jobs. The most obvious example consists of two jobs, where the first job's output is the second job's input.

A pipeline template is a pattern that describes the relationships among the component jobs: for example, the template specifies that job A's output is job B's input. A pipeline template is analogous to a Makefile.
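
The Makefile analogy can be sketched as a dependency graph: each component names the component whose output it consumes, and a topological sort gives a valid run order. The field names (`components`, `script`, `input_from`) and the component names are assumptions made for this example, not the actual template schema.

```python
from graphlib import TopologicalSorter

# Hypothetical template: each component names the component whose output
# it consumes (None means it reads the pipeline's external input).
template = {
    "components": {
        "align": {"script": "align.py", "input_from": None},
        "call_variants": {"script": "call.py", "input_from": "align"},
    }
}


def run_order(template):
    """Return component names in an order that respects dependencies."""
    graph = {
        name: ({spec["input_from"]} if spec["input_from"] else set())
        for name, spec in template["components"].items()
    }
    return list(TopologicalSorter(graph).static_order())


order = run_order(template)
```

As with `make`, a component can only run after the components it depends on have produced their outputs.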

A pipeline instance is the act or record of applying a pipeline template to a specific set of inputs. Generally, a pipeline instance refers to the UUIDs of jobs that have been run to satisfy the pipeline components.

Pipeline templates and instances are described in a simple JSON structure.
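
A minimal sketch of what such a JSON record might look like, assuming a hypothetical schema: the instance starts with no job UUIDs and fills them in as component jobs are run. The field names (`template_name`, `input`, `job_uuid`) and the placeholder identifiers are invented for illustration.

```python
import json


def instantiate(template, input_locator):
    """Build a pipeline instance record from a template and one input."""
    return {
        "template_name": template["name"],
        "input": input_locator,
        # Job UUIDs are unknown until each component's job has been run.
        "components": {
            name: {"job_uuid": None} for name in template["components"]
        },
    }


# Hypothetical template with two components, B consuming A's output.
template = {"name": "example", "components": {"A": {}, "B": {"input_from": "A"}}}
instance = instantiate(template, "hypothetical-input-locator")

# Record that component A has been satisfied by a (placeholder) job.
instance["components"]["A"]["job_uuid"] = "hypothetical-job-uuid-1"

serialized = json.dumps(instance, indent=2, sort_keys=True)
restored = json.loads(serialized)
```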

MapReduce Jobs

Applications and users add jobs to the queue by creating Job resources via the Arvados REST API.
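
To make the idea of creating a Job resource over REST concrete, the sketch below prepares an HTTP POST without sending it. The host, resource path, authorization scheme, and job fields (`script`, `script_parameters`) are all assumptions for illustration and should be checked against the Arvados API documentation.

```python
import json
import urllib.request


def build_job_request(api_base, token, job):
    """Prepare (but do not send) an HTTP POST that would create a job."""
    body = json.dumps({"job": job}).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_base}/jobs",  # hypothetical resource path
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )


request = build_job_request(
    "https://api.example.invalid/v1",  # placeholder host
    "example-token",
    {"script": "example_script", "script_parameters": {"input": "example-input"}},
)
```

Passing `request` to `urllib.request.urlopen` would submit it; the sketch stops short of that so it can be read and run without a server.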

The Arvados job dispatcher picks jobs from the queue, allocates nodes according to specified resource constraints, and invokes the Job Manager.
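
The dispatch step can be pictured with a small sketch, assuming a first-in-first-out queue and a single resource constraint (a minimum node count). The `min_nodes` field and the node names are hypothetical, and the real dispatcher may use a different scheduling policy.

```python
from collections import deque


def dispatch(queue, free_nodes):
    """Start queued jobs in FIFO order while enough free nodes remain.

    Returns a list of (job_name, allocated_nodes) assignments. Jobs that
    do not fit stay in the queue.
    """
    assignments = []
    while queue and queue[0]["min_nodes"] <= len(free_nodes):
        job = queue.popleft()
        allocated = [free_nodes.pop() for _ in range(job["min_nodes"])]
        assignments.append((job["name"], allocated))
    return assignments


queue = deque([
    {"name": "job-1", "min_nodes": 2},
    {"name": "job-2", "min_nodes": 2},
])
free_nodes = ["node-a", "node-b", "node-c"]
assignments = dispatch(queue, free_nodes)
```

Here `job-1` receives two nodes and `job-2` waits, because only one node is left.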

The Job Manager executes each job task, enforces task sequencing and resource constraints, checks process exit codes and other failure indicators, re-attempts failed tasks when needed, and keeps the Arvados system/metadata DB up-to-date with the job's progress.
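
The responsibilities listed above (sequencing, exit-code checks, retries, progress tracking) can be sketched as a small loop. This is an illustrative outline only: the function names, the retry limit, and the in-memory `progress` list standing in for the metadata database are assumptions, not the Job Manager's actual design.

```python
import subprocess
import sys


def run_task(command, max_attempts=3):
    """Run one task, retrying on a nonzero exit code.

    Returns (succeeded, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return True, attempt
    return False, max_attempts


def run_job(tasks, max_attempts=3):
    """Run tasks in sequence; stop at the first task that keeps failing."""
    progress = []  # stand-in for updates to a metadata database
    for name, command in tasks:
        ok, attempts = run_task(command, max_attempts)
        progress.append({"task": name, "success": ok, "attempts": attempts})
        if not ok:
            break
    return progress


tasks = [
    ("ok", [sys.executable, "-c", "pass"]),
    ("fails", [sys.executable, "-c", "import sys; sys.exit(1)"]),
    ("never_runs", [sys.executable, "-c", "pass"]),
]
progress = run_job(tasks, max_attempts=2)
```

The second task exhausts its attempts, so the third task is never started and the recorded progress ends at the failure.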

For debugging and testing, the Job Manager can operate as a stand-alone utility in a VM. In this environment, job tasks are executed in the local VM and the job script source code is not required to be in a git repository controlled by Arvados. Therefore, the provenance information, if stored at all, is considerably less valuable.