Crunch » History » Version 3

Tom Clegg, 04/11/2013 01:25 PM


Computation and Pipeline Processing

Arvados provides capabilities for defining pipelines, running distributed computations with MapReduce, and maintaining provenance and reproducibility.

Design Goals

Notable design goals and features include:

  • Easy definition of pipelines
  • Invocation of pipelines with different parameters
  • Tracking the processing of jobs
  • Recording and reproduction of pipelines
  • Distributing computations with MapReduce
  • Integrating with Keep and git repositories to maintain provenance

MapReduce Introduction

Arvados is designed to make it easier for informaticians to use MapReduce to distribute computations across nodes, including bioinformaticians who have not used it before.

(See the MapReduce article at Wikipedia for an introduction to the MapReduce programming model. Hadoop is a popular free implementation of MapReduce.)

Arvados includes a MapReduce engine designed specifically for large genomic datasets. Computations on these datasets tend to be embarrassingly parallel, so the engine focuses on map steps rather than on reduce steps or large sorts.
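
As a rough illustration of a map-heavy, embarrassingly parallel computation, the following Python sketch processes each chromosome independently and then combines the results with a trivial reduce step. It does not use the Arvados MapReduce engine: the function names, the sample data, and the local thread pool are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input: variant positions grouped by chromosome.
VARIANTS = {
    "chr1": [101, 2500, 90210],
    "chr2": [42, 77],
    "chrX": [5],
}


def map_task(item):
    """Map step: process one chromosome independently of the others."""
    chromosome, positions = item
    return chromosome, len(positions)


def reduce_counts(mapped):
    """Reduce step: a cheap merge, with no large sort or shuffle."""
    return dict(mapped)


def run(variants):
    # No map task needs data from any other task, so the tasks can run
    # in any order and on any worker.
    with ThreadPoolExecutor(max_workers=4) as pool:
        mapped = list(pool.map(map_task, variants.items()))
    return reduce_counts(mapped)


if __name__ == "__main__":
    print(run(VARIANTS))
```

Because almost all of the work happens in the map step, adding workers (or nodes) scales the computation without any coordination between tasks.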

Pipelines

A pipeline is a set of related MapReduce jobs. The simplest example consists of two jobs, where the first job's output is the second job's input.

A pipeline template is a pattern that describes the relationships among the component jobs: for example, the template specifies that job A's output is job B's input. A pipeline template is analogous to a Makefile.

A pipeline instance is the act or record of applying a pipeline template to a specific set of inputs. Generally, a pipeline instance refers to the UUIDs of jobs that have been run to satisfy the pipeline components.

Pipeline templates and instances are described in a simple JSON structure.
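
To make the template and instance concepts concrete, here is a minimal Python sketch of what such JSON structures might look like. The field names ("components", "output_of", "job_uuid", and so on) and the placeholder UUID are hypothetical and are not taken from the actual Arvados schema; the sketch only shows a template in which job B consumes job A's output, and an instance that records which jobs satisfied each component.

```python
import json

# Hypothetical template: component B consumes the output of component A.
TEMPLATE = {
    "name": "align-then-call",
    "components": {
        "A": {"script": "align", "input": {"param": "reads"}},
        "B": {"script": "call-variants", "input": {"output_of": "A"}},
    },
}


def run_order(template):
    """Return component names ordered so each runs after its input."""
    components = template["components"]
    ordered = []

    def visit(name, seen=()):
        if name in ordered:
            return
        if name in seen:
            raise ValueError("cycle at component " + name)
        upstream = components[name]["input"].get("output_of")
        if upstream is not None:
            visit(upstream, seen + (name,))
        ordered.append(name)

    for name in sorted(components):
        visit(name)
    return ordered


def instantiate(template, parameters, job_uuids):
    """Build an instance record: the template applied to specific inputs."""
    return {
        "template": template["name"],
        "parameters": parameters,
        "components": {
            name: {"job_uuid": job_uuids.get(name)}
            for name in run_order(template)
        },
    }


if __name__ == "__main__":
    # Only component A has been run so far; B has no job yet.
    instance = instantiate(TEMPLATE, {"reads": "example-input"},
                           {"A": "placeholder-job-uuid-A"})
    print(json.dumps(instance, indent=2))
```

As with a Makefile, the template states only the dependencies; the order in which jobs run is derived from them.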

MapReduce Jobs

Applications and users add jobs to the queue by creating Job resources via the Arvados REST API.
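
The sketch below shows how a client might build such a request in Python using only the standard library. The URL path, the authorization header format, and the attribute names ("script", "script_version", "script_parameters") are assumptions made for illustration and should be checked against the Arvados API documentation. The request is constructed but not sent.

```python
import json
import urllib.request


def build_job_request(api_base, token, script, script_version, parameters):
    """Build (but do not send) a POST request that would create a Job.

    All endpoint and field names here are hypothetical placeholders.
    """
    body = json.dumps({
        "job": {
            "script": script,
            "script_version": script_version,
            "script_parameters": parameters,
        }
    }).encode("utf-8")
    return urllib.request.Request(
        url=api_base.rstrip("/") + "/jobs",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
    )


if __name__ == "__main__":
    request = build_job_request(
        "https://api.example.invalid/v1", "EXAMPLE_TOKEN",
        "align", "abc123", {"input": "example-input"},
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the resulting object to urllib.request.urlopen would submit it; the example stops short of that because the host is a placeholder.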

The Arvados job dispatcher picks jobs from the queue, allocates nodes according to specified resource constraints, and invokes the Job Manager.
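
The dispatcher's role can be sketched as a small scheduling loop. This is not the Arvados dispatcher: the "min_nodes" constraint, the strict first-in-first-out policy, and the start_job callback (standing in for invoking the Job Manager) are assumptions chosen to keep the example short.

```python
from collections import deque


def dispatch(queue, free_nodes, start_job):
    """Start queued jobs in order while enough free nodes remain.

    queue: deque of dicts with hypothetical keys "uuid" and "min_nodes".
    free_nodes: list of idle node names.
    start_job: callable(job, nodes) that hands the job to a job manager.
    Returns the list of node names that are still free.
    """
    free = list(free_nodes)
    # Strict FIFO: if the job at the head of the queue does not fit,
    # nothing behind it is started either.
    while queue and queue[0]["min_nodes"] <= len(free):
        job = queue.popleft()
        count = job["min_nodes"]
        allocated, free = free[:count], free[count:]
        start_job(job, allocated)
    return free


if __name__ == "__main__":
    jobs = deque([
        {"uuid": "job-1", "min_nodes": 2},
        {"uuid": "job-2", "min_nodes": 2},
    ])
    remaining = dispatch(jobs, ["node1", "node2", "node3"],
                         lambda job, nodes: print(job["uuid"], "on", nodes))
    print("still free:", remaining)
```

A real dispatcher would likely consider more constraints than a node count, and might let small jobs run ahead of a large job that is waiting for nodes.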

The Job Manager executes each job task, enforces task sequencing and resource constraints, checks process exit codes and other failure indicators, retries failed tasks when needed, and keeps the Arvados system/metadata DB up to date with the job's progress.
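
The following Python sketch illustrates three of these responsibilities: running tasks in sequence order, checking exit codes, and retrying failures. It is not the Job Manager's implementation. The (sequence, command) task format, the fixed retry limit, and the report callback (standing in for updating the metadata DB) are assumptions, and resource constraints are left out.

```python
import subprocess
import sys


def run_task(command, max_attempts=3):
    """Run one task, retrying while it exits with a non-zero code.

    Returns the number of attempts used, or raises RuntimeError.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
    raise RuntimeError(
        "task failed after %d attempts: %r" % (max_attempts, command))


def run_job(tasks, report=print):
    """Run tasks grouped by sequence number, lowest first.

    tasks: list of (sequence, command) pairs. Every task at one
    sequence level must succeed before the next level starts.
    report: callable standing in for a progress update to the database.
    """
    for sequence in sorted({seq for seq, _ in tasks}):
        for seq, command in tasks:
            if seq == sequence:
                attempts = run_task(command)
                report({"sequence": seq, "command": command,
                        "attempts": attempts})


if __name__ == "__main__":
    # Two trivial tasks that run the current Python interpreter.
    ok = [sys.executable, "-c", "pass"]
    run_job([(1, ok), (0, ok)])
```

In this sketch a task that keeps failing aborts the whole job, so later sequence levels never start with incomplete inputs.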

For debugging and testing, the Job Manager can operate as a stand-alone utility in a VM. In this environment, job tasks are executed in the local VM, and the job script source code is not required to be in a git repository controlled by Arvados. As a result, the provenance information (if stored at all) is considerably less valuable.