Jonathan Sheffi, 02/17/2016 09:33 PM
h1. Computation and Pipeline Processing
Arvados has a number of capabilities for defining pipelines, using MapReduce to run distributed computations, and maintaining provenance and reproducibility.
h2. Design Goals
Notable design goals and features include:
* Make use of multiple cores and nodes to produce results faster
* Integrate with [[Keep]] and git repositories to maintain provenance
* Use off-the-shelf software tools in distributed computations
* Efficient over a wide range of problem sizes
* Maximum flexibility of programming language choice
* Maximum flexibility of execution environment
* Tools for building reusable pipelines
* Lower entry barrier for users
h2. MapReduce Introduction
Arvados uses MapReduce to schedule and distribute computations across a pool of compute nodes. "MapReduce":http://en.wikipedia.org/wiki/MapReduce is a programming model for large compute jobs which takes advantage of asynchronous processing opportunities. Programs written for MapReduce can be run efficiently on large clusters of commodity compute nodes. Although it has not yet been widely adopted for genomic analysis, MapReduce lends itself very well to speeding up the processing of genomic data: genomic data analysis tends to be embarrassingly parallel.
Google first described MapReduce in a "paper published in 2004.":http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/mapreduce-osdi04.pdf Hadoop is one of the most popular open source implementations of MapReduce, initially designed to work with web data. Arvados uses a purpose-built MapReduce engine that is optimized for analysis of biomedical data.
h2. The Arvados MapReduce Engine
Arvados MapReduce implements the basic MapReduce approach described in the Google paper, with several innovations targeting genomic and biomedical requirements. Arvados approaches MapReduce from the perspective of a bioinformatician.
Each Arvados MapReduce job contains sets of _job tasks_ which can be computed independently of one another, and can therefore be scheduled asynchronously according to available compute resources. Typically, jobs split input data into chunks which can be processed independently, and run a task for each chunk. This works well for genomic data.
Arvados does not make a distinction between “map” and “reduce” tasks or provide synchronous communication paths between tasks. However, a job can establish sequencing constraints to achieve a similar result (i.e., ensure that all map tasks have completed before a reduce task starts). In practice, the “reduce” stages of genomic analyses tend to be so simple that there is little to gain by introducing the complexity of scheduling and real-time communication between map and reduce tasks.
h2. Creating Pipelines
Because Arvados is designed for informatics problems, which typically involve sequences of analyses and data transformations, pipeline management tools are included with the system.
A _pipeline_ is a set of related MapReduce jobs. The most obvious example consists of two jobs, where the first job's output is the second job's input (e.g. BWA to GATK).
A _pipeline template_ is a file written in JSON that describes the relationships between jobs. For example, the template specifies that job A's output is job B's input. A pipeline template is analogous to a Makefile.
A pipeline template is constructed as set of _pipeline components_, each of which designates the name of a _job script_, data inputs, and parameters. Data inputs are specified as Keep _content addresses_. Job scripts are stored in a Git repository and are referenced using a commit hash or tag. The parameters and inputs can be defined in the template, or filled in later when the pipeline is instantiated.
Note: In Arvados, pipeline logic is ultimately controlled by the user and is not constrained or verified by the system. The included pipeline manager provides one flexible way to define and use pipelines built with MapReduce jobs. However, this is a convenience rather than an imposition: nothing prevents a user from building pipelines manually, or using a completely different pipeline-building tool. In such cases it is still preferable to store the pipeline structure and usage details in the Arvados database, so that provenance reporting tools can make use of them; but in any case, the job database always records the details of _what_ jobs were run and what the resulting output was, even when a pipeline tool neglects to provide the context of _why_ the jobs were run.
h2. Processing Pipelines and Jobs
To run a pipeline, a user (or application) invokes the pipeline manager and specifies a pipeline template along with specific values for the various inputs and parameters defined by the template. Once the inputs and parameters are specified and the pipeline is in use, we refer to this specific pipeline usage event as a _pipeline instance_.
Like the common “make” utility, the pipeline manager determines which components have all of their dependencies satisfied; tests whether any suitable jobs have already been run; submits new jobs; and, if necessary, waits for the jobs to finish. This process is repeated until a successful job has been assigned to each component of the pipeline.
When the pipeline manager is ready to submit a job, but notices that an identical or equivalent job has already been run, it uses the output of the existing job rather than submitting a new one. This makes pipelines run faster and reduces waste of compute resources. In some cases it is desirable to re-run jobs (e.g., to confirm reproducibility or to collect timing statistics) so this behavior can be overridden by the client.
The Arvados _job dispatcher_ is responsible for processing the submitted jobs. It takes jobs from the queue, allocates nodes according to the resource constraints provided, and invokes the _job manager_.
The job manager is responsible for executing a MapReduce job from start to finish. It executes each job task, enforces task sequencing and resource constraints as dictated by the job, checks process exit codes and other failure indicators, re-attempts failed tasks when needed, and stores status updates in the Arvados system database as the job progresses.
Each pipeline instance, and each job, is recorded in the system database to preserve provenance and make it easy to reproduce jobs, even long after they were initially run. The job manager systematically records runtime details (e.g., git commit hash for the job script, compute node’s operating system version) so that methods can be examined, verified, repeated, and reused -- as a rule, rather than as an exception.
h2. Benefits of Arvados MapReduce
There are several implementations of the MapReduce programming model available for developers, including Hadoop MapReduce and Amazon MapReduce. This poses an obvious question: why another version of MapReduce?
Although some of the pipeline and provenance features in Arvados could theoretically be implemented using Hadoop MapReduce, there are distinct benefits to Arvados MapReduce:
* *Simplicity* - Over the years Hadoop MapReduce has grown in complexity as its been extended to handle a wider and wider range of use cases across industries. Arvados MapReduce is a more simple implementation that meets the specific needs of genomic and biomedical data analysis. The simplicity makes it easier to learn, use, debug and administer.
* *Provenance and Reproducibility* - Like Keep, the Arvados distributed file system, Arvados MapReduce is designed to automate tracking the origin of result data, reproducing complex pipelines, and comparing pipelines to one another.
* *Ease of Use* - By implementing a set of interfaces to MapReduce that are designed for informatics problems and integrating these interfaces with the rest of the Arvados system, Arvados gives informaticians a unified framework that is easier to use and quicker to learn. Arvados provides tools suitable for informaticians who typically write Python and C programs, and are not eager to spend time adapting their code to a new language or execution environment.
* *Performance* - Most genomics problems are embarrassingly parallel so they don’t involve large reduce steps or sorts. By optimizing for map steps, Arvados MapReduce can deliver better performance for genomics related analyses.
* *Language and tool neutral* - Arvados MapReduce is designed to apply the benefits of MapReduce using unmodified bioinformatics tools -- from proprietary binary-only tools to in-house C programs -- using universal APIs like HTTP and UNIX pipes. Job scripts can be written in any language (e.g., Python, bash, Perl, R, Ruby) and they run in a normal UNIX environment. Typically, job scripts consist mostly of simple wrappers that invoke MapReduce-unaware tools.
* *Node-level resource allocation* - Arvados MapReduce uses a node as the basic computing resource unit: a compute node runs multiple asynchronous tasks, but only accepts tasks from one job at a time. This gives each job the flexibility to allocate CPU and RAM resources among its tasks to best suit the work being done, and avoids interference and resource competition between unrelated job tasks.
* *Efficient processing of small tasks* - Arvados MapReduce has very low task latency, making it practical to use for even very short single-task jobs. This makes it feasible for users and applications to routinely do _all_ computations in MapReduce and thereby achieve the benefits of complete provenance, reproducibility, and scalability.