h1. Computation and Pipeline Processing

Arvados has a number of capabilities for defining pipelines, using MapReduce to run distributed computations, and maintaining provenance and reproducibility.

h2. Design Goals

Notable design goals and features include:

* Make use of multiple cores and nodes to produce results faster
* Easy definition of pipelines
* Invocation of pipelines with different parameters
* Tracking the processing of jobs
* Recording and reproduction of pipelines
* Distributing computations with MapReduce
* Integrating with [[Keep]] and git repositories to maintain provenance
* Use of off-the-shelf software tools in distributed computations
* Efficiency over a wide range of problem sizes
* Maximum flexibility of programming language choice
* Maximum flexibility of execution environment
* Tools for building reusable pipelines
* Lower entry barrier for users

h2. MapReduce Introduction

Arvados uses MapReduce to schedule and distribute computations across a pool of compute nodes. "MapReduce":http://en.wikipedia.org/wiki/MapReduce is a programming model for large compute jobs which takes advantage of asynchronous processing opportunities; programs written for MapReduce can be run efficiently on large clusters of commodity compute nodes. Google first described MapReduce in a "paper published in 2004":http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/mapreduce-osdi04.pdf. Hadoop is one of the most popular open source implementations of MapReduce, initially designed to work with web data. Although MapReduce has not yet been widely adopted for genomic analysis, it lends itself very well to speeding up the processing of genomic data: genomic data analysis tends to be embarrassingly parallel.
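To make the "embarrassingly parallel" point concrete, here is a minimal sketch in plain Python (illustrative only, not Arvados code): the input is split into independent chunks, each chunk is processed by its own map task, and a trivial combine step runs only after every map task has finished.

```python
# Plain-Python sketch of the MapReduce pattern as it applies to genomic
# data (no Arvados APIs are used here). Each chunk is independent, so
# the map tasks could be scheduled on any node, in any order.
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(reads, chunk_size):
    """Divide the input into independent units of work."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]

def map_task(chunk):
    """Process one chunk; counting G/C bases stands in for real analysis."""
    return sum(base in "GC" for read in chunk for base in read)

def run_job(reads, chunk_size=2):
    chunks = split_into_chunks(reads, chunk_size)
    with ThreadPoolExecutor() as pool:   # map tasks run asynchronously
        partial = list(pool.map(map_task, chunks))
    return sum(partial)                  # trivial combine ("reduce") step
```

Because the chunks share no state, the result is the same whatever the chunk size or execution order, which is exactly what makes this style of workload easy to distribute.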
Arvados includes a purpose-built MapReduce engine that is optimized for analysis of biomedical data.

h2. The Arvados MapReduce Engine

Arvados MapReduce implements the basic MapReduce approach described in the Google paper, with several innovations specifically designed to address the needs of genomic and biomedical analysis. Arvados approaches MapReduce from the perspective of a bioinformatician, and is designed to make MapReduce easier to use, even for bioinformaticians who have not used it before. Each Arvados MapReduce job contains large sets of _job tasks_ which can be computed independently of one another, and can therefore be scheduled asynchronously according to available compute resources. Typically, jobs split input data into chunks which can be processed independently, and run a task for each chunk. This works well for genomic data.

Arvados does not make a distinction between "map" and "reduce" tasks or provide synchronous communication paths between tasks. However, a job can establish sequencing constraints to achieve a similar result (i.e., ensure that all map tasks have completed before a reduce task starts). In practice, the "reduce" stages of genomic computations tend to be so simple that there is little to gain by introducing the complexity of scheduling and real-time communication between map and reduce tasks: the focus is on map steps rather than reduce steps or big sorts.

h2. Pipelines

Because Arvados is designed for informatics problems, which typically involve sequences of analyses and data transformations, pipeline management tools are included with the system.

A _pipeline_ is a set of related MapReduce jobs. The most obvious example consists of two jobs, where the first job's output is the second job's input (e.g., BWA to GATK). A _pipeline template_ is a file written in JSON that describes the relationships among the component jobs.
For example, the template specifies that job A's output is job B's input. A pipeline template is analogous to a Makefile.

A pipeline template is constructed as a set of _pipeline components_, each of which designates a _job script_, data inputs, and parameters. Data inputs are specified as Keep _content addresses_. Job scripts are stored in a git repository and are referenced using a commit hash or tag. The parameters and inputs can be defined in the template, or filled in later when the pipeline is instantiated.

A pipeline _instance_ is the act or record of applying a pipeline template to a specific set of inputs. Generally, a pipeline instance refers to the UUIDs of the jobs that have been run to satisfy the pipeline components. Pipeline templates and pipeline instances are described in a simple JSON structure.

bq. Note: In Arvados, pipeline logic is ultimately controlled by the user and is not constrained or verified by the system. The included pipeline manager provides one flexible way to define and use pipelines built with MapReduce jobs. However, this is a convenience rather than an imposition: nothing prevents a user from building pipelines manually, or using a completely different pipeline-building tool. In such cases it is still preferable to store the pipeline structure and usage details in the Arvados database, so that provenance reporting tools can make use of them; but in any case, the job database always records the details of _what_ jobs were run and what the resulting output was, even when a pipeline tool neglects to provide the context of _why_ the jobs were run.

The pipeline life cycle:

* A pipeline template describes the types of jobs that need to run
* A pipeline instance is created by filling in the blanks in the template (input data and parameters)
* The pipeline manager submits jobs to the MapReduce Engine
* The MapReduce Engine runs a job (the job submits MapReduce tasks, which are distributed to compute nodes)
* Progress and outputs of pipeline instances and jobs are tracked in the system DB
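As an illustration of such a template, the sketch below uses hypothetical field names (not the exact Arvados JSON schema) to show a two-component pipeline in which the second component consumes the first one's output; the small helper extracts each component's dependencies, make-style.

```python
import json

# Hypothetical pipeline template (illustrative field names, not the real
# Arvados schema): "align" runs first; "variant-call" consumes its output.
# Data inputs are Keep content addresses; scripts are pinned to a git commit.
TEMPLATE = json.loads("""
{
  "components": {
    "align": {
      "script": "bwa-job.py",
      "script_version": "3d1f0e7",
      "inputs": {"reads": "d41d8cd98f00b204e9800998ecf8427e+0"}
    },
    "variant-call": {
      "script": "gatk-job.py",
      "script_version": "3d1f0e7",
      "inputs": {"alignment": {"output_of": "align"}}
    }
  }
}
""")

def dependencies(template, component):
    """List the components whose outputs this component consumes."""
    return [v["output_of"]
            for v in template["components"][component]["inputs"].values()
            if isinstance(v, dict) and "output_of" in v]
```

Here @dependencies(TEMPLATE, "variant-call")@ yields @["align"]@, which is all a make-style scheduler needs in order to decide what must run first.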
h2. Processing Pipelines and MapReduce Jobs

To run a pipeline, a user (or application) invokes the pipeline manager and specifies a pipeline template along with specific values for the various inputs and parameters defined by the template. Once the inputs and parameters are specified and the pipeline is in use, we refer to this specific pipeline usage event as a _pipeline instance_.

Like the common "make" utility, the pipeline manager determines which components have all of their dependencies satisfied; tests whether any suitable jobs have already been run; submits new jobs; and, if necessary, waits for the jobs to finish. This process is repeated until a successful job has been assigned to each component of the pipeline.

When the pipeline manager is ready to submit a job, but notices that an identical or equivalent job has already been run, it uses the output of the existing job rather than submitting a new one. This makes pipelines run faster and reduces waste of compute resources. In some cases it is desirable to re-run jobs (e.g., to confirm reproducibility or to collect timing statistics), so this behavior can be overridden by the client.

Applications and users add jobs to the queue by creating Job resources via the Arvados REST API. The Arvados _job dispatcher_ picks jobs from the queue, allocates nodes according to the resource constraints provided, and invokes the _job manager_. The job manager is responsible for executing a MapReduce job from start to finish. It executes each job task, enforces task sequencing and resource constraints as dictated by the job, checks process exit codes and other failure indicators, re-attempts failed tasks when needed, and keeps the Arvados system database up to date with the job's progress.
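The scheduling behavior described above can be sketched as a small make-like loop (a hypothetical sketch, not the actual pipeline manager): a component runs once all of its dependencies have outputs, and a component whose equivalence key matches an already-finished job reuses that job's output instead of submitting a new one.

```python
# Hypothetical sketch of the pipeline manager's make-like loop with job
# reuse. `components` maps a name to its dependencies and an equivalence
# key (e.g., script commit plus inputs); `finished` maps equivalence keys
# of previously run jobs to their outputs. Assumes an acyclic graph.
def run_pipeline(components, finished, submit):
    outputs = {}
    while len(outputs) < len(components):
        for name, spec in components.items():
            if name in outputs or not all(d in outputs for d in spec["deps"]):
                continue  # already satisfied, or dependencies not ready
            if spec["key"] in finished:
                outputs[name] = finished[spec["key"]]  # reuse equivalent job
            else:
                outputs[name] = submit(name)           # submit a new job
    return outputs
```

With this shape, seeding @finished@ from the job database is what makes repeated pipeline runs cheap, and running with an empty @finished@ corresponds to the "force re-run" override mentioned above.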
Each pipeline instance, and each job, is recorded in the system database to preserve provenance and make it easy to reproduce jobs, even long after they were initially run. The job manager systematically records runtime details (e.g., git commit hash for the job script, compute node's operating system version) so that methods can be examined, verified, repeated, and reused -- as a rule, rather than as an exception.

For purposes of debugging and testing, the job manager can also operate as a stand-alone utility in a VM. In this environment, job tasks are executed in the local VM, and job script source code is not necessarily stored in a git repository controlled by Arvados; therefore, the provenance information -- if stored at all -- is considerably less valuable.

h2. Benefits of Arvados MapReduce

There are several implementations of the MapReduce programming model available for developers, including Hadoop MapReduce and Amazon MapReduce. This poses an obvious question: why another version of MapReduce? Although some of the pipeline and provenance features in Arvados could theoretically be implemented using Hadoop MapReduce, there are distinct benefits to Arvados MapReduce:

* *Simplicity* - Over the years, Hadoop MapReduce has grown in complexity as it has been extended to handle a wider and wider range of use cases across industries. Arvados MapReduce is a simpler implementation that meets the specific needs of genomic and biomedical data analysis. The simplicity makes it easier to learn, use, debug, and administer.
* *Provenance and Reproducibility* - Like Keep, the Arvados distributed file system, Arvados MapReduce is designed to automate tracking the origin of result data, reproducing complex pipelines, and comparing pipelines to one another.
* *Ease of Use* - By implementing a set of interfaces to MapReduce that are designed for informatics problems and integrating these interfaces with the rest of the Arvados system, Arvados gives informaticians a unified framework that is easier to use and quicker to learn.
Arvados provides tools suitable for informaticians who typically write Python and C programs, and are not eager to spend time adapting their code to a new language or execution environment.
* *Performance* - Most genomics problems are embarrassingly parallel, so they don't involve large reduce steps or sorts. By optimizing for map steps, Arvados MapReduce can deliver better performance for genomics-related analyses.
* *Language and tool neutral* - Arvados MapReduce is designed to apply the benefits of MapReduce using unmodified bioinformatics tools -- from proprietary binary-only tools to in-house C programs -- using universal APIs like HTTP and UNIX pipes. Job scripts can be written in any language (e.g., Python, bash, Perl, R, Ruby) and they run in a normal UNIX environment. Typically, job scripts consist mostly of simple wrappers that invoke MapReduce-unaware tools.
* *Node-level resource allocation* - Arvados MapReduce uses a node as the basic computing resource unit: a compute node runs multiple asynchronous tasks, but only accepts tasks from one job at a time. This gives each job the flexibility to allocate CPU and RAM resources among its tasks to best suit the work being done, and avoids interference and resource competition between unrelated job tasks.
* *Efficient processing of small tasks* - Arvados MapReduce has very low task latency, making it practical to use for even very short single-task jobs. This makes it feasible for users and applications to routinely do _all_ computations in MapReduce and thereby achieve the benefits of complete provenance, reproducibility, and scalability.
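The kind of runtime detail that makes this complete provenance useful can be pictured as a simple record (the field names below are illustrative, not the actual Arvados schema): with the job script pinned to a git commit and the input named by a Keep content address, two runs can be compared to confirm they used the same method on the same data.

```python
# Illustrative job record (hypothetical field names and values). Pinning
# the script to a git commit and the input to a Keep content address is
# what makes a result reproducible long after it was first computed.
job_record = {
    "script": "bwa-job.py",
    "script_version": "3d1f0e7",   # git commit hash of the job script
    "input": "d41d8cd98f00b204e9800998ecf8427e+0",  # Keep content address
    "node_os": "Debian GNU/Linux 7",
}

def same_method(a, b):
    """True when two job records ran the same code on the same input."""
    keys = ("script", "script_version", "input")
    return all(a[k] == b[k] for k in keys)
```

Incidental details such as the node's OS version are recorded for auditing but deliberately excluded from the equivalence check, so a re-run on a different node still counts as the same method.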