Crunch » History » Revision 3

Revision 2 (Anonymous, 04/11/2013 01:53 AM) → Revision 3/21 (Tom Clegg, 04/11/2013 01:25 PM)

h1. Computation and Pipeline Processing 

 Arvados has a number of capabilities for defining pipelines, using MapReduce to run distributed computations, and maintaining provenance and reproducibility.

 h2. Design Goals 

 Notable design goals and features include: 

 * Easy definition of pipelines  
 * Invocation of pipelines with different parameters 
 * Tracking the processing of jobs 
 * Recording and reproduction of pipelines  
 * Distributing computations with MapReduce 
 * Integrating with [[Keep]] and git repositories to maintain provenance

 h2. MapReduce Introduction 

 Arvados is designed to make it easier for informaticians to use MapReduce to distribute computations across nodes.  

 (See the "MapReduce article at Wikipedia":http://en.wikipedia.org/wiki/MapReduce for an introduction to the MapReduce programming model.)

 Arvados includes a MapReduce engine specifically designed to address the needs of large sets of genomic data. In these datasets the computations tend to be embarrassingly parallel, so the focus is on map steps rather than reduce or big sorts. (Hadoop is a popular free implementation of MapReduce.)

  

 Arvados is designed to make MapReduce easier to use, even for bioinformaticians who have not used it before.
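
For readers new to the model, a map-heavy computation can be sketched in a few lines of plain Python. This is only an illustration of the programming model, not the Arvados engine: the function names, the thread pool, and the toy reads are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_gc_count(read):
    """Map step: count G and C bases in one read, independently of all others."""
    return Counter(base for base in read if base in "GC")


def reduce_counts(partials):
    """Reduce step: merge the per-read counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_mapreduce(reads, workers=4):
    # Each map task touches only its own input, so tasks can run in any
    # order and on any worker without coordination.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_gc_count, reads))
    return reduce_counts(partials)


if __name__ == "__main__":
    reads = ["GATTACA", "CCGG", "ATAT"]
    print(dict(run_mapreduce(reads)))
```

The reduce step here is a trivial merge, which mirrors the point above: when the work is embarrassingly parallel, nearly all of the effort sits in the map tasks.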

 h2. Pipelines 

 A pipeline is a set of related MapReduce jobs. The most obvious example consists of two jobs, where the first job's output is the second job's input. 

 A pipeline _template_ is a pattern that describes the relationships among the component jobs: for example, the template specifies that job A's output is job B's input. A pipeline template is analogous to a Makefile.

 A pipeline _instance_ is the act or record of applying a pipeline template to a specific set of inputs. Generally, a pipeline instance refers to the UUIDs of jobs that have been run to satisfy the pipeline components.

 Pipeline templates and instances are described in a simple JSON structure. 
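
The relationship between a template and an instance might be sketched as follows. The actual JSON schema is not given on this page, so every key name here (@components@, @output_of@, @from_parameter@, @jobs@) is hypothetical and chosen only to illustrate the two-job example above.

```python
import json

# Hypothetical template: every key name is illustrative, not the Arvados schema.
TEMPLATE = {
    "name": "align-then-call",
    "components": {
        "A": {"script": "align", "input": {"from_parameter": "reads"}},
        "B": {"script": "call-variants", "input": {"output_of": "A"}},
    },
}


def run_order(template):
    """Return component names so each one follows the component it reads from."""
    components = template["components"]
    ordered, seen = [], set()

    def visit(name, path=()):
        if name in seen:
            return
        if name in path:
            raise ValueError("cycle at component " + name)
        upstream = components[name]["input"].get("output_of")
        if upstream is not None:
            visit(upstream, path + (name,))
        seen.add(name)
        ordered.append(name)

    for name in sorted(components):
        visit(name)
    return ordered


def instantiate(template, parameters):
    """Build an instance record: the template plus concrete inputs and job slots."""
    return {
        "template": template["name"],
        "parameters": parameters,
        # Job UUIDs are unknown until jobs actually run, so start with None.
        "jobs": {name: None for name in run_order(template)},
    }


if __name__ == "__main__":
    instance = instantiate(TEMPLATE, {"reads": "example-input"})
    print(json.dumps(instance, indent=2))
```

As with a Makefile, the template only states dependencies; the order in which components run is derived from them.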

 h2. MapReduce Jobs 

 Applications and users add jobs to the queue by creating Job resources via the Arvados REST API. 

 The Arvados job dispatcher picks jobs from the queue, allocates nodes according to specified resource constraints, and invokes the Job Manager. 
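
A dispatcher of this kind could look roughly like the sketch below. The page does not describe the scheduling policy, so strict first-in-first-out ordering is an assumption, as are the @uuid@ and @min_nodes@ keys and the node names.

```python
from collections import deque


def dispatch(queue, free_nodes):
    """Start queued jobs in FIFO order while enough free nodes remain.

    `queue` is a deque of dicts with hypothetical keys "uuid" and "min_nodes".
    Returns a list of (job uuid, allocated node names) pairs.
    """
    free = list(free_nodes)
    started = []
    # Stop at the first job that does not fit, so later jobs never jump ahead.
    while queue and queue[0]["min_nodes"] <= len(free):
        job = queue.popleft()
        count = job["min_nodes"]
        allocated, free = free[:count], free[count:]
        started.append((job["uuid"], allocated))
    return started


if __name__ == "__main__":
    pending = deque([
        {"uuid": "job-1", "min_nodes": 2},
        {"uuid": "job-2", "min_nodes": 2},
    ])
    print(dispatch(pending, ["node-a", "node-b", "node-c"]))
```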

 The Job Manager executes each job task, enforces task sequencing and resource constraints, checks process exit codes and other failure indicators, re-attempts failed tasks when needed, and keeps the Arvados system/metadata DB up-to-date with the job's progress. 
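
The sequencing and retry behaviour described above can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the Job Manager itself: tasks are plain local subprocesses, the retry limit is arbitrary, and @record@ is a hypothetical stand-in for updating the metadata DB.

```python
import subprocess
import sys


def run_job(tasks, max_attempts=3, record=print):
    """Run tasks in order; retry a task whose exit code is nonzero.

    `tasks` is a list of (name, argv) pairs. `record` is a hypothetical
    stand-in for writing progress to the metadata database.
    Returns True if every task eventually succeeded.
    """
    for name, argv in tasks:
        for attempt in range(1, max_attempts + 1):
            exit_code = subprocess.run(argv).returncode
            record({"task": name, "attempt": attempt, "exit_code": exit_code})
            if exit_code == 0:
                break
        else:
            # All attempts failed: stop so later tasks never run out of sequence.
            return False
    return True


if __name__ == "__main__":
    ok = [sys.executable, "-c", "raise SystemExit(0)"]
    bad = [sys.executable, "-c", "raise SystemExit(1)"]
    print(run_job([("first", ok), ("second", bad)], max_attempts=2))
```

Recording every attempt, not just the final outcome, keeps the progress log useful for debugging tasks that succeed only after a retry.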

 For purposes of debugging and testing, the Job Manager can operate as a stand-alone utility in a VM. In this environment, job tasks are executed in the local VM and the job script source code is not required to be in a git repository controlled by Arvados. Therefore, the provenance information — if stored at all — is considerably less valuable.