Pipeline template development » History » Revision 11

Bryan Cosca, 04/21/2016 07:13 PM


Pipeline template development

This wiki describes how to write a pipeline template. Documentation for writing a pipeline template using run-command, including an example template, is already available on doc.arvados.org, along with further documentation for writing pipeline templates that run crunch scripts.

Here is an example pipeline template. Pipeline templates are composed of components, where each component is a job. The rest of the document describes the specific pieces of a component/job.

"components": {
 "JobName": {
  "script": "JobScript.py",
  "script_version": "master",
  "repository": "yourname/yourname",
  "script_parameters": {
   "CollectionOne": {
    "required": true,
    "dataclass": "Collection" 
   },
   "ParameterOne":{
    "required": true,
    "dataclass": "text",
    "default": "ParameterOneString" 
   }
  },
  "runtime_constraints": {
   "docker_image": "bcosc/arv-base-java" 
  }
 }
}

The script used for the job is specified by the 'script' parameter, at the commit hash or branch name given by 'script_version', in the Arvados git repository named by 'repository'. Note: GitHub repositories can also be used, as long as the repository is public. One important requirement is that your script must live in a folder called 'crunch_scripts' in the repository.
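As a sketch of the structure described above, the example template can be parsed with Python's standard json module and its components walked. The snippet below abbreviates the template text and assumes nothing beyond the field names already shown; it is illustrative only, not part of the Arvados SDK.

```python
import json

# Abbreviated copy of the example template above: one component ("JobName"),
# which is one job with a script, version, repository, parameters, and
# runtime constraints.
template = json.loads("""
{
  "components": {
    "JobName": {
      "script": "JobScript.py",
      "script_version": "master",
      "repository": "yourname/yourname",
      "script_parameters": {
        "CollectionOne": {"required": true, "dataclass": "Collection"}
      },
      "runtime_constraints": {"docker_image": "bcosc/arv-base-java"}
    }
  }
}
""")

# Each entry under "components" is one job.
for name, job in template["components"].items():
    print("%s runs %s@%s from %s" % (
        name, job["script"], job["script_version"], job["repository"]))
    # -> JobName runs JobScript.py@master from yourname/yourname
```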

When developing a pipeline, follow the Arvados best-practices guidelines for using your git repository effectively.

Writing script_parameters

Script_parameters are inputs that can be read by your crunch script. Each script parameter has one of four dataclasses: Collection, File, number, or text. Collection takes a collection's portable data hash (e.g. 39c6f22d40001074f4200a72559ae7eb+5745), File takes the path of a file within a collection (e.g. 39c6f22d40001074f4200a72559ae7eb+5745/foo.txt), number takes any integer, and text takes any string. A parameter can be made optional by setting its required flag to false.
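The four dataclasses can be illustrated with a small hypothetical validator (not part of the Arvados SDK) that checks whether a supplied value matches the shapes described above. The portable data hash format assumed here is a 32-digit hex MD5 plus "+&lt;size&gt;", as in the example above.

```python
import re

# Portable data hash: 32 hex digits, "+", decimal size.
PDH_RE = re.compile(r"^[0-9a-f]{32}\+\d+$")

def matches_dataclass(value, dataclass):
    """Hypothetical helper: does `value` fit the given dataclass?"""
    dc = dataclass.lower()
    if dc == "collection":
        # e.g. "39c6f22d40001074f4200a72559ae7eb+5745"
        return isinstance(value, str) and bool(PDH_RE.match(value))
    if dc == "file":
        # e.g. "39c6f22d40001074f4200a72559ae7eb+5745/foo.txt"
        if not isinstance(value, str):
            return False
        pdh, _, path = value.partition("/")
        return bool(PDH_RE.match(pdh)) and bool(path)
    if dc == "number":
        return isinstance(value, int)
    if dc == "text":
        return isinstance(value, str)
    return False
```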

The default parameter is useful for a collection you know will most likely be used, so the user does not have to input it manually. For example, a reference genome collection that will be used throughout the entire pipeline.
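The way defaults interact with user input can be sketched as follows. This is illustrative only, not Arvados code: anything the user omits falls back to the template's "default", and a required parameter with neither supplied value nor default is an error.

```python
def resolve_parameters(script_parameters, supplied):
    """Sketch of default resolution for script_parameters."""
    resolved = {}
    for name, spec in script_parameters.items():
        if name in supplied:
            # User-supplied value always wins.
            resolved[name] = supplied[name]
        elif "default" in spec:
            # Fall back to the template's default.
            resolved[name] = spec["default"]
        elif spec.get("required"):
            raise ValueError("missing required parameter: " + name)
    return resolved
```

With a spec like the bwa_collection example below, calling this with an empty input dict yields the default portable data hash, so the user only types anything when they want a different collection.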

The title and description parameters are useful for documenting what a script parameter does, but they are optional.

For example, pipeline template with script parameters:

"reference_collection":{
 "required":true,
 "dataclass":"Collection" 
},
"bwa_collection":{
 "required":true,
 "dataclass":"Collection",
 "default":"39c6f22d40001074f4200a72559ae7eb+5745" 
},
 "sample":{
 "required":true,
 "dataclass":"Collection",
 "title":"Input FASTQ Collection",
 "description":"Input the fastq collection for BWA mem" 
},
"read_group":{
 "required":true,
 "dataclass":"Text" 
},
"extra_file":{
 "required":true,
 "dataclass":"File" 
},
"extra_number":{
 "required":true,
 "dataclass":"number" 
},
"additional_params":{
 "required":false,
 "dataclass":"Text" 
},

which creates the corresponding pipeline instance.

The inputs tab in the pipeline instance page shows all the required parameters. You can click 'Choose' to pick a collection from a project for the reference_collection and Input FASTQ Collection parameters, and type in the read_group and extra_number values directly. You can also change the bwa_collection, but since a default is set, you only need to touch it when a different collection is required.

The "Components" tab in the pipeline instance page shows all the parameters. Thus it is the only place where non-required parameters, such as 'additional_params' may be set.

Writing runtime_constraints

Runtime_constraints are settings in your job that help select the node parameters your pipeline will run on. Guidance on optimizing them can be found in the Pipeline_Optimization wiki.

The docker_image runtime constraint selects the Docker image used to run your job. If it is not specified, the arvados/jobs image is used.

While developing, it is convenient to use the latest version of the image, which you can get by specifying the image by name. In production, you should instead pin the portable data hash of the exact image you want, to avoid problems if the named image is accidentally changed or otherwise conflicts.

Setting min_nodes will spin up that many nodes for your job. Be warned that this can allocate your entire cluster to a single job, so use it with caution.

The max_tasks_per_node parameter allows more than one task to run concurrently on a node; by default it is 1. If you are underutilizing your nodes, try increasing this number. For example, setting max_tasks_per_node to 4 allows 4 tasks to run on one compute node; any further tasks are queued until a node is free. The total number of compute nodes allocated to your job is set with min_nodes.
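The scheduling arithmetic above can be sketched in a couple of lines (illustrative back-of-the-envelope math, not how crunch itself schedules):

```python
def max_concurrent_tasks(min_nodes, max_tasks_per_node=1):
    # The cluster runs at most one "slot" per node times tasks-per-node.
    return min_nodes * max_tasks_per_node

def queued_tasks(total_tasks, min_nodes, max_tasks_per_node=1):
    # Tasks beyond the available slots wait in the queue.
    slots = max_concurrent_tasks(min_nodes, max_tasks_per_node)
    return max(0, total_tasks - slots)

# 2 nodes with max_tasks_per_node=4 give 8 slots, so 10 tasks leave 2 queued.
print(max_concurrent_tasks(2, 4))   # -> 8
print(queued_tasks(10, 2, 4))       # -> 2
```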

Keep in mind that the combined CPU/RAM/disk usage of the tasks on a node must fit on that node. It is very easy to misjudge how many resources your tasks use; a tool like crunchstat-summary can help bridge this gap.
