Revision 2 (Bryan Cosca, 04/19/2016 07:55 PM) → Revision 3/15 (Bryan Cosca, 04/19/2016 08:18 PM)

h1. Pipeline template development 

 This wiki describes how to write a pipeline template. Some documentation for writing a pipeline template using run-command is available on "": 

 "components": {
  "JobName": {
   "script": "JobScript",
   "script_version": "master",
   "repository": "yourname/yourname",
   "script_parameters": {
    "CollectionOne": {
     "required": true,
     "dataclass": "Collection"
    },
    "ParameterOne": {
     "required": true,
     "dataclass": "text",
     "default": "ParameterOneString"
    }
   },
   "runtime_constraints": {
    "docker_image": "bcosc/arv-base-java",
    "arvados_sdk_version": "master"
   }
  }
 }

 How to wrap a git repository containing a crunch script and a docker image into a component 
 Link to "Git Strategy for Pipeline Development" wiki page 

 h3. Writing script_parameters 

 "Script_parameters": are inputs that your crunch script can read. Each script parameter can have a dataclass of Collection, File, number, or text. Collection passes in the portable data hash (pdh) string (e.g. 39c6f22d40001074f4200a72559ae7eb+5745), File passes in a file path within a collection (e.g. 39c6f22d40001074f4200a72559ae7eb+5745/foo.txt), number passes in any integer, and text passes in any string. 

 The default parameter is useful for pre-filling a collection that will most likely be used, so the user does not have to enter it manually. One example is a reference genome collection that is used throughout the entire pipeline. 
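
 The dataclasses and the default described above can be illustrated with a small Python sketch. It is not how Arvados itself resolves parameters; the names @value_matches@ and @resolve_parameter@ are hypothetical, and the hash pattern is an assumption that only mirrors the shape of the example hash shown above (32 hexadecimal characters, a plus sign, and a size).

```python
import re

# Assumed shape of a portable data hash, taken from the example above:
# 32 lowercase hex characters, "+", then a decimal size.
PDH = r"[0-9a-f]{32}\+[0-9]+"


def value_matches(dataclass, value):
    """Return True if value has the shape that the dataclass passes in."""
    if dataclass == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if not isinstance(value, str):
        return False
    if dataclass == "Collection":
        return re.fullmatch(PDH, value) is not None
    if dataclass == "File":
        # A hash followed by a non-empty path inside the collection.
        return re.fullmatch(PDH + r"/.+", value) is not None
    return dataclass == "text"


def resolve_parameter(spec, user_value=None):
    """Pick the user's value, falling back to the default, then check it.

    Hypothetical helper for illustration; not part of the Arvados SDK.
    """
    value = spec.get("default") if user_value is None else user_value
    if value is None:
        if spec.get("required"):
            raise ValueError("required parameter has no value and no default")
        return None
    if not value_matches(spec["dataclass"], value):
        raise ValueError("value does not match dataclass " + spec["dataclass"])
    return value


# A required Collection parameter with a default, so the user may leave it blank.
reference = {
    "required": True,
    "dataclass": "Collection",
    "default": "39c6f22d40001074f4200a72559ae7eb+5745",
}
chosen = resolve_parameter(reference)
```

 Here @chosen@ ends up holding the default hash because no value was supplied by the user.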

 The title and description parameters are useful for showing what the script parameter does, but they are not required. 
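
 A script parameter that combines these fields might look like the sketch below, written as a Python dictionary and printed as JSON so that it can be pasted into a template. The parameter name @ReferenceGenome@, the title and description text, and the reuse of the example hash as the default are all placeholders chosen for illustration.

```python
import json

# Hypothetical parameter: the name, title, description and default are placeholders.
reference_parameter = {
    "ReferenceGenome": {
        "required": True,
        "dataclass": "Collection",
        # Pre-filled so the user does not have to enter it manually.
        "default": "39c6f22d40001074f4200a72559ae7eb+5745",
        # Optional fields that explain the parameter to the user.
        "title": "Reference genome",
        "description": "Collection holding the reference genome used throughout the pipeline.",
    }
}

# Serialize to JSON text suitable for the script_parameters section.
template_fragment = json.dumps(reference_parameter, indent=1)
print(template_fragment)
```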

 h3. Writing runtime_constraints 

 "Runtime_constraints": are job inputs that help choose the parameters of the nodes your pipeline will run on. Guidance on optimizing these parameters can be found in the "Pipeline_Optimization wiki.": 

 One notable runtime constraint is arvados_sdk_version. Currently, we do not suggest using it in production, as it can break pipeline reproducibility. Feel free to use it while developing a pipeline template, as it can be useful for getting the specific SDK version you want before downloading that version straight into the docker image. 

 Another runtime constraint is docker_image. While developing, it is suggested that you use the latest version of the image, which you can specify by using the name of the image. In production, you should use the portable data hash of the specific image you want, to avoid problems caused by accidentally changing the image or by other conflicts. 
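
 One way to move from the development settings to the production settings described above is sketched below in Python. The helper @pin_for_production@ is hypothetical and the hash passed to it is a placeholder; the sketch only edits a dictionary shaped like the runtime_constraints in the example template.

```python
import copy


def pin_for_production(runtime_constraints, image_pdh):
    """Return a copy with docker_image pinned and arvados_sdk_version removed.

    Hypothetical helper for illustration; not part of the Arvados SDK.
    """
    pinned = copy.deepcopy(runtime_constraints)
    # Refer to the image by portable data hash instead of by name.
    pinned["docker_image"] = image_pdh
    # arvados_sdk_version is not suggested for production use.
    pinned.pop("arvados_sdk_version", None)
    return pinned


# Development settings, as in the example template.
development = {
    "docker_image": "bcosc/arv-base-java",
    "arvados_sdk_version": "master",
}

# Placeholder hash: substitute the hash of the image you actually tested with.
production = pin_for_production(development, "39c6f22d40001074f4200a72559ae7eb+5745")
```

 The development dictionary is left untouched, so both variants can be kept side by side while the template is still changing.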

 Using min_nodes will spin up as many nodes as you've specified. Be warned that you can allocate your entire cluster to your job, so use this with caution. 

 Setting the max_tasks_per_node parameter allows you to run more tasks on each node. By default, this is 1. If you are underutilizing your nodes, you can try increasing this number. Keep in mind that the total CPU/RAM/space usage of your tasks should fit on your node. It's very easy to overestimate the compute power of your tasks. Using something like "crunchstat-summary": should help bridge this gap.
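
 The fit check described above comes down to simple arithmetic, sketched here in Python. The function name, the resource keys, and all of the figures are made up for illustration; real per-task numbers would have to come from measuring your own job.

```python
def max_tasks_that_fit(node, task):
    """Return how many concurrent tasks fit on one node.

    Each resource allows node // task copies; the scarcest resource sets
    the limit. Hypothetical helper for illustration only.
    """
    return min(int(node[resource] // task[resource]) for resource in task)


# Made-up node size and per-task peak usage.
node = {"cores": 16, "ram_gib": 60, "space_gib": 400}
task = {"cores": 2, "ram_gib": 10, "space_gib": 50}

# CPU allows 8 tasks, RAM allows 6, space allows 8, so RAM is the limit.
suggested_max_tasks_per_node = max_tasks_that_fit(node, task)
```

 With these figures the result is 6. A result of 0 would mean that a single task does not fit on the node at all.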