
Bryan Cosca, 04/21/2016 05:57 PM

h1. Pipeline template development

This wiki describes how to write a pipeline template. Documentation for writing a pipeline template that uses run-command is available on "doc.arvados.org.":http://doc.arvados.org/user/tutorials/running-external-program.html More documentation for writing pipeline templates that run crunch scripts can be found "here.":https://dev.arvados.org/projects/arvados/wiki/Writing_a_Script_Calling_a_Third_Party_Tool Here's an example pipeline template:

<pre>
"components": {
 "JobName": {
  "script": "JobScript.py",
  "script_version": "master",
  "repository": "yourname/yourname",
  "script_parameters": {
   "CollectionOne": {
    "required": true,
    "dataclass": "Collection"
   },
   "ParameterOne": {
    "required": true,
    "dataclass": "text",
    "default": "ParameterOneString"
   }
  },
  "runtime_constraints": {
   "docker_image": "bcosc/arv-base-java",
   "arvados_sdk_version": "master"
  }
 }
}
</pre>

The script to run is specified by the 'script' parameter. The 'script_version' parameter takes the commit hash or branch name to use, and 'repository' names the Arvados git repository that contains the script. Note: a GitHub repository can also be used, as long as it is public. One important requirement is that your script must be in a folder called 'crunch_scripts'.
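These fields are plain JSON, so if you generate templates programmatically you can build the same structure as a Python dict and serialize it. A minimal sketch, reusing the placeholder names from the example above (they are not real resources):

```python
import json

# Sketch only: the script, branch, and repository names below are the
# placeholders from the example template, not real resources.
job = {
    "script": "JobScript.py",           # must live in the repo's crunch_scripts/ folder
    "script_version": "master",         # branch name or commit hash
    "repository": "yourname/yourname",  # Arvados (or public GitHub) repository
}

template = {"components": {"JobName": job}}
print(json.dumps(template, indent=1))
```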

When developing a pipeline, follow the Arvados best-practices guideline for using your git repository effectively, available "here.":https://dev.arvados.org/projects/arvados/wiki/Git_strategy_for_pipeline_development

h3. Writing script_parameters

"Script_parameters":http://doc.arvados.org/api/schema/PipelineTemplate.html are inputs that can be read in your crunch script. Each script parameter has a dataclass: Collection, File, number, or text. Collection takes a portable data hash string (ex. 39c6f22d40001074f4200a72559ae7eb+5745), File takes a file path within a collection (ex. 39c6f22d40001074f4200a72559ae7eb+5745/foo.txt), number takes any integer, and text takes any string.
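A rough illustration of what each dataclass accepts, as a pure-Python checker. The PDH pattern (a 32-character md5 digest plus "+&lt;size&gt;") follows the example strings above; the helper itself is illustrative and not part of the Arvados SDK:

```python
import re

# Illustrative checker for the four dataclasses described above.
# A portable data hash is a 32-character md5 digest plus "+<size>".
PDH = re.compile(r"^[0-9a-f]{32}\+\d+$")

def matches_dataclass(value, dataclass):
    if dataclass == "Collection":        # a bare portable data hash
        return isinstance(value, str) and bool(PDH.match(value))
    if dataclass == "File":              # portable data hash, "/", path inside it
        head, sep, tail = str(value).partition("/")
        return bool(sep) and bool(PDH.match(head)) and tail != ""
    if dataclass == "number":            # any integer
        return isinstance(value, int) and not isinstance(value, bool)
    if dataclass == "text":              # any string
        return isinstance(value, str)
    return False

print(matches_dataclass("39c6f22d40001074f4200a72559ae7eb+5745", "Collection"))  # True
print(matches_dataclass("39c6f22d40001074f4200a72559ae7eb+5745/foo.txt", "File"))  # True
```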

The 'default' parameter is useful for a collection you know will most likely be used, so the user does not have to supply it manually: for example, a reference genome collection that is used throughout the entire pipeline.

The 'title' and 'description' parameters are useful for explaining what a script parameter does, but they are optional.

For example, a pipeline template with these script parameters:

<pre>
"reference_collection":{
 "required":true,
 "dataclass":"Collection"
},
"bwa_collection":{
 "required":true,
 "dataclass":"Collection",
 "default":"39c6f22d40001074f4200a72559ae7eb+5745"
},
"sample":{
 "required":true,
 "dataclass":"Collection",
 "title":"Input FASTQ Collection",
 "description":"Input the fastq collection for BWA mem"
},
"read_group":{
 "required":true,
 "dataclass":"Text"
},
"additional_params":{
 "required":false,
 "dataclass":"Text"
}
</pre>
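To see how 'required' and 'default' interact, here is a pure-Python sketch of input resolution. Workbench does this for you when you run the pipeline; the helper below is only illustrative, and the supplied read_group string is a made-up example:

```python
# Sketch: resolving user-supplied inputs against the parameter
# definitions above. Defaults fill gaps; missing required values
# are reported. This mimics, in plain Python, what Workbench does.
defs = {
    "reference_collection": {"required": True, "dataclass": "Collection"},
    "bwa_collection": {"required": True, "dataclass": "Collection",
                       "default": "39c6f22d40001074f4200a72559ae7eb+5745"},
    "read_group": {"required": True, "dataclass": "Text"},
    "additional_params": {"required": False, "dataclass": "Text"},
}

def resolve(defs, supplied):
    resolved, missing = {}, []
    for name, spec in defs.items():
        if name in supplied:
            resolved[name] = supplied[name]
        elif "default" in spec:
            resolved[name] = spec["default"]
        elif spec.get("required"):
            missing.append(name)
    return resolved, missing

# The user supplied only a read group (made-up string), so the
# required reference_collection is reported as missing.
resolved, missing = resolve(defs, {"read_group": "@RG\\tID:1\\tSM:sample1"})
print(missing)  # ['reference_collection']
```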

yields this view in Workbench:

!7d1b807a78e3bd095d02913dd1074ddf.png!

The Inputs tab on the pipeline instance page shows all the required parameters. You can click 'Choose' to select a collection from a project for the reference_collection and Input FASTQ Collection parameters, and you can type in the read_group you want to use. You can also change bwa_collection, but since a default collection is set, you only need to change it when necessary.

Since the 'additional_params' parameter is not required, it appears in the 'Components' tab instead, where you can set it:

!752b7261b5710fbf362db26f315fc45d.png!

h3. Writing runtime_constraints

"Runtime_constraints":http://doc.arvados.org/api/schema/Job.html are settings on your job that determine the node resources your pipeline runs on. Guidance on optimizing these parameters can be found in the "Pipeline_Optimization wiki.":https://dev.arvados.org/projects/arvados/wiki/Pipeline_Optimization

One runtime constraint is docker_image. While developing, it is convenient to use the latest version of the image, which you can do by specifying the image by name. In production, you should instead use the portable data hash of the exact image you want, to avoid problems if the named image is accidentally changed or conflicts with another image.
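A small sketch of the distinction. The helper is illustrative, not part of the Arvados SDK; the pinned form is the same 32-character md5-plus-size portable data hash used for collections, and the example PDH below is made up:

```python
import re

def is_pinned(docker_image):
    # Pinned means a portable data hash: 32-char md5 plus "+<size>".
    # Illustrative helper, not part of the Arvados SDK.
    return re.fullmatch(r"[0-9a-f]{32}\+\d+", docker_image) is not None

print(is_pinned("bcosc/arv-base-java"))                    # False: mutable name, fine for development
print(is_pinned("39c6f22d40001074f4200a72559ae7eb+5745"))  # True: pinned, use in production
```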

Setting min_nodes will spin up as many nodes as you specify. Be warned that you can allocate your entire cluster to your job, so use this with caution.

The max_tasks_per_node parameter lets you run more than one task concurrently on a node. By default, it is 1. If you are underutilizing your nodes, try increasing this number, but keep in mind that the total CPU/RAM/disk usage of the concurrent tasks must fit on the node. It is very easy to misjudge the resource usage of your tasks; a tool like "crunchstat-summary":https://dev.arvados.org/projects/arvados/wiki/Pipeline_Optimization can help bridge this gap.
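A back-of-the-envelope check that concurrent tasks actually fit on a node. The node and per-task figures are made-up illustrative numbers, not measurements; crunchstat-summary gives you the real ones:

```python
# Sketch: how many concurrent tasks fit on one node, taking the
# tighter of the CPU and RAM limits. Numbers are illustrative.
def tasks_that_fit(node_cores, node_ram_gb, task_cores, task_ram_gb):
    return min(node_cores // task_cores, int(node_ram_gb // task_ram_gb))

# A 16-core, 64 GiB node running 2-core, 12 GiB tasks is RAM-bound:
# 8 tasks fit by CPU, but only 5 by RAM.
print(tasks_that_fit(16, 64, 2, 12))  # 5
```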