Sarah Guthrie, 04/21/2016 07:45 PM

h1. Pipeline template development

This wiki describes how to write a pipeline template. Documentation for writing a pipeline template using run-command is already available on "doc.arvados.org.":http://doc.arvados.org/user/tutorials/running-external-program.html More documentation for writing pipeline templates to run crunch scripts can be found "here.":https://dev.arvados.org/projects/arvados/wiki/Writing_a_Script_Calling_a_Third_Party_Tool

Pipeline templates are composed of components, where each component is a job. Here is an example pipeline template; the rest of this document describes the specific pieces of a component/job.

<pre>
"components": {
 "JobName": {
  "script": "JobScript.py",
  "script_version": "master",
  "repository": "yourname/yourname",
  "script_parameters": {
   "CollectionOne": {
    "required": true,
    "dataclass": "Collection"
   },
   "ParameterOne": {
    "required": true,
    "dataclass": "text",
    "default": "ParameterOneString"
   }
  },
  "runtime_constraints": {
   "docker_image": "bcosc/arv-base-java"
  }
 }
}
</pre>
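
The component structure described above can also be inspected programmatically. The following is a minimal Python sketch, not part of the Arvados SDK: it assumes the example is wrapped in an outer pair of braces so that it parses as a JSON object, and that the second script parameter is named ParameterOne (inferred from its default value). The helper name summarize_components is hypothetical.

```python
import json

# The example template, wrapped in braces so that it is a valid JSON object.
# "ParameterOne" is an assumed key name, inferred from its default value.
TEMPLATE_JSON = """
{
 "components": {
  "JobName": {
   "script": "JobScript.py",
   "script_version": "master",
   "repository": "yourname/yourname",
   "script_parameters": {
    "CollectionOne": {"required": true, "dataclass": "Collection"},
    "ParameterOne": {"required": true, "dataclass": "text",
                     "default": "ParameterOneString"}
   },
   "runtime_constraints": {"docker_image": "bcosc/arv-base-java"}
  }
 }
}
"""


def summarize_components(template):
    """Return (job name, script, required parameter names) for each component."""
    summary = []
    for job_name, job in sorted(template["components"].items()):
        required = sorted(
            name
            for name, spec in job.get("script_parameters", {}).items()
            if isinstance(spec, dict) and spec.get("required")
        )
        summary.append((job_name, job["script"], required))
    return summary
```

Calling summarize_components(json.loads(TEMPLATE_JSON)) yields one entry per component, which makes it easy to see at a glance which inputs each job needs.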

The script used for the job is specified under 'script'. The commit hash or branch name to run is given under 'script_version', and the Arvados git repository that contains the script is given under 'repository'. GitHub repositories can also be used, as long as the repository is public. Note that your script must be in a folder called 'crunch_scripts'.

When developing a pipeline, follow the Arvados best-practices guideline for using your git repository effectively, found "here.":https://dev.arvados.org/projects/arvados/wiki/Git_strategy_for_pipeline_development

h3. Writing script_parameters

"Script_parameters":http://doc.arvados.org/api/schema/PipelineTemplate.html are inputs that can be accessed by your crunch script (see [[Writing_a_Script_Calling_a_Third_Party_Tool]] for an example). Each script parameter defines a dataclass: Collection, File, number, or text. The "Collection" dataclass passes the portable data hash of that collection as a string (e.g. 39c6f22d40001074f4200a72559ae7eb+5745), "File" passes a file path appended to the portable data hash (e.g. 39c6f22d40001074f4200a72559ae7eb+5745/foo.txt), "number" passes any integer, and "text" passes any string.
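
To make the four dataclasses concrete, here is a small Python sketch that checks whether a value has the shape described above. It is only an illustration: the helper matches_dataclass is hypothetical, and the portable data hash pattern (32 hexadecimal digits, a plus sign, then a size) is an assumption inferred from the example hash in the text.

```python
import re

# Assumed shape of a portable data hash, based on the example in the text:
# 32 hex digits, "+", then the size in bytes.
PDH_PATTERN = re.compile(r"^[0-9a-f]{32}\+\d+$")


def matches_dataclass(dataclass, value):
    """Return True if value has the shape described for the given dataclass."""
    if dataclass == "Collection":
        return isinstance(value, str) and PDH_PATTERN.match(value) is not None
    if dataclass == "File":
        if not isinstance(value, str):
            return False
        # A File is a portable data hash followed by "/" and a file path.
        pdh, sep, path = value.partition("/")
        return PDH_PATTERN.match(pdh) is not None and sep == "/" and path != ""
    if dataclass == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if dataclass == "text":
        return isinstance(value, str)
    raise ValueError("unknown dataclass: %r" % (dataclass,))
```

A check like this can be run in a crunch script before any real work starts, so that a malformed input fails early.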

Each script parameter includes a "required" boolean in the pipeline template. Setting "required" to false makes that parameter optional.

The default parameter is useful for a collection that you know will most likely be used, so the user does not have to enter it manually. For example, a reference genome collection that is used throughout the entire pipeline.
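
The interplay of 'required' and 'default' can be sketched in Python as follows. This illustrates the semantics described above and is not how Arvados itself resolves parameters; the function resolve_parameters and the parameter names in the usage note are hypothetical.

```python
def resolve_parameters(specs, user_values):
    """Combine user-supplied values with template defaults.

    specs maps parameter names to their template entries (dicts that may
    contain "required" and "default"); user_values maps names to the values
    the user entered.
    """
    resolved = {}
    missing = []
    for name, spec in specs.items():
        if name in user_values:
            # A value entered by the user always wins.
            resolved[name] = user_values[name]
        elif "default" in spec:
            # Otherwise fall back to the template default.
            resolved[name] = spec["default"]
        elif spec.get("required", False):
            # Required, no value, and no default: the user must supply it.
            missing.append(name)
    if missing:
        raise ValueError(
            "missing required parameters: " + ", ".join(sorted(missing))
        )
    return resolved
```

With a template that gives a reference collection a default, the user only has to supply the remaining required inputs; optional parameters without a default are simply left out of the result.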

The title and description parameters are useful for showing what the script parameter does, but they are not required.

For example, an excerpt of a pipeline template with script parameters, showing the title and description of its input FASTQ collection parameter:

<pre>
 "title":"Input FASTQ Collection",
 "description":"Input the fastq collection for BWA mem"
</pre>

Running the full pipeline template creates a pipeline instance.

The inputs tab on the pipeline instance page shows all the required parameters. You can click 'Choose' to select a collection from a project for the reference_collection and Input FASTQ Collection parameters. You can also type in the read_group and extra_number you want to use. The bwa_collection parameter has a default collection, so you only need to change it when you want a different one.

The "Components" tab on the pipeline instance page shows all the parameters. It is therefore the only place where non-required parameters, such as 'additional_params', can be set.

h3. Writing runtime_constraints

"Runtime_constraints":http://doc.arvados.org/api/schema/Job.html are job inputs that help choose the parameters of the nodes your pipeline will run on. Guidance on optimizing these parameters can be found in the "Pipeline_Optimization wiki.":https://dev.arvados.org/projects/arvados/wiki/Pipeline_Optimization

The "docker_image":http://doc.arvados.org/api/schema/Job.html runtime constraint controls the Docker image used to run your job. If it is not specified, the arvados/jobs image is used. The base resources a Docker image needs in order to run in Arvados can be found "here.":https://dev.arvados.org/projects/arvados/repository/revisions/master/entry/docker/base/Dockerfile

While developing, it is suggested that you use the latest version of the image, which you can specify by using the name of the image. In production, use the portable data hash of the specific image you want, so that accidental changes to the image or other conflicts do not affect your pipeline.

Using min_nodes will spin up as many nodes as you specify for your job. Be warned that you can allocate your entire cluster to your job, so use this with caution.

The max_tasks_per_node parameter lets you run more tasks on each node. By default, this is 1. If you are underutilizing your nodes, you can try increasing this number. For example, setting max_tasks_per_node to 4 allows 4 tasks to run on one compute node. If there are more tasks to be scheduled, they are queued until a compute node is free. The total number of compute nodes allocated to your job is specified using min_nodes. Currently, only tasks from the same job are scheduled on the same node. Multiple jobs on the same node are on the roadmap for Crunch v2.
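
The relationship between min_nodes and max_tasks_per_node comes down to simple arithmetic, sketched below in Python. This is a rough model rather than the Crunch scheduler: it assumes the job gets exactly min_nodes nodes and that tasks run in equal-length "waves", and the function name task_capacity is hypothetical.

```python
import math


def task_capacity(total_tasks, min_nodes, max_tasks_per_node=1):
    """Return (concurrent task slots, number of waves) for a job.

    Simplified model: every node runs max_tasks_per_node tasks at once and
    queued tasks start only when a slot frees up.
    """
    if total_tasks < 0 or min_nodes < 1 or max_tasks_per_node < 1:
        raise ValueError(
            "total_tasks must be >= 0; node and task counts must be >= 1"
        )
    concurrent_slots = min_nodes * max_tasks_per_node
    waves = math.ceil(total_tasks / concurrent_slots)
    return concurrent_slots, waves
```

Under this model, 10 tasks on 2 nodes with max_tasks_per_node set to 4 give 8 slots, so the tasks finish in 2 waves; with the default of 1 task per node, the same job has 2 slots and needs 5 waves.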

Keep in mind that the total CPU/RAM/space usage of your tasks should fit on your node. It is very easy to misjudge the resources your tasks need. A tool like "crunchstat-summary":https://dev.arvados.org/projects/arvados/wiki/Pipeline_Optimization should help bridge this gap.
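
One way to reason about this fit is to compute how many tasks a node can hold given per-task resource needs. The Python sketch below is only an illustration: the function max_tasks_that_fit, the resource names, and the figures in the usage note are hypothetical, and real per-task usage should come from measurements of your own jobs.

```python
def max_tasks_that_fit(node, task):
    """Return the largest max_tasks_per_node that fits on one node.

    node and task map resource names (for example "cores", "ram_mib",
    "scratch_mib") to the amount available on the node and the amount one
    task needs. Every resource named in task must also appear in node.
    """
    if not task:
        raise ValueError("task must name at least one resource")
    limits = []
    for resource, needed in task.items():
        if needed <= 0:
            raise ValueError("per-task %s must be positive" % resource)
        # Whole tasks only: integer division gives the per-resource limit.
        limits.append(node[resource] // needed)
    # The scarcest resource decides how many tasks fit.
    return min(limits)
```

For a hypothetical node with 16 cores, 60000 MiB of RAM, and 400000 MiB of scratch space, tasks that each need 4 cores, 20000 MiB of RAM, and 50000 MiB of scratch are limited by RAM, so at most 3 fit. A result of 0 means a single task does not fit on the node at all.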