Writing a Script Calling a Third Party Tool » History » Version 22

Sarah Guthrie, 04/08/2016 11:34 PM

{{>toc}}

h1. Writing a Script Calling a Third Party Tool

h2. Case study: FastQC

# Building an environment able to run FastQC
## Writing a Dockerfile
## Building a docker image from the Dockerfile
## Uploading the docker image to an Arvados instance
# Writing a crunch script that runs FastQC (in the docker image)
## Calling FastQC
## Where to place temporary files
## Writing output data
# Writing a pipeline template to run the crunch script

h3. Writing a Dockerfile

Docker's documentation describes Dockerfiles as follows:

> Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.
> 
> This page (https://docs.docker.com/engine/reference/builder/) describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.

Docker's documentation on writing Dockerfiles is thorough, and we recommend reading it to understand how the finished Dockerfile below was put together:

* A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
* Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/

We strongly recommend keeping your Dockerfiles in the same git repository as the crunch scripts that run inside the docker images built from them.

The following Dockerfile installs FastQC:

<pre>
FROM arvados/jobs

USER root

RUN apt-get -q update && apt-get -qy install \
  fontconfig \
  openjdk-6-jre-headless \
  perl \
  unzip \
  wget

USER crunch

RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
    unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
</pre>

h3. How to build a docker image from a Dockerfile

Once you have a Dockerfile, use the @docker build@ command to build an image from its instructions. The final argument is the directory that contains the Dockerfile, not the Dockerfile itself.

<pre>
docker build -t username/imagename path/to/Dockerfile/
</pre>

h3. How to upload a docker image to Arvados

Once the docker image is built, use the Arvados CLI (http://doc.arvados.org/sdk/cli/index.html) command @arv keep docker@ to upload it to an Arvados cluster.

<pre>
arv keep docker username/imagename
</pre>

h3. How to call an external tool from a crunch script

We strongly recommend using the @subprocess@ module to call external tools. If the output is small and written to standard output, @subprocess.check_output@ returns that output and raises @subprocess.CalledProcessError@ if the tool exits with a non-zero status.

<pre>
import subprocess

# Returns the tool's standard output; raises CalledProcessError on failure
foo = subprocess.check_output(['echo', 'foo'])
</pre>
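
To make the "completed successfully" check concrete, here is a minimal sketch. The helper @run_and_capture@ is hypothetical (it is not part of the Arvados SDK), and @sys.executable@ stands in for a real tool so the sketch runs anywhere Python does.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Return (ok, output) for cmd; ok is False if the tool exited non-zero."""
    try:
        return True, subprocess.check_output(cmd)
    except subprocess.CalledProcessError as err:
        # err.output holds whatever the tool printed before it failed
        return False, err.output


# A command that succeeds and prints to standard output
ok, out = run_and_capture([sys.executable, '-c', 'print("foo")'])

# A command that exits with status 3
failed_ok, _ = run_and_capture([sys.executable, '-c', 'import sys; sys.exit(3)'])
```

In a real crunch script it is usually simpler to let the exception propagate so that the script stops at the failing tool; the helper catches it only to make the behaviour visible.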

If the output is large, @subprocess.check_call@ can redirect it to a file while still raising an error if the tool fails.

<pre>
import subprocess

# Stream the output straight to the file 'foo' instead of holding it in memory
with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
</pre>
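
The same pattern can be wrapped in a small function. The sketch below is only illustrative: @run_to_file@ is a hypothetical helper, the output path is a throwaway temporary file, and @sys.executable@ again stands in for the external tool.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(cmd, out_path):
    """Run cmd with standard output redirected to out_path; return the file size."""
    with open(out_path, 'wb') as outfile:
        # check_call raises CalledProcessError if cmd exits non-zero
        subprocess.check_call(cmd, stdout=outfile)
    return os.path.getsize(out_path)


out_path = os.path.join(tempfile.mkdtemp(), 'big_output.bin')
size = run_to_file(
    [sys.executable, '-c', 'import sys; sys.stdout.write("x" * 100000)'],
    out_path)
```

Because the child process writes directly to the file, the calling script never holds the output in memory.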

FastQC writes its results to files, either in a default directory or in the directory given by the @-o@ flag. Since there is no standard output to capture, we can use @subprocess.check_call@:

<pre>
import subprocess
import arvados

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

# The fastqc launcher is run with perl, as installed by the Dockerfile above
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
subprocess.check_call(cmd)
</pre>
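
Since the command line grows as flags are added in the following sections, it can help to build the argument list in one place. This is a sketch only: @build_fastqc_cmd@ and @FASTQC_PATH@ are hypothetical names, the input and output paths are placeholders, and the only flags used are @-o@ and @-t@, which appear elsewhere on this page.

```python
# Install location taken from the Dockerfile on this page
FASTQC_PATH = '/home/crunch/fastqc/FastQC/fastqc'


def build_fastqc_cmd(fastq_file, outdir=None, threads=None):
    """Return the argument list for a FastQC run (hypothetical helper)."""
    cmd = ['perl', FASTQC_PATH, fastq_file]
    if outdir is not None:
        cmd += ['-o', outdir]
    if threads is not None:
        # subprocess expects every argument to be a string
        cmd += ['-t', str(threads)]
    return cmd


# Placeholder paths, for illustration only
cmd = build_fastqc_cmd('/tmp/sample.fastq', outdir='/tmp/out', threads=4)
```

Passing a list rather than a single shell string means file names containing spaces or shell metacharacters need no quoting.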

h3. Where to put temporary files

The current task object exposes a temporary directory as its @tmpdir@ attribute:

<pre>
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir
</pre>

Applied to the FastQC script:

<pre>
import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

# Write the FastQC results to the task's temporary directory
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
subprocess.check_call(cmd)
</pre>
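
If a script needs several independent scratch areas, one option is to create subdirectories under the task's temporary directory with the standard @tempfile@ module. The sketch below assumes nothing about Arvados: @tempfile.gettempdir()@ stands in for @task.tmpdir@ so that it runs outside a crunch job, and the file name is a placeholder.

```python
import os
import shutil
import tempfile

# Stand-in for task.tmpdir so the sketch runs outside Arvados (assumption)
tmpdir = tempfile.gettempdir()

# Create a uniquely named scratch directory under tmpdir
scratch = tempfile.mkdtemp(prefix='fastqc-', dir=tmpdir)
try:
    report_path = os.path.join(scratch, 'report.txt')
    with open(report_path, 'w') as report:
        report.write('intermediate data\n')
    created = os.path.exists(report_path)
finally:
    # Remove the scratch directory and everything in it
    shutil.rmtree(scratch)

removed = not os.path.exists(scratch)
```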
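
h3. How to write data directly to Keep (Using TaskOutputDir)

@arvados.crunch.TaskOutputDir@ provides a directory, @outdir.path@, whose contents are written to Keep. Once the files are written, @manifest_text()@ returns their manifest, which is then recorded as the task output:

<pre>
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Write to outdir.path

# Record the manifest of the written files as the output of the current task
arvados.current_task().set_output(outdir.manifest_text())
</pre>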

Applied to the FastQC script:

<pre>
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

# FastQC writes its results straight into the Keep-backed output directory
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path]
subprocess.check_call(cmd)

arvados.current_task().set_output(outdir.manifest_text())
</pre>
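
Before recording the output, it can be useful to confirm that the tool actually wrote something. The following sketch lists the files under a directory; @list_outputs@ is a hypothetical helper, and the demo directory and file names are placeholders standing in for @outdir.path@ and real FastQC results.

```python
import os
import tempfile


def list_outputs(root):
    """Return sorted paths, relative to root, of all files under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


# Build a small placeholder tree to stand in for an output directory
demo_root = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_root, 'sample_fastqc'))
for rel in ('sample_fastqc.html', os.path.join('sample_fastqc', 'summary.txt')):
    with open(os.path.join(demo_root, rel), 'w') as handle:
        handle.write('placeholder\n')

outputs = list_outputs(demo_root)
```

A script could raise an error when the list is empty instead of recording an empty output.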

h3. When TaskOutputDir is not the correct choice

* If the tool writes symbolic links or named pipes, which are not supported by the FUSE mount
* If the tool's I/O access patterns perform poorly over FUSE
** This occurs with Tophat, which keeps 20 file handles open across the multiple files it writes out

In those cases, write the output to local disk, then open a collection writer and add the files and/or directory trees to it:

<pre>
import arvados

collection_writer = arvados.collection.CollectionWriter()
# Add a single file, then a whole directory tree (both paths are placeholders)
collection_writer.write_file('foo.txt')
collection_writer.write_directory_tree(bar_directory_path)
# finish() stores the collection; its return value is recorded as the task output
arvados.current_task().set_output(collection_writer.finish())
</pre>

Applied to the FastQC script:

<pre>
import os
import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

# Have FastQC write to a local directory inside the task's temporary directory
outdir_path = os.path.join(tmpdir, 'out')
os.mkdir(outdir_path)

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path]
subprocess.check_call(cmd)

# Upload the finished output directory and record it as the task output
collection_writer = arvados.collection.CollectionWriter()
collection_writer.write_directory_tree(outdir_path)
task.set_output(collection_writer.finish())
</pre>
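
To find out whether a tool produces the symbolic links or named pipes that rule out @TaskOutputDir@, one approach is to run it once against a local directory and inspect the result. The sketch below is illustrative and POSIX-only (it uses @os.symlink@ and @os.mkfifo@ to build a demo tree); @find_unsupported_entries@ is a hypothetical helper.

```python
import os
import stat
import tempfile


def find_unsupported_entries(root):
    """Return relative paths under root that are symbolic links or named pipes."""
    flagged = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full_path = os.path.join(dirpath, name)
            # lstat inspects the entry itself rather than a symlink's target
            mode = os.lstat(full_path).st_mode
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                flagged.append(os.path.relpath(full_path, root))
    return sorted(flagged)


# Demo tree: one regular file, one symbolic link, one named pipe
demo_root = tempfile.mkdtemp()
with open(os.path.join(demo_root, 'report.txt'), 'w') as handle:
    handle.write('ok\n')
os.symlink('report.txt', os.path.join(demo_root, 'latest'))
os.mkfifo(os.path.join(demo_root, 'pipe'))

flagged = find_unsupported_entries(demo_root)
```

An empty result covers only the first bullet above; poor I/O performance over FUSE has to be judged by measuring the tool itself.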

h3. The final crunch script

*fastqc.py*
<pre>
import multiprocessing
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

# Grab the number of CPUs on the node to use as the FastQC thread count
num_threads = multiprocessing.cpu_count()

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)]
subprocess.check_call(cmd)

arvados.current_task().set_output(outdir.manifest_text())
</pre>

h3. Writing a pipeline template to run the crunch script

Now we need to write a pipeline template that specifies this crunch script and the docker image we created earlier. Arvados uses the copy of the pipeline template stored on the API server, but, as with the Dockerfile, keeping the template in the same repository as the code makes it easier to maintain and helps ensure that changes to the code are reflected in the template.

Using the command @arv create pipeline_template@, we can create the following pipeline template:

<pre>
{
  "name": "FastQC Pipeline",
  "components": {
    "Run-FastQC": {
      "repository": "repository/name",
      "script": "fastqc.py",
      "script_version": "master",
      "script_parameters": {
        "input_fastq_file": {
          "dataclass": "Collection",
          "required": true,
          "title": "Input Paired FASTQ RNA-Seq files"
        }
      },
      "runtime_constraints": {
        "docker_image": "username/imagename",
        "max_tasks_per_node": 1
      }
    }
  }
}
</pre>
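
Because the template lives in the repository as plain JSON, it can be checked before it is uploaded. The sketch below is an assumption-laden example, not Arvados's own validation: @check_template@ and @REQUIRED_KEYS@ are hypothetical, and the list of required keys simply mirrors the fields used in the template on this page.

```python
import json

# Keys every component is expected to have (assumption based on this page's template)
REQUIRED_KEYS = ('repository', 'script', 'script_version', 'script_parameters')

TEMPLATE_TEXT = '''
{
  "name": "FastQC Pipeline",
  "components": {
    "Run-FastQC": {
      "repository": "repository/name",
      "script": "fastqc.py",
      "script_version": "master",
      "script_parameters": {},
      "runtime_constraints": {"docker_image": "username/imagename"}
    }
  }
}
'''


def check_template(text):
    """Return a list of problems found in a pipeline template (empty if none)."""
    template = json.loads(text)  # raises ValueError on malformed JSON
    problems = []
    for name, component in template.get('components', {}).items():
        for key in REQUIRED_KEYS:
            if key not in component:
                problems.append('%s: missing %s' % (name, key))
        if 'docker_image' not in component.get('runtime_constraints', {}):
            problems.append('%s: missing docker_image' % name)
    return problems


problems = check_template(TEMPLATE_TEXT)
```

A check like this catches typos and missing fields early; it does not confirm that the repository, script, or docker image actually exist on the cluster.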

For further information about managing a pipeline template, see [[Git_strategy_for_pipeline_development]].