Writing a Script Calling a Third Party Tool » History » Version 16

Sarah Guthrie, 04/07/2016 10:22 PM

{{>toc}}

h1. Writing a Script Calling a Third Party Tool

h2. Case study: FastQC

# Building an environment able to run FastQC
## Writing a Dockerfile
## Building a docker image from the Dockerfile
## Uploading the docker image to an Arvados instance
# Writing a crunch script that runs FastQC (in the docker image)
## Calling FastQC
## Where to place temporary files
## Writing output data
# Writing a pipeline template to run the crunch script

h3. Writing a Dockerfile

Dockerfiles, as explained by the Docker documentation:

> Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.
> 
> This page (https://docs.docker.com/engine/reference/builder/) describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.

Docker's documentation on Dockerfiles is thorough, and we recommend consulting it for guidance on arriving at the finished Dockerfile below:
* A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
* Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/

We strongly recommend keeping your Dockerfiles in the same git repository as the crunch scripts that run inside the docker images they create.

Dockerfile that installs FastQC:

<pre>
FROM arvados/jobs

USER root

RUN apt-get -q update && apt-get -qy install \
  fontconfig \
  openjdk-6-jre-headless \
  perl \
  unzip \
  wget

USER crunch

RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
    unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
</pre>

h3. How to build a docker image from a Dockerfile

Once you have a Dockerfile, use the @docker build@ command to build an image from its instructions. The final argument is the path to the directory that contains the Dockerfile.

<pre>
docker build -t username/imagename path/to/Dockerfile/
</pre>

h3. How to upload a docker image to Arvados

Once the docker image is built, use the @arv keep docker@ command from the Arvados CLI (http://doc.arvados.org/sdk/cli/index.html) to upload the image to an Arvados cluster.

<pre>
arv keep docker username/imagename
</pre>

h3. How to call an external tool from a crunch script

We strongly recommend using Python's @subprocess@ module to call external tools. If the output is small and written to standard output, @subprocess.check_output@ returns that output and raises an exception if the tool exits with a non-zero status.

<pre>
import subprocess

# Returns the command's standard output; raises CalledProcessError on failure
foo = subprocess.check_output(['echo','foo'])
</pre>

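To illustrate what "completed successfully" means here, the sketch below wraps @subprocess.check_output@ and catches the @CalledProcessError@ it raises on a non-zero exit status. The helper name @run_and_capture@ is hypothetical, and the child commands are stand-ins launched with the current Python interpreter so the example does not depend on any particular tool being installed.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (ok, output).

    ok is False when the command exits with a non-zero status.
    Hypothetical helper, for illustration only.
    """
    try:
        return True, subprocess.check_output(cmd)
    except subprocess.CalledProcessError as err:
        # err.output holds whatever the tool printed before failing
        return False, err.output


# Stand-in for a tool that succeeds and prints to standard output
ok, out = run_and_capture([sys.executable, '-c', 'print("foo")'])

# Stand-in for a tool that fails with exit status 3
failed_ok, failed_out = run_and_capture(
    [sys.executable, '-c', 'import sys; sys.exit(3)'])
```

In a crunch script it is usually reasonable to let the exception propagate so that the task is marked as failed; catching it, as here, is mainly useful when you want to log the partial output first.
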
If the output is large, @subprocess.check_call@ can redirect it to a file while still verifying that the tool completed successfully.

<pre>
import subprocess

# The tool writes straight to the file, so the output never sits in memory
with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
</pre>

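The same pattern can be wrapped in a small function, sketched below under some assumptions: the helper name @run_to_file@ is hypothetical, the output file lives in a scratch directory created with @tempfile.mkdtemp@, and the child command is a stand-in that prints 100000 bytes using the current Python interpreter.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(cmd, out_path):
    """Run cmd with its standard output redirected to out_path.

    Returns the size of the resulting file in bytes and raises
    subprocess.CalledProcessError if the command fails.
    Hypothetical helper, for illustration only.
    """
    with open(out_path, 'wb') as outfile:
        subprocess.check_call(cmd, stdout=outfile)
    return os.path.getsize(out_path)


# Stand-in for a tool that produces a lot of output
scratch_dir = tempfile.mkdtemp()
out_path = os.path.join(scratch_dir, 'tool_output.txt')
child_code = 'import sys; sys.stdout.write("x" * 100000)'
size = run_to_file([sys.executable, '-c', child_code], out_path)
```

Opening the file in binary mode keeps the helper indifferent to whether the tool emits text or binary data.
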
FastQC writes its results to files, either in its default output directory or in the directory specified by the @-o@ flag, rather than to standard output, so we can call it with @subprocess.check_call@:

<pre>
import subprocess
import arvados

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
subprocess.check_call(cmd)
</pre>

h3. Where to put temporary files

Each task has a temporary directory, available as @tmpdir@ on the current task object:

<pre>
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir
</pre>

Applied to the FastQC script, with @-o@ directing FastQC's output into the temporary directory:

<pre>
import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
subprocess.check_call(cmd)

</pre>

h3. How to write data directly to Keep (Using TaskOutputDir)

@arvados.crunch.TaskOutputDir@ provides a directory, @outdir.path@, whose contents are written directly to Keep. Once the output has been written, set the task output to the directory's manifest:

<pre>
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Write to outdir.path

arvados.task_set_output(outdir.manifest_text())
</pre>

Applied to the FastQC script, with @-o@ directing FastQC's output into @outdir.path@:

<pre>
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path]
subprocess.check_call(cmd)

arvados.task_set_output(outdir.manifest_text())
</pre>

h3. When TaskOutputDir is not the correct choice

* If the tool writes symbolic links or named pipes, which are not supported by FUSE
* If the tool's I/O access patterns perform poorly over FUSE
** This occurs with Tophat, which opens 20 file handles on the multiple files it writes out

In these cases, have the tool write to local disk, then open a collection writer and write the files and/or directory trees to Keep:

<pre>
import arvados

collection_writer = arvados.collection.CollectionWriter()
# 'foo.txt' and bar_directory_path are placeholders for your own output paths
collection_writer.write_file('foo.txt')
collection_writer.write_directory_tree(bar_directory_path)
arvados.task_set_output(collection_writer.finish())
</pre>

Applied to the FastQC script, with FastQC writing to a local directory under the task's temporary directory before the results are uploaded:

<pre>
import subprocess
import arvados
import os

task = arvados.current_task()
tmpdir = task.tmpdir

# Local directory that FastQC writes into
outdir_path = os.path.join(tmpdir, 'out')
os.mkdir(outdir_path)

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path]
subprocess.check_call(cmd)

# Upload everything FastQC wrote and record it as the task output
collection_writer = arvados.collection.CollectionWriter()
collection_writer.write_directory_tree(outdir_path)
arvados.task_set_output(collection_writer.finish())

</pre>

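One way to find out whether the first case applies to a given tool is to run it once into a local directory and inspect what it wrote. The sketch below walks a directory and reports any symbolic links or named pipes. The helper name @find_unsupported_entries@ is hypothetical, the scratch directory and its contents are made up for the example, and @os.mkfifo@ is only available on Unix-like systems.

```python
import os
import stat
import tempfile


def find_unsupported_entries(root):
    """Return sorted paths (relative to root) of symbolic links and named pipes.

    Hypothetical helper, for illustration only.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode  # lstat does not follow symlinks
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                found.append(os.path.relpath(path, root))
    return sorted(found)


# Example: a scratch directory holding a regular file, a symlink and a named pipe
scratch_dir = tempfile.mkdtemp()
with open(os.path.join(scratch_dir, 'report.txt'), 'w') as f:
    f.write('ok\n')
os.symlink('report.txt', os.path.join(scratch_dir, 'latest'))
os.mkfifo(os.path.join(scratch_dir, 'pipe'))

unsupported = find_unsupported_entries(scratch_dir)
```

An empty result suggests the tool's output is at least structurally compatible with @TaskOutputDir@; it says nothing about the second case, I/O performance, which has to be measured.
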
h3. The final crunch script

<pre>
import multiprocessing
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

# Grab the number of threads available
num_threads = multiprocessing.cpu_count()

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)]
subprocess.check_call(cmd)

arvados.task_set_output(outdir.manifest_text())
</pre>

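If you want to check the command line without running FastQC or Arvados, the construction of @cmd@ can be factored into a function, as in the sketch below. The helper name @build_fastqc_cmd@, the constant @FASTQC_PATH@ and the example arguments are hypothetical; the path and flags are copied from the script above.

```python
import multiprocessing

# Location where the Dockerfile above unpacks FastQC
FASTQC_PATH = '/home/crunch/fastqc/FastQC/fastqc'


def build_fastqc_cmd(fastq_file, output_dir, num_threads=None):
    """Return the argument list used to run FastQC on one file.

    Hypothetical helper; mirrors the cmd list in the script above.
    """
    if num_threads is None:
        num_threads = multiprocessing.cpu_count()
    # subprocess expects every argument to be a string, hence str()
    return ['perl', FASTQC_PATH, fastq_file,
            '-o', output_dir, '-t', str(num_threads)]


cmd = build_fastqc_cmd('sample.fastq', '/tmp/out', num_threads=4)
```

The resulting list can be passed unchanged to @subprocess.check_call@.
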
h3. Writing a pipeline template to run the crunch script

...