Writing a Script Calling a Third Party Tool » History » Version 12

Sarah Guthrie, 04/06/2016 08:46 PM

{{>toc}}

h1. Writing a Script Calling a Third Party Tool

Case study: FastQC

Good tips include:
* Keep the Dockerfile in the git repository

h3. Writing a Dockerfile

We strongly recommend keeping your Dockerfiles in the same git repository as the crunch scripts that run inside the Docker images built from them.

Docker provides good documentation for writing Dockerfiles:
* A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
* Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/

From Docker:

"""
Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.

This page describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.
"""

For the FastQC case study, the Dockerfile looks like this:

<pre>
FROM arvados/jobs

USER root

RUN apt-get -q update && apt-get -qy install \
  fontconfig \
  openjdk-6-jre-headless \
  perl \
  unzip \
  wget

USER crunch

RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
    unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
</pre>

h3. How to build a docker image from a Dockerfile

<pre>
docker build -t username/imagename path/to/Dockerfile/
</pre>

h3. How to upload a docker image to Arvados

<pre>
arv keep docker username/imagename
</pre>

h3. How to call an external tool from a crunch script

We strongly recommend using the @subprocess@ module for calling external tools. If the output is small and written to standard output, @subprocess.check_output@ both checks that the tool completed successfully and returns its standard output.

<pre>
import subprocess
foo = subprocess.check_output(['echo', 'foo'])
</pre>
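To make "completed successfully" concrete: @subprocess.check_output@ raises @subprocess.CalledProcessError@ when the tool exits with a non-zero status. The following is a minimal sketch, not part of the FastQC script; the helper name @run_and_capture@ is hypothetical, and a child Python process stands in for a real third-party tool.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (ok, output).

    ok is False when the tool exits with a non-zero status.
    """
    try:
        return True, subprocess.check_output(cmd)
    except subprocess.CalledProcessError as err:
        # err.returncode holds the exit status; err.output holds whatever
        # the tool wrote to standard output before it failed.
        return False, err.output


# Stand-in commands: one that succeeds and one that exits with status 3.
ok, out = run_and_capture([sys.executable, '-c', 'print("foo")'])
failed_ok, failed_out = run_and_capture(
    [sys.executable, '-c', 'import sys; sys.exit(3)'])
```

In a crunch script it is usually simpler to let the exception propagate so that the task fails visibly; catching it as above is only useful when the script can do something sensible with the failure.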

If the output is large, @subprocess.check_call@ can redirect it to a file while still checking that the tool completed successfully.

<pre>
import subprocess
with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
</pre>
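The same pattern extends to standard error, which is often worth keeping as a log. The sketch below is one possible arrangement under stated assumptions: the helper name @run_to_files@ and the file names @tool.out@ and @tool.err@ are hypothetical, and a child Python process writing 100,000 bytes stands in for a tool with large output.

```python
import os
import subprocess
import sys
import tempfile


def run_to_files(cmd, stdout_path, stderr_path):
    """Run cmd, streaming stdout and stderr to separate files.

    Raises subprocess.CalledProcessError if cmd exits with a non-zero status.
    """
    with open(stdout_path, 'wb') as out, open(stderr_path, 'wb') as err:
        subprocess.check_call(cmd, stdout=out, stderr=err)


# Scratch directory for the demonstration only.
workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, 'tool.out')
err_path = os.path.join(workdir, 'tool.err')

# Stand-in for a real tool: write 100,000 bytes to standard output.
run_to_files(
    [sys.executable, '-c', 'import sys; sys.stdout.write("x" * 100000)'],
    out_path, err_path)
output_size = os.path.getsize(out_path)
```

Because the output goes straight to disk, the script never holds it in memory, which is the reason to prefer @check_call@ over @check_output@ for large outputs.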

FastQC writes its reports to files rather than to standard output, either in its default output location or in the directory specified by the @-o@ flag, so we can use @subprocess.check_call@.

<pre>
import subprocess
import arvados

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
subprocess.check_call(cmd)
</pre>

h3. Where to put temporary files

Each task has a temporary directory, available as the @tmpdir@ attribute of the current task:

<pre>
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir
</pre>

Applied to the FastQC script, with the temporary directory passed as the output directory:

<pre>
import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
subprocess.check_call(cmd)
</pre>

h3. How to write data directly to Keep (Using TaskOutputDir)

<pre>
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Write to outdir.path

arvados.task_set_output(outdir.manifest_text())
</pre>

Applied to the FastQC script:

<pre>
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path]
subprocess.check_call(cmd)

arvados.task_set_output(outdir.manifest_text())
</pre>

h3. When TaskOutputDir is not the correct choice

* If the tool writes symbolic links or named pipes, which are not supported by FUSE
* If the I/O access patterns do not perform well over FUSE
** This occurs with TopHat, which opens 20 file handles on multiple files that it writes out
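
One way to find out whether a tool falls under the first case is to run it once against an ordinary scratch directory and inspect what it produced. The sketch below is illustrative only: the helper name @find_unsupported_entries@ and the demo file names are hypothetical, and the demo setup uses @os.symlink@ and @os.mkfifo@, so it assumes a POSIX system.

```python
import os
import stat
import tempfile


def find_unsupported_entries(root):
    """Return sorted paths under root that are symbolic links or named pipes."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects the entry itself instead of following a symlink.
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                found.append(path)
    return sorted(found)


# Demo: a scratch directory holding a regular file, a symlink and a named pipe.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, 'report.txt'), 'w') as f:
    f.write('ok\n')
os.symlink('report.txt', os.path.join(demo, 'latest'))
os.mkfifo(os.path.join(demo, 'pipe'))

unsupported = find_unsupported_entries(demo)
```

An empty result does not prove the tool is safe, since it may create and remove such entries while it runs, but a non-empty one is a clear sign to avoid TaskOutputDir.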

When TaskOutputDir is not suitable, have the tool write to local disk, then open a collection writer and add the resulting files and/or directory trees to it:

<pre>
import arvados

collection_writer = arvados.collection.CollectionWriter()
# 'foo.txt' and bar_directory_path are placeholders for your own file and directory
collection_writer.write_file('foo.txt')
collection_writer.write_directory_tree(bar_directory_path)
arvados.task_set_output(collection_writer.finish())
</pre>

Applied to the FastQC script, with the output written to a subdirectory of the temporary directory:

<pre>
import subprocess
import arvados
import os

task = arvados.current_task()
tmpdir = task.tmpdir

outdir_path = os.path.join(tmpdir, 'out')
os.mkdir(outdir_path)

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path]
subprocess.check_call(cmd)

collection_writer = arvados.collection.CollectionWriter()
# 'foo.txt' is a placeholder for any additional single file to include in the output
collection_writer.write_file('foo.txt')
collection_writer.write_directory_tree(outdir_path)
arvados.task_set_output(collection_writer.finish())
</pre>
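Note that @os.mkdir@ raises @OSError@ if the directory already exists, which could happen if the script is run again against the same scratch space (an assumption about how the script is invoked, not something the example above requires). A minimal sketch of a more forgiving helper follows; the name @ensure_dir@ is hypothetical, and @tempfile.mkdtemp()@ stands in for @task.tmpdir@ so the example runs outside Crunch.

```python
import errno
import os
import tempfile


def ensure_dir(path):
    """Create path if needed; succeed quietly if it is already a directory."""
    try:
        os.makedirs(path)
    except OSError as err:
        # Re-raise anything other than "already exists as a directory".
        if err.errno != errno.EEXIST or not os.path.isdir(path):
            raise
    return path


scratch = tempfile.mkdtemp()  # stand-in for task.tmpdir
outdir_path = ensure_dir(os.path.join(scratch, 'out'))
again = ensure_dir(outdir_path)  # second call is a no-op
```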

h3. Putting it all together

<pre>
import multiprocessing
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

# Grab the number of threads available
num_threads = multiprocessing.cpu_count()

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)]
subprocess.check_call(cmd)

arvados.task_set_output(outdir.manifest_text())
</pre>
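If the command line grows further, it can help to build it in a small function that can be checked without running FastQC or Arvados. This is a sketch under stated assumptions: the function name @build_fastqc_cmd@ is hypothetical, as are the example arguments @sample.fastq@ and @/tmp/out@; the FastQC path and the @-o@ and @-t@ flags are the ones used in the script above.

```python
import multiprocessing

# Install location of FastQC inside the Docker image built earlier.
FASTQC_PATH = '/home/crunch/fastqc/FastQC/fastqc'


def build_fastqc_cmd(fastq_file, outdir, threads=None):
    """Return the argument list for running FastQC on one file.

    threads defaults to the number of CPUs reported by the machine.
    """
    if threads is None:
        threads = multiprocessing.cpu_count()
    return ['perl', FASTQC_PATH, fastq_file, '-o', outdir, '-t', str(threads)]


# Hypothetical arguments, used only to show the resulting list.
cmd = build_fastqc_cmd('sample.fastq', '/tmp/out', threads=4)
```

The returned list can be passed to @subprocess.check_call@ exactly as in the script above.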