Writing a Script Calling a Third Party Tool » History » Version 13

Sarah Guthrie, 04/06/2016 08:46 PM

{{>toc}}

h1. Writing a Script Calling a Third Party Tool

Case study: FastQC

h3. Writing a Dockerfile

We strongly recommend keeping each Dockerfile in the same git repository as the crunch scripts that run inside the Docker image it builds.

Docker provides thorough documentation on writing Dockerfiles:

* A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
* Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/

From Docker:

"""
Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.

This page describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.
"""

For the FastQC case study, the following Dockerfile extends the @arvados/jobs@ image with FastQC and its dependencies:

<pre>
FROM arvados/jobs

# Install FastQC's dependencies as root
USER root

RUN apt-get -q update && apt-get -qy install \
  fontconfig \
  openjdk-6-jre-headless \
  perl \
  unzip \
  wget

# Download and unpack FastQC as the unprivileged crunch user
USER crunch

RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
    unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
</pre>

h3. How to build a docker image from a Dockerfile

Pass @docker build@ the directory that contains the Dockerfile, and name the resulting image with @-t@:

<pre>
docker build -t username/imagename path/to/Dockerfile/
</pre>

h3. How to upload a docker image to Arvados

Use @arv keep docker@, which uploads a Docker image to Keep (@arv keep put@ uploads ordinary files):

<pre>
arv keep docker username/imagename
</pre>

h3. How to call an external tool from a crunch script

We strongly recommend using the @subprocess@ module to call external tools. If the output is small and written to standard output, @subprocess.check_output@ returns that output and raises an exception if the tool exits with a nonzero status.

<pre>
import subprocess

# Run the tool and capture its standard output; raises CalledProcessError on failure
foo = subprocess.check_output(['echo', 'foo'])
</pre>
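
As a minimal sketch of what "completed successfully" means in practice, the helper below wraps @subprocess.check_output@ and shows the exception raised on a nonzero exit status. The helper name @run_and_capture@ is hypothetical, and the Python interpreter itself stands in for the external tool so that the example runs anywhere.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return its standard output as text.

    Hypothetical helper: raises subprocess.CalledProcessError if the
    command exits with a nonzero status.
    """
    output = subprocess.check_output(cmd)
    # check_output returns bytes on Python 3, so decode before use
    return output.decode('utf-8')


# Stand-in for a small external tool that succeeds
ok_cmd = [sys.executable, '-c', 'print("foo")']
# Stand-in for a tool that fails with exit status 3
bad_cmd = [sys.executable, '-c', 'import sys; sys.exit(3)']

result = run_and_capture(ok_cmd)

try:
    run_and_capture(bad_cmd)
    failed_status = None
except subprocess.CalledProcessError as err:
    # returncode holds the exit status of the failed command
    failed_status = err.returncode
```

In most scripts it is reasonable to let the exception propagate, so that the script itself exits with an error when the tool fails.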

If the output is large, @subprocess.check_call@ can redirect it to a file while still ensuring the tool completed successfully.

<pre>
import subprocess

# Stream the tool's standard output straight into a file
with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
</pre>
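
When a tool also writes diagnostics to standard error, it can help to keep them in a separate log file. The sketch below is one way to do that with @subprocess.check_call@. The function name @run_to_files@ and the file names are illustrative assumptions, and the Python interpreter stands in for the external tool.

```python
import os
import subprocess
import sys
import tempfile


def run_to_files(cmd, stdout_path, stderr_path):
    """Run cmd, sending stdout and stderr to separate files.

    Hypothetical helper: raises subprocess.CalledProcessError if the
    command exits with a nonzero status.
    """
    with open(stdout_path, 'wb') as out, open(stderr_path, 'wb') as err:
        subprocess.check_call(cmd, stdout=out, stderr=err)


workdir = tempfile.mkdtemp()
stdout_path = os.path.join(workdir, 'tool.out')
stderr_path = os.path.join(workdir, 'tool.err')

# Stand-in tool: writes 100000 bytes to stdout and one line to stderr
script = (
    "import sys; "
    "sys.stdout.write('x' * 100000); "
    "sys.stderr.write('done\\n')"
)
run_to_files([sys.executable, '-c', script], stdout_path, stderr_path)

stdout_size = os.path.getsize(stdout_path)
with open(stderr_path) as log:
    stderr_text = log.read()
```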

FastQC writes its reports to files rather than to standard output (into the directory given by the @-o@ flag, if one is specified), so @subprocess.check_call@ is sufficient:

<pre>
import subprocess
import arvados

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
subprocess.check_call(cmd)
</pre>

h3. Where to put temporary files

Write temporary files to the task's temporary directory, which the SDK exposes as @tmpdir@ on the current task:

<pre>
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir
</pre>

Applied to the FastQC script, pass the temporary directory to the @-o@ flag:

<pre>
import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

# Have FastQC write its reports into the task's temporary directory
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
subprocess.check_call(cmd)
</pre>
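
The same pattern can be tried outside Crunch. In the sketch below, @tempfile.mkdtemp()@ stands in for @task.tmpdir@ and a one-line Python command stands in for FastQC. Both substitutions are assumptions made so that the example runs without the Arvados SDK.

```python
import os
import subprocess
import sys
import tempfile

# Assumption: a fresh scratch directory plays the role of task.tmpdir
tmpdir = tempfile.mkdtemp()

# Stand-in tool: takes an output directory and writes report.txt into it
script = (
    "import os, sys; "
    "open(os.path.join(sys.argv[1], 'report.txt'), 'w').write('ok')"
)
cmd = [sys.executable, '-c', script, tmpdir]
subprocess.check_call(cmd)

# Everything the tool wrote is now under tmpdir
written = sorted(os.listdir(tmpdir))
```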

h3. How to write data directly to Keep (using TaskOutputDir)

Create a @TaskOutputDir@, write the output files under @outdir.path@, then set the task output to the directory's manifest:

<pre>
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Write to outdir.path

arvados.task_set_output(outdir.manifest_text())
</pre>

Applied to the FastQC script:

<pre>
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

# Have FastQC write its reports directly into the Keep-backed directory
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path]
subprocess.check_call(cmd)

arvados.task_set_output(outdir.manifest_text())
</pre>

h3. When TaskOutputDir is not the correct choice

* If the tool writes symbolic links or named pipes, which are not supported by FUSE
* If the tool's I/O access patterns perform poorly over FUSE
** This occurs with TopHat, which opens 20 file handles on multiple files that it writes out

In these cases, have the tool write to a local directory, then open a collection writer and write files and/or directory trees to it:

<pre>
import arvados

collection_writer = arvados.collection.CollectionWriter()
# Add a single file, then a whole directory tree
# (bar_directory_path is a placeholder for a local directory)
collection_writer.write_file('foo.txt')
collection_writer.write_directory_tree(bar_directory_path)
# Store the collection and record it as the task output
arvados.task_set_output(collection_writer.finish())
</pre>

Applied to the FastQC script:

<pre>
import os
import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

# Create a local output directory inside the task's temporary directory
outdir_path = os.path.join(tmpdir, 'out')
os.mkdir(outdir_path)

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path]
subprocess.check_call(cmd)

# Upload everything FastQC wrote and record it as the task output
collection_writer = arvados.collection.CollectionWriter()
collection_writer.write_directory_tree(outdir_path)
arvados.task_set_output(collection_writer.finish())
</pre>
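
To check the first condition listed at the start of this section, one option is to scan a directory the tool has already written for symbolic links and named pipes. This is only a sketch: the function name @find_unsupported_entries@ is hypothetical and is not part of the Arvados SDK.

```python
import os
import stat
import tempfile


def find_unsupported_entries(root):
    """Return sorted paths under root that are symlinks or named pipes.

    Hypothetical helper, not part of the Arvados SDK.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects the entry itself rather than a link target
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                found.append(path)
    return sorted(found)


# Build a small example tree: one regular file and one symlink to it
example_dir = tempfile.mkdtemp()
regular = os.path.join(example_dir, 'report.txt')
with open(regular, 'w') as f:
    f.write('ok')
link = os.path.join(example_dir, 'latest.txt')
os.symlink(regular, link)

unsupported = find_unsupported_entries(example_dir)
```

An empty result suggests the tool's output contains nothing of the first kind. It says nothing about the second condition, which concerns I/O access patterns.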

h3. Putting it all together

The complete script writes FastQC's output directly to Keep and runs FastQC with one thread per available core:

<pre>
import multiprocessing
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

# Grab the number of threads available
num_threads = multiprocessing.cpu_count()

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)]
subprocess.check_call(cmd)

arvados.task_set_output(outdir.manifest_text())
</pre>
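
The command line in the script above can be factored into a small function, which makes it easy to inspect without running FastQC. This is a sketch: @build_fastqc_cmd@ is a hypothetical helper, the FastQC path is the one used throughout this page, and the input file name and output directory are made-up examples.

```python
import multiprocessing

# Path to the FastQC script used throughout this page
FASTQC_PATH = '/home/crunch/fastqc/FastQC/fastqc'


def build_fastqc_cmd(fastq_file, outdir, num_threads=None):
    """Return the argument list for a FastQC run (hypothetical helper)."""
    if num_threads is None:
        # Default to one thread per available CPU core
        num_threads = multiprocessing.cpu_count()
    # subprocess requires every argument to be a string
    return ['perl', FASTQC_PATH, fastq_file,
            '-o', outdir, '-t', str(num_threads)]


# 'sample.fastq' and '/tmp/out' are placeholder values for illustration
cmd = build_fastqc_cmd('sample.fastq', '/tmp/out', num_threads=4)
```

The resulting list can be passed to @subprocess.check_call@ unchanged.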