Writing a Script Calling a Third Party Tool » History » Version 14

Sarah Guthrie, 04/06/2016 08:47 PM

{{>toc}}

h1. Writing a Script Calling a Third Party Tool

Case study: FastQC

h3. Writing a Dockerfile

We strongly recommend keeping each Dockerfile in the same git repository as the crunch scripts that run inside the Docker image built from it.

Docker has some wonderful documentation for building Dockerfiles:

* A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
* Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/

From Docker:

> Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.
> 
> This page describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.

The following Dockerfile extends the @arvados/jobs@ image with FastQC and the packages needed to download and run it:

<pre>
FROM arvados/jobs

USER root

RUN apt-get -q update && apt-get -qy install \
  fontconfig \
  openjdk-6-jre-headless \
  perl \
  unzip \
  wget

USER crunch

RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
    unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
</pre>

h3. How to build a docker image from a Dockerfile

Pass @docker build@ a tag for the image and the directory that contains the Dockerfile:

<pre>
docker build -t username/imagename path/to/Dockerfile/
</pre>

h3. How to upload a docker image to Arvados

<pre>
arv keep docker username/imagename
</pre>

h3. How to call an external tool from a crunch script

We strongly recommend using the @subprocess@ module to call external tools. If the tool writes a small amount of output to standard output, @subprocess.check_output@ returns that output and raises an exception if the tool exits with a non-zero status.

<pre>
import subprocess

# Returns the tool's standard output; raises CalledProcessError on failure
foo = subprocess.check_output(['echo', 'foo'])
</pre>
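When a tool fails, the exception raised by @subprocess.check_output@ carries the exit status and the captured output, which is usually worth putting in the job log. The following is a minimal sketch of one way to do that, not a required pattern; @run_and_capture@ is a hypothetical helper name, and a child Python process stands in for the real tool.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Hypothetical helper: run cmd and return its standard output as text.

    Raises RuntimeError, including the exit status and anything the tool
    printed, if the tool exits with a non-zero status.
    """
    try:
        # stderr is merged into stdout so error messages are captured too
        output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as err:
        raise RuntimeError('%r exited with status %d: %r'
                           % (cmd, err.returncode, err.output))
    return output.decode('utf-8')


# Stand-in for a real tool: a child Python process that prints "foo"
greeting = run_and_capture([sys.executable, '-c', 'print("foo")'])
```

Merging standard error into the captured output is a design choice for this sketch; a tool whose standard output must stay clean would need the two streams kept apart.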

If the output is large, @subprocess.check_call@ can redirect it to a file while still raising an exception if the tool fails.

<pre>
import subprocess

# Stream the tool's standard output straight into a file
with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
</pre>
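The same redirection works for standard error, which can be useful when a tool logs progress there. This is a sketch under assumptions: @run_to_files@, @tool.out@ and @tool.log@ are hypothetical names, and a child Python process stands in for the real tool.

```python
import os
import subprocess
import sys
import tempfile


def run_to_files(cmd, stdout_path, stderr_path):
    """Hypothetical helper: run cmd, sending each output stream to a file.

    Returns the size in bytes of the standard output file.
    """
    with open(stdout_path, 'wb') as out, open(stderr_path, 'wb') as err:
        # check_call raises CalledProcessError if the tool fails
        subprocess.check_call(cmd, stdout=out, stderr=err)
    return os.path.getsize(stdout_path)


# Stand-in for a real tool: a child process that writes to both streams
workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, 'tool.out')
log_path = os.path.join(workdir, 'tool.log')
child = 'import sys; sys.stdout.write("data"); sys.stderr.write("note")'
size = run_to_files([sys.executable, '-c', child], out_path, log_path)
```

Because the child process writes directly to the open files, the crunch script never holds the tool's output in memory.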

FastQC writes its reports to files rather than to standard output: next to the input file by default, or in the directory given by the @-o@ flag. There is no output to capture, so @subprocess.check_call@ is sufficient:

<pre>
import subprocess
import arvados

# Get the path of the FASTQ file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
subprocess.check_call(cmd)
</pre>

h3. Where to put temporary files

Crunch provides a temporary directory for the running task, available as the @tmpdir@ attribute of the current task:

<pre>
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir
</pre>

In the FastQC script, the reports can be written to this directory:

<pre>
import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

# Get the path of the FASTQ file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

# Write the FastQC reports into the task's temporary directory
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
subprocess.check_call(cmd)
</pre>

h3. How to write data directly to Keep (Using TaskOutputDir)

@arvados.crunch.TaskOutputDir@ provides a directory backed by Keep. Write the output files under @outdir.path@, then record the directory's manifest as the task output:

<pre>
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Write to outdir.path

arvados.task_set_output(outdir.manifest_text())
</pre>

In the FastQC script:

<pre>
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Get the path of the FASTQ file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path]
subprocess.check_call(cmd)

arvados.task_set_output(outdir.manifest_text())
</pre>

h3. When TaskOutputDir is not the correct choice

* If the tool writes symbolic links or named pipes, which are not supported by the FUSE mount
* If the tool's I/O access patterns perform poorly over FUSE
** For example, Tophat keeps 20 file handles open on the multiple files it writes out
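
One way to find out whether the first condition applies is to run the tool once against an ordinary local directory and look for symbolic links or named pipes in what it wrote. The sketch below assumes that workflow; @find_unsupported_entries@ is a hypothetical helper built only on the standard library, not part of the Arvados SDK.

```python
import os
import stat


def find_unsupported_entries(root):
    """Hypothetical helper: list symbolic links and named pipes under root."""
    unsupported = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk reports symlinks to directories in dirnames, so check both
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode  # lstat does not follow symlinks
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                unsupported.append(path)
    return sorted(unsupported)
```

An empty result for a representative run suggests the first condition does not apply; it says nothing about the second, which needs timing the tool against the mount itself.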

In these cases, have the tool write to a local directory, then open a collection writer and add the files and/or directory trees to it:

<pre>
import arvados

collection_writer = arvados.collection.CollectionWriter()
# 'foo.txt' and bar_directory_path are placeholders for your own output paths
collection_writer.write_file('foo.txt')
collection_writer.write_directory_tree(bar_directory_path)
arvados.task_set_output(collection_writer.finish())
</pre>

In the FastQC script:

<pre>
import os
import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

# Create a local output directory inside the task's temporary directory
outdir_path = os.path.join(tmpdir, 'out')
os.mkdir(outdir_path)

# Get the path of the FASTQ file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path]
subprocess.check_call(cmd)

# Upload the reports to Keep and record them as the task output
collection_writer = arvados.collection.CollectionWriter()
collection_writer.write_directory_tree(outdir_path)
arvados.task_set_output(collection_writer.finish())
</pre>

h3. Putting it all together

<pre>
import multiprocessing
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Get the path of the FASTQ file to run FastQC on
fastq_file = arvados.getjobparam('input_fastq_file')

# Get the number of CPU cores available, to use as the FastQC thread count
num_threads = multiprocessing.cpu_count()

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)]
subprocess.check_call(cmd)

arvados.task_set_output(outdir.manifest_text())
</pre>
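
The command list is the only part of the script that can be checked without an Arvados job, so it may be worth building it in a function of its own. This is a sketch of that idea rather than part of the original script; @build_fastqc_command@ and @FASTQC_SCRIPT@ are hypothetical names, and the path assumes the image built from the Dockerfile above.

```python
import multiprocessing

# Path of the FastQC launcher, assuming the image from the Dockerfile above
FASTQC_SCRIPT = '/home/crunch/fastqc/FastQC/fastqc'


def build_fastqc_command(fastq_file, output_dir, threads=None):
    """Hypothetical helper: build the argument list passed to check_call."""
    if threads is None:
        # Default to one FastQC thread per available CPU core
        threads = multiprocessing.cpu_count()
    if threads < 1:
        raise ValueError('threads must be at least 1, got %r' % (threads,))
    return ['perl', FASTQC_SCRIPT, fastq_file,
            '-o', output_dir, '-t', str(threads)]


# Example with placeholder paths; in a crunch script these would come from
# arvados.getjobparam and arvados.crunch.TaskOutputDir
cmd = build_fastqc_command('reads.fastq', '/tmp/out', threads=4)
```

In the crunch script the result would be passed to @subprocess.check_call@ exactly as before.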