Writing a Script Calling a Third Party Tool » History » Version 10

Sarah Guthrie, 04/06/2016 08:35 PM

{{>toc}}

h1. Writing a Script Calling a Third Party Tool

Case study: FastQC

Good tips include:
* Keep the Dockerfile in the git repository

h3. Writing a Dockerfile

Docker has thorough documentation on writing Dockerfiles:
* A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
* Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/

From Docker:

"""
Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.

This page describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.
"""

For FastQC, the Dockerfile installs Java, Perl, and a few utilities as @root@, then downloads and unpacks FastQC as the @crunch@ user:

<pre>
FROM arvados/jobs

USER root

RUN apt-get -q update && apt-get -qy install \
  fontconfig \
  openjdk-6-jre-headless \
  perl \
  unzip \
  wget

USER crunch

RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
    unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
</pre>

h3. How to build a docker image from a Dockerfile

Pass @docker build@ the directory that contains the Dockerfile, and name the image with @-t@:

<pre>
docker build -t username/imagename path/to/Dockerfile/
</pre>
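
h3. How to upload a docker image to Arvados

Use @arv keep docker@, which uploads a locally built image to Keep (@arv keep put@ uploads ordinary files, not Docker images):

<pre>
arv keep docker username/imagename
</pre>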

h3. How to call an external tool from a crunch script

We strongly recommend using the @subprocess@ module to call external tools. If the output is small and written to standard output, @subprocess.check_output@ returns that output and raises an exception if the tool exits with a non-zero status.

<pre>
import subprocess
foo = subprocess.check_output(['echo','foo'])
</pre>
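
As a hedged illustration of the failure path: @subprocess.check_output@ raises @subprocess.CalledProcessError@ when the tool exits with a non-zero status. The sketch below wraps that in a helper. The name @run_and_capture@ is hypothetical, and the Python interpreter stands in for a real tool so the example can run anywhere.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (returncode, combined stdout/stderr as text).

    Hypothetical helper: it turns the exception raised by check_output
    into a return code so the caller can decide how to report failure.
    """
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return 0, out.decode('utf-8', 'replace')
    except subprocess.CalledProcessError as err:
        return err.returncode, (err.output or b'').decode('utf-8', 'replace')


# A command that succeeds and one that fails, using the current interpreter
# as a stand-in for an external tool.
ok_code, ok_text = run_and_capture([sys.executable, '-c', 'print("foo")'])
bad_code, bad_text = run_and_capture(
    [sys.executable, '-c', 'import sys; print("boom"); sys.exit(3)'])
```

Returning the exit status instead of letting the exception propagate is only one option; letting the exception end the script is equally reasonable when a failed tool should fail the whole task.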

If the output is large, @subprocess.check_call@ can redirect it to a file while still checking that the tool completed successfully.

<pre>
import subprocess
with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
</pre>
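
The same redirection works for standard error, which can be useful for keeping a tool's log separate from its output. This is a minimal sketch under assumptions: the file names @tool.out@ and @tool.log@ are made up, the scratch directory comes from @tempfile@ rather than from Arvados, and the Python interpreter stands in for the external tool.

```python
import os
import subprocess
import sys
import tempfile

# Scratch directory for the demo; a crunch script would use its own tmpdir.
workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, 'tool.out')   # hypothetical file name
log_path = os.path.join(workdir, 'tool.log')   # hypothetical file name

# Stand-in "tool": writes 1000 bytes to stdout and one line to stderr.
script = 'import sys; sys.stdout.write("x" * 1000); sys.stderr.write("done\\n")'

with open(out_path, 'wb') as outfile, open(log_path, 'wb') as logfile:
    # check_call raises CalledProcessError if the tool exits non-zero.
    subprocess.check_call([sys.executable, '-c', script],
                          stdout=outfile, stderr=logfile)

out_size = os.path.getsize(out_path)
with open(log_path) as handle:
    log_text = handle.read()
```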

By default FastQC writes its reports next to the input file, or to the directory given by the @-o@ flag. Nothing needs to be captured from standard output, so we can use @subprocess.check_call@:

<pre>
import subprocess
import arvados

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
subprocess.check_call(cmd)
</pre>

h3. Where to put temporary files

The current task exposes a temporary directory as @task.tmpdir@:

<pre>
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir
</pre>

In the FastQC script, pass that directory to FastQC with @-o@:

<pre>
import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
subprocess.check_call(cmd)
</pre>

h3. How to write data directly to Keep (Using TaskOutputDir)

<pre>
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Write to outdir.path

arvados.task_set_output(outdir.manifest_text())
</pre>

In the FastQC script, pass @outdir.path@ to FastQC with @-o@:

<pre>
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path]
subprocess.check_call(cmd)

arvados.task_set_output(outdir.manifest_text())
</pre>

h3. When TaskOutputDir is not the correct choice

* If the tool writes symbolic links or named pipes, which are not supported by FUSE
* If the tool's I/O access patterns perform poorly over FUSE
** This occurs in TopHat, which opens 20 file handles on multiple files that it writes out
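
One way to check the first condition is to run the tool once against a local scratch directory and look for symbolic links or named pipes in what it wrote. The sketch below is an illustration only: @find_unsupported_entries@ is a hypothetical helper, not part of the Arvados SDK, and the demo directory is built by hand rather than by a real tool. It relies on @os.symlink@ and @os.mkfifo@, so it assumes a Unix-like system.

```python
import os
import stat
import tempfile


def find_unsupported_entries(root):
    """Return sorted paths (relative to root) that are symlinks or named pipes."""
    found = []
    # os.walk does not follow symlinked directories by default, but it
    # still lists them in dirnames, so both lists are inspected.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            mode = os.lstat(full).st_mode  # lstat: do not follow links
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                found.append(os.path.relpath(full, root))
    return sorted(found)


# Demo tree: one regular file, one symlink, one named pipe.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, 'report.txt'), 'w') as handle:
    handle.write('ok\n')
os.symlink('report.txt', os.path.join(demo, 'latest'))
os.mkfifo(os.path.join(demo, 'pipe'))

unsupported = find_unsupported_entries(demo)
```

An empty result does not prove the tool never creates such entries (it may remove them before exiting), so treat this as a quick screen rather than a guarantee.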

When TaskOutputDir is not suitable, write the output to local disk, then open a collection writer and add files and/or directory trees:

<pre>
import arvados

collection_writer = arvados.collection.CollectionWriter()
# 'foo.txt' and bar_directory_path are placeholders for your own paths
collection_writer.write_file('foo.txt')
collection_writer.write_directory_tree(bar_directory_path)
arvados.task_set_output(collection_writer.finish())
</pre>

In the FastQC script, write to a subdirectory of the task's temporary directory and upload it afterwards:

<pre>
import subprocess
import arvados
import os

task = arvados.current_task()
tmpdir = task.tmpdir

outdir_path = os.path.join(tmpdir, 'out')
os.mkdir(outdir_path)

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path]
subprocess.check_call(cmd)

collection_writer = arvados.collection.CollectionWriter()
# 'foo.txt' is a placeholder for any additional file to include
collection_writer.write_file('foo.txt')
collection_writer.write_directory_tree(outdir_path)
arvados.task_set_output(collection_writer.finish())
</pre>
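
h3. Putting it all together

The script below runs FastQC on every FASTQ file in the current directory, uploads the reports to Keep, and records success or failure on the task instead of raising an exception:

<pre>
import glob
import os
import subprocess

import arvados

this_task = arvados.current_task()
outdirpath = os.path.join(this_task.tmpdir, 'out')
os.mkdir(outdirpath)
num_threads = 1  # assumption: set this to the number of cores available

# Run FastQC on every FASTQ file in the current directory
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc']
fq_files = sorted(glob.glob('*.fq*'))
fastq_files = sorted(glob.glob('*.fastq*'))
cmd.extend(fq_files+fastq_files)
cmd.extend(['-o', outdirpath, '-t', str(num_threads)])
fastqc_pipe = subprocess.Popen(cmd)
fastqc_pipe.wait()

# Upload the reports to Keep
coll_writer = arvados.CollectionWriter()
coll_writer.write_directory_tree(outdirpath)
pdh = coll_writer.finish()

# Record the output and whether FastQC exited successfully
body = {'output':pdh, 'success':fastqc_pipe.returncode==0, 'progress':1.0}
arvados.api('v1').job_tasks().update(uuid=this_task['uuid'], body=body).execute()
</pre>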
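
The command-building step can be pulled into a small function so it can be checked without Arvados or FastQC installed. This is a sketch under assumptions: @build_fastqc_command@ and its parameters are hypothetical names, the input directory is passed explicitly instead of relying on the current directory, and raising on an empty file list is a design choice not present in the script above.

```python
import glob
import os
import tempfile


def build_fastqc_command(input_dir, outdir, num_threads=1,
                         fastqc='/home/crunch/fastqc/FastQC/fastqc'):
    """Build the FastQC argument list for every FASTQ file in input_dir."""
    fq_files = sorted(glob.glob(os.path.join(input_dir, '*.fq*')))
    fastq_files = sorted(glob.glob(os.path.join(input_dir, '*.fastq*')))
    inputs = fq_files + fastq_files
    if not inputs:
        # Assumption: fail early rather than call FastQC with no inputs.
        raise ValueError('no FASTQ files found in %s' % input_dir)
    return ['perl', fastqc] + inputs + ['-o', outdir, '-t', str(num_threads)]


# Demo: empty placeholder files, only the names matter here.
demo_dir = tempfile.mkdtemp()
for name in ('b.fq', 'a.fastq.gz', 'notes.txt'):
    open(os.path.join(demo_dir, name), 'w').close()

cmd = build_fastqc_command(demo_dir, 'out', num_threads=4)
input_names = [os.path.basename(path) for path in cmd[2:-4]]
```

The resulting list can be passed to @subprocess.check_call@ or @subprocess.Popen@ exactly as in the script above.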