Writing a Script Calling a Third Party Tool » History » Version 1

Sarah Guthrie, 04/06/2016 06:16 PM

{{>toc}}

h1. Writing a Script Calling a Third Party Tool

Case study: FastQC

Good tips include:

* Keep the Dockerfile in the git repository

h3. Writing a Dockerfile

Docker has some wonderful documentation for building Dockerfiles:

* A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
* Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/

From Docker:

"""
Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.

This page describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.
"""

The Dockerfile for the FastQC case study:

<pre>
FROM arvados/jobs

# Install the packages FastQC needs, as root
USER root

RUN apt-get -q update && apt-get -qy install \
  fontconfig \
  openjdk-6-jre-headless \
  perl \
  unzip \
  wget

# Download and unpack FastQC as the unprivileged crunch user
USER crunch

RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
    unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
</pre>

h3. How to build a Docker image from a Dockerfile

The last argument is the directory that contains the Dockerfile:

<pre>
docker build -t username/imagename path/to/Dockerfile/
</pre>

h3. How to upload a Docker image to Arvados

<pre>
arv keep docker username/imagename
</pre>

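h3. How to call an external tool from a crunch script

The most convenient approach is usually to build the command as a list and run it with subprocess:

<pre>
import glob
import os
import subprocess

import arvados

# Assumptions (not defined in the original snippet): the current task,
# an output directory under the task's tmpdir, and a thread count.
this_task = arvados.current_task()
outdirpath = os.path.join(this_task.tmpdir, 'out')
os.mkdir(outdirpath)
num_threads = 1

# Run FastQC on every FASTQ file in the working directory.
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc']
fq_files = sorted(glob.glob('*.fq*'))
fastq_files = sorted(glob.glob('*.fastq*'))
cmd.extend(fq_files+fastq_files)
cmd.extend(['-o', outdirpath, '-t', str(num_threads)])
fastqc_pipe = subprocess.Popen(cmd)
fastqc_pipe.wait()

# Upload the reports to Keep.
coll_writer = arvados.CollectionWriter()
coll_writer.write_directory_tree(outdirpath)
pdh = coll_writer.finish()

# Record the output and whether FastQC exited successfully.
body = {'output':pdh, 'success':fastqc_pipe.returncode==0, 'progress':1.0}
arvados.api('v1').job_tasks().update(uuid=this_task['uuid'], body=body).execute()
</pre>
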
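The command-building step above can be factored into a helper that fails early when no input files match. This is only a minimal sketch: the function name @build_fastqc_command@ and its parameters are hypothetical, and the FastQC path is assumed to be the one created by the Dockerfile on this page.

```python
import glob
import os

# Path where the Dockerfile on this page unpacks FastQC (assumption).
FASTQC = '/home/crunch/fastqc/FastQC/fastqc'


def build_fastqc_command(input_dir, outdir, num_threads=1):
    """Return the FastQC argument list for every FASTQ file in input_dir.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    fq_files = sorted(glob.glob(os.path.join(input_dir, '*.fq*')))
    fastq_files = sorted(glob.glob(os.path.join(input_dir, '*.fastq*')))
    inputs = fq_files + fastq_files
    if not inputs:
        # Fail before launching perl rather than running FastQC with no input.
        raise ValueError('no FASTQ files found in %s' % input_dir)
    return ['perl', FASTQC] + inputs + ['-o', outdir, '-t', str(num_threads)]
```

The returned list can be passed directly to subprocess.Popen or subprocess.check_call. Passing a list rather than a shell string means file names with spaces need no quoting.
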
If the tool writes a lot of data to standard output, redirect it to a file:

<pre>
import subprocess
with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
</pre>

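The redirect pattern above can be wrapped in a small function so that every tool call handles its output the same way. A minimal sketch, assuming a hypothetical helper name @run_to_file@; it opens the file in binary mode because the child process writes raw bytes to it.

```python
import subprocess
import sys


def run_to_file(cmd, path):
    """Run cmd with stdout redirected to path.

    Hypothetical helper. Raises subprocess.CalledProcessError if the
    command exits with a nonzero status.
    """
    # The child writes straight to the file descriptor, so the output
    # is never held in this script's memory.
    with open(path, 'wb') as outfile:
        subprocess.check_call(cmd, stdout=outfile)
    return path
```

For example, @run_to_file(['head', '-c', '1234567', '/dev/urandom'], 'foo')@ reproduces the snippet above. Because check_call raises on failure, a failed tool stops the script instead of leaving a partial file unnoticed.
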
h3. Where to put temporary files

Write temporary files under the task's tmpdir:

<pre>
import arvados
import os
task = arvados.current_task()
tmpdir = task.tmpdir

with open(os.path.join(tmpdir, 'foo'), 'w') as out:
    out.write('scratch data\n')  # placeholder body, not in the original snippet
</pre>

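If several steps of a script write scratch files into the same tmpdir, fixed names such as 'foo' can collide. One possible way to avoid that, sketched with the standard tempfile module; the helper name @make_scratch_file@ is hypothetical, and tmpdir is assumed to be task.tmpdir when run inside a crunch script.

```python
import os
import tempfile


def make_scratch_file(tmpdir, suffix=''):
    """Create an empty, uniquely named file inside tmpdir and return its path.

    Hypothetical helper; tmpdir would be task.tmpdir in a crunch script.
    """
    # mkstemp picks a name that does not exist yet and creates the file.
    fd, path = tempfile.mkstemp(suffix=suffix, dir=tmpdir)
    os.close(fd)  # the caller reopens the path when it needs to write
    return path
```
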
h3. How to write data directly to Keep (using TaskOutputDir)

<pre>
import arvados
import arvados.crunch
import os
import subprocess

# Output directory backed by Keep
outdir = arvados.crunch.TaskOutputDir()

with open(os.path.join(outdir.path, 'foo'), 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)

# Record the directory's manifest as the task output
arvados.task_set_output(outdir.manifest_text())
</pre>

h3. When TaskOutputDir is not the correct choice

* If the tool writes symbolic links or named pipes, which FUSE does not support
* If the tool's I/O access patterns perform poorly over FUSE
** This occurs in TopHat, which opens 20 file handles on multiple files that it writes out

In these cases, write the output to the task's tmpdir and upload it afterwards:

<pre>
import arvados
import os
import subprocess
task = arvados.current_task()
tmpdir = task.tmpdir

os.mkdir(os.path.join(tmpdir, 'out'))

with open(os.path.join(tmpdir, 'out', 'foo.txt'), 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)

collection_writer = arvados.collection.CollectionWriter()
# write_file adds one file; random_file.txt must already exist in the working directory
collection_writer.write_file('random_file.txt')
# write_directory_tree adds everything under tmpdir/out
collection_writer.write_directory_tree(os.path.join(tmpdir, 'out'))
arvados.task_set_output(collection_writer.finish())
</pre>

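The first condition in the list above (symbolic links or named pipes) can be checked by running the tool once into local scratch space and scanning what it wrote. A minimal sketch using only the standard library; the function name @find_unsupported_entries@ is hypothetical, and it says nothing about the second condition (I/O performance).

```python
import os
import stat


def find_unsupported_entries(root):
    """Return sorted paths under root that are symbolic links or named pipes."""
    unsupported = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk lists symlinks to directories in dirnames, so check both.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode  # lstat does not follow symlinks
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                unsupported.append(path)
    return sorted(unsupported)
```

An empty result suggests the tool's output contains nothing of the kinds listed in the first bullet. A non-empty result points to the tmpdir approach shown above.
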
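h3. Putting it all together

<pre>
import glob
import os
import subprocess

import arvados

# Assumptions (not defined in the original snippet): the current task,
# an output directory under the task's tmpdir, and a thread count.
this_task = arvados.current_task()
outdirpath = os.path.join(this_task.tmpdir, 'out')
os.mkdir(outdirpath)
num_threads = 1

# Run FastQC on every FASTQ file in the working directory.
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc']
fq_files = sorted(glob.glob('*.fq*'))
fastq_files = sorted(glob.glob('*.fastq*'))
cmd.extend(fq_files+fastq_files)
cmd.extend(['-o', outdirpath, '-t', str(num_threads)])
fastqc_pipe = subprocess.Popen(cmd)
fastqc_pipe.wait()

# Upload the reports to Keep.
coll_writer = arvados.CollectionWriter()
coll_writer.write_directory_tree(outdirpath)
pdh = coll_writer.finish()

# Record the output and whether FastQC exited successfully.
body = {'output':pdh, 'success':fastqc_pipe.returncode==0, 'progress':1.0}
arvados.api('v1').job_tasks().update(uuid=this_task['uuid'], body=body).execute()
</pre>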