{{>toc}}

h1. Writing a Script Calling a Third Party Tool

h2. Case study: FastQC

# Building an environment able to run FastQC
## Writing a Dockerfile
## Building a docker image from the Dockerfile
## Uploading the docker image to an Arvados instance
# Writing a crunch script that runs FastQC (in the docker image)
## Calling FastQC
## Where to place temporary files
## Writing output data
# Writing a pipeline template to run the crunch script

h3. Writing a Dockerfile

Dockerfiles, as explained by docker:

> Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.
>
> This page (https://docs.docker.com/engine/reference/builder/) describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.

Docker provides thorough documentation on writing Dockerfiles, which we recommend reading before building the finished Dockerfile shown below:
* A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
* Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/

We strongly recommend keeping your Dockerfiles in the same git repository as the crunch scripts that run inside the docker images built from them.

A Dockerfile that installs FastQC:
<pre>
FROM arvados/jobs

USER root

RUN apt-get -q update && apt-get -qy install \
    fontconfig \
    openjdk-6-jre-headless \
    perl \
    unzip \
    wget

USER crunch

RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
    unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
</pre>
51 | |||
52 | h3. How to build a docker image from a Dockerfile |
||
53 | |||
54 | 16 | Sarah Guthrie | Once you have a Dockerfile, you can use the @docker build@ command to build the image using the Dockerfile instructions. |
55 | |||
56 | 1 | Sarah Guthrie | <pre> |
57 | docker build -t username/imagename path/to/Dockerfile/ |
||
58 | </pre> |
||
59 | |||
h3. How to upload a docker image to Arvados

Once the docker image is built, you can use the Arvados CLI (http://doc.arvados.org/sdk/cli/index.html) command @arv keep docker@ to upload the image to an Arvados cluster.

<pre>
arv keep docker username/imagename
</pre>

h3. How to call an external tool from a crunch script

We strongly recommend using the @subprocess@ module to call external tools. If the output is small and written to standard output, @subprocess.check_output@ will verify that the tool exited successfully and return its standard output.

<pre>
import subprocess

foo = subprocess.check_output(['echo', 'foo'])
</pre>
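
Both @check_output@ and @check_call@ raise @subprocess.CalledProcessError@ when the tool exits with a non-zero status, so a failing command fails the crunch task instead of being silently ignored. If you want to log extra context before failing, a minimal sketch using only the standard library:

<pre>
import subprocess
import sys

try:
    output = subprocess.check_output(['echo', 'foo'])
except subprocess.CalledProcessError as error:
    # Log the failing command and exit status, then let the crunch task fail.
    sys.stderr.write("command %s failed with exit status %d\n" % (error.cmd, error.returncode))
    raise
</pre>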
76 | |||
77 | 3 | Sarah Guthrie | If the output is big, @subprocess.check_call@ can redirect it to a file while ensuring the tool completed successfully. |
78 | 1 | Sarah Guthrie | |
79 | <pre> |
||
80 | import subprocess |
||
81 | with open('foo', 'w') as outfile: |
||
82 | subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile) |
||
83 | </pre> |

FastQC writes to its default output location, or to the output directory specified with the @-o@ flag, so we can use @subprocess.check_call@:

<pre>
import subprocess
import arvados

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
subprocess.check_call(cmd)
</pre>

h3. Where to put temporary files

Each crunch task gets its own temporary directory on the compute node; its path is available from the current task object:

<pre>
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir
</pre>

Inside the FastQC crunch script:

<pre>
import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
subprocess.check_call(cmd)
</pre>

123 | h3. How to write data directly to Keep (Using TaskOutputDir) |
||
124 | |||
125 | <pre> |
||
126 | 8 | Sarah Guthrie | import arvados |
127 | import arvados.crunch |
||
128 | |||
129 | outdir = arvados.crunch.TaskOutputDir() |
||
130 | |||
131 | # Write to outdir.path |
||
132 | |||
133 | arvados.task_set_output(outdir.manifest_text()) |
||
134 | </pre> |
||
135 | |||
136 | Inside the code: |
||
137 | |||
138 | <pre> |
||
139 | 7 | Sarah Guthrie | import subprocess |
140 | 1 | Sarah Guthrie | import arvados |
141 | import arvados.crunch |
||
142 | 7 | Sarah Guthrie | |
143 | 1 | Sarah Guthrie | outdir = arvados.crunch.TaskOutputDir() |
144 | |||
145 | 7 | Sarah Guthrie | #Grab the file path pointing to the file to run fastqc on |
146 | fastq_file = arvados.getjobparam('input_fastq_file') |
||
147 | |||
148 | cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path] |
||
149 | subprocess.check_call(cmd) |
||
150 | 1 | Sarah Guthrie | |
151 | arvados.task_set_output(outdir.manifest_text()) |
||
152 | </pre> |
||
153 | |||
154 | h3. When TaskOutputDir is not the correct choice |
||
155 | |||
156 | * If the tool writes symbolic links or named pipes, which are not supported by fuse |
||
157 | * If the I/O access patterns are not performant with fuse |
||
158 | ** This occurs in Tophat, which opens 20 file handles on multiple files that it writes out |
||
159 | 9 | Sarah Guthrie | |
160 | Open a collection writer, write files and/or directory trees: |
||
161 | |||
162 | 1 | Sarah Guthrie | <pre> |
163 | import arvados |
||
164 | 9 | Sarah Guthrie | |
165 | collection_writer = arvados.collection.CollectionWriter() |
||
166 | collection_writer.write_file('foo.txt') |
||
167 | collection_writer.write_directory_tree(bar_directory_path) |
||
168 | arvados.task_set_output(collection_writer.finish()) |
||
169 | </pre> |
||
170 | |||
171 | Inside the code: |
||
172 | |||
173 | <pre> |
||
174 | import subprocess |
||
175 | 1 | Sarah Guthrie | import arvados |
176 | 9 | Sarah Guthrie | import os |
177 | 1 | Sarah Guthrie | |
178 | task = arvados.current_task() |
||
179 | tmpdir = task.tmpdir |
||
180 | 9 | Sarah Guthrie | |
181 | outdir_path = os.path.join(tmpdir, 'out') |
||
182 | 1 | Sarah Guthrie | os.mkdir(outdir_path) |
183 | 9 | Sarah Guthrie | |
184 | 1 | Sarah Guthrie | #Grab the file path pointing to the file to run fastqc on |
185 | 9 | Sarah Guthrie | fastq_file = arvados.getjobparam('input_fastq_file') |
186 | |||
187 | cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path] |
||
188 | subprocess.check_call(cmd) |
||
189 | 1 | Sarah Guthrie | |
190 | collection_writer = arvados.collection.CollectionWriter() |
||
191 | 10 | Sarah Guthrie | collection_writer.write_file('foo.txt') |
192 | collection_writer.write_directory_tree(outdir_path) |
||
193 | 1 | Sarah Guthrie | arvados.task_set_output(collection_writer.finish()) |
194 | </pre> |
||
195 | 18 | Sarah Guthrie | |
196 | 16 | Sarah Guthrie | h3. The final crunch script |
197 | 1 | Sarah Guthrie | |
198 | 21 | Sarah Guthrie | *fastqc.py* |
199 | 1 | Sarah Guthrie | <pre> |
200 | import subprocess |
||
201 | 11 | Sarah Guthrie | import arvados |
202 | import arvados.crunch |
||
203 | 1 | Sarah Guthrie | |
204 | outdir = arvados.crunch.TaskOutputDir() |
||
205 | |||
206 | #Grab the file path pointing to the file to run fastqc on |
||
207 | fastq_file = arvados.getjobparam('input_fastq_file') |
||
208 | 11 | Sarah Guthrie | |
209 | 1 | Sarah Guthrie | #Grab the number of threads available |
210 | 11 | Sarah Guthrie | num_threads = multiprocessing.cpu_count() |
211 | |||
212 | 1 | Sarah Guthrie | cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)] |
213 | 11 | Sarah Guthrie | subprocess.check_call(cmd) |
214 | |||
215 | arvados.task_set_output(outdir.manifest_text()) |
||
216 | </pre> |
||
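The pipeline template in the next section refers to this script by repository and file name; crunch looks for it in the @crunch_scripts/@ directory of the specified git repository. A sketch of the git commands, assuming your Arvados repository is already checked out and pushes to a remote named @origin@:

<pre>
mkdir -p crunch_scripts
cp fastqc.py crunch_scripts/
git add crunch_scripts/fastqc.py
git commit -m "Add FastQC crunch script"
git push origin master
</pre>
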
h3. Writing a pipeline template to run the crunch script

Now we need to write a pipeline template that specifies this crunch script and the docker image we created earlier. As with the Dockerfile, even though Arvados reads the pipeline template from the API server, keeping a copy of the template in the same repository makes the code easier to maintain and helps ensure that changes to the code are reflected in the pipeline template.

Using the command @arv create pipeline_template@, we can create the following pipeline template.

<pre>
{
  "name": "FastQC Pipeline",
  "components": {
    "Run-FastQC": {
      "repository": "repository/name",
      "script": "fastqc.py",
      "script_version": "master",
      "script_parameters": {
        "input_fastq_file": {
          "dataclass": "Collection",
          "required": true,
          "title": "Input FASTQ file"
        }
      },
      "runtime_constraints": {
        "docker_image": "username/imagename",
        "max_tasks_per_node": 1
      }
    }
  }
}
</pre>
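
Alternatively, you can save the JSON above to a file and register it non-interactively, then start a pipeline instance from the template. This is a sketch of one possible workflow; the file name, template UUID, and input collection identifier are placeholders, and the full set of options is described in the arv CLI documentation:

<pre>
# Register the pipeline template from a saved JSON file
arv pipeline_template create --pipeline-template "$(cat fastqc_pipeline_template.json)"

# Run the pipeline, supplying the input for the Run-FastQC component
arv pipeline run --template <pipeline_template_uuid> Run-FastQC::input_fastq_file=<collection_uuid>
</pre>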

For further information about managing a pipeline template, see [[Git_strategy_for_pipeline_development]].