Revision 12 (Sarah Guthrie, 04/06/2016 08:46 PM) → Revision 13/22 (Sarah Guthrie, 04/06/2016 08:46 PM)

{{>toc}} 

 h1. Writing a Script Calling a Third Party Tool 

 Case study: FastQC 

 Good tips include: 
 * Keep the Dockerfile in the git repository 

 h3. Writing a Dockerfile 

 We strongly recommend keeping each Dockerfile in the same git repository as the crunch scripts that run inside the Docker image built from it. 

 Docker has some wonderful documentation for building Dockerfiles: 
 * A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/ 
 * Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/ 

 From Docker: 

 """ 
 Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession. 

 This page describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide. 
 """ 

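 The following Dockerfile installs FastQC v0.11.4 and the packages it needs on top of the @arvados/jobs@ image: 

 <pre> 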
 FROM arvados/jobs 

 USER root 

 RUN apt-get -q update && apt-get -qy install \ 
   fontconfig \ 
   openjdk-6-jre-headless \ 
   perl \ 
   unzip \ 
   wget 

 USER crunch 

 RUN mkdir /home/crunch/fastqc 
 RUN cd /home/crunch/fastqc && \ 
     wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \ 
     unzip /home/crunch/fastqc/fastqc_v0.11.4.zip 

 </pre> 

 

 h3. How to build a docker image from a Dockerfile 

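 Pass @docker build@ the directory that contains the Dockerfile (the build context), not the Dockerfile itself: 

 <pre> 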
 docker build -t username/imagename path/to/Dockerfile/ 
 </pre> 

 h3. How to upload a docker image to Arvados 

 <pre> 
 arv keep docker username/imagename 
 </pre> 

 h3. How to call an external tool from a crunch script 

 We strongly recommend using the @subprocess@ module to call external tools. If the output is small and written to standard output, @subprocess.check_output@ returns that output and raises @subprocess.CalledProcessError@ if the tool exits with a non-zero status. 

 <pre> 
 import subprocess 
 foo = subprocess.check_output(['echo','foo']) 
 </pre> 
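When the tool fails, the @subprocess.CalledProcessError@ raised by @check_output@ carries the exit status and whatever was captured from standard output. A minimal sketch of logging that information before letting the task fail; @run_and_capture@ is a hypothetical helper, and the Python interpreter stands in for a real third party tool:

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return its standard output, logging details on failure.

    Hypothetical helper for illustration; not part of the Arvados SDK.
    """
    try:
        return subprocess.check_output(cmd)
    except subprocess.CalledProcessError as err:
        # err.returncode is the tool's exit status; err.output holds whatever
        # it wrote to standard output before failing.
        sys.stderr.write("command %r exited with status %d\n"
                         % (cmd, err.returncode))
        raise


# The Python interpreter stands in for a third party tool here.
greeting = run_and_capture([sys.executable, '-c', 'print("foo")'])
```

Re-raising the exception keeps the behaviour the section relies on: a failed tool still makes the script fail.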

 If the output is large, @subprocess.check_call@ can redirect it to a file while still raising an error if the tool fails. 

 <pre> 
 import subprocess 
 with open('foo', 'w') as outfile: 
     subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile) 
 </pre> 
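The same redirection works for standard error, which is useful when a tool reports its diagnostics there. A small sketch, assuming you want the two streams in separate files; @run_to_files@ and its parameter names are hypothetical:

```python
import subprocess
import sys


def run_to_files(cmd, out_path, err_path):
    """Run cmd with stdout and stderr redirected to separate files.

    Raises subprocess.CalledProcessError if cmd exits with a non-zero status.
    """
    with open(out_path, 'wb') as outfile, open(err_path, 'wb') as errfile:
        subprocess.check_call(cmd, stdout=outfile, stderr=errfile)
```

For example, @run_to_files(['head', '-c', '1234567', '/dev/urandom'], 'foo', 'foo.err')@ reproduces the call above while also keeping any error messages.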

 FastQC writes its reports to files rather than to standard output: by default next to the input file, or in the directory given by the @-o@ flag. There is no output to capture, so @subprocess.check_call@ is sufficient: 

 <pre> 
 import subprocess 
 import arvados 

 # Get the path of the file to run FastQC on 
 fastq_file = arvados.getjobparam('input_fastq_file') 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file] 
 subprocess.check_call(cmd) 
 </pre> 

 h3. Where to put temporary files 

 Each Crunch task is given a scratch directory, available as @tmpdir@ on the task object: 

 <pre> 
 import arvados 

 task = arvados.current_task() 
 tmpdir = task.tmpdir 
 </pre> 

 Applied to the FastQC script, so that the reports are written to the task's temporary directory: 

 <pre> 
 import subprocess 
 import arvados 

 task = arvados.current_task() 
 tmpdir = task.tmpdir 

 # Get the path of the file to run FastQC on 
 fastq_file = arvados.getjobparam('input_fastq_file') 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir] 
 subprocess.check_call(cmd) 

 </pre> 
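To try such a script outside a Crunch job, where @arvados.current_task()@ is not available, one option is to fall back to a throwaway directory. The sketch below rests on an assumption: that Crunch publishes the task's scratch directory in the @TASK_WORK@ environment variable. @get_scratch_dir@ is a hypothetical helper, not an SDK function.

```python
import os
import shutil
import tempfile


def get_scratch_dir(environ=os.environ):
    """Return (path, is_temporary) for a directory to hold temporary files.

    Assumption: inside a Crunch task the scratch directory is published in
    the TASK_WORK environment variable. Outside Crunch, fall back to a
    freshly created temporary directory that the caller should remove.
    """
    task_work = environ.get('TASK_WORK')
    if task_work:
        return task_work, False
    return tempfile.mkdtemp(prefix='fastqc-'), True


scratch, is_temporary = get_scratch_dir()
try:
    # Create a subdirectory for the tool's output, e.g. to pass to FastQC's
    # -o flag.
    report_dir = os.path.join(scratch, 'out')
    if not os.path.isdir(report_dir):
        os.mkdir(report_dir)
finally:
    # Only clean up directories this script created itself.
    if is_temporary:
        shutil.rmtree(scratch)
```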

 

 h3. How to write data directly to Keep (Using TaskOutputDir) 

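 @arvados.crunch.TaskOutputDir@ provides a Keep-backed directory, @outdir.path@, so files written there need no separate upload step. When the tool has finished, record the directory's manifest as the task output: 

 <pre> 
 import arvados 
 import arvados.crunch 

 outdir = arvados.crunch.TaskOutputDir() 

 # Write output files under outdir.path 

 # set_output is a method of the task object 
 arvados.current_task().set_output(outdir.manifest_text()) 
 </pre> 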

 Applied to the FastQC script: 

 <pre> 
 import subprocess 
 import arvados 
 import arvados.crunch 

 outdir = arvados.crunch.TaskOutputDir() 

 # Get the path of the file to run FastQC on 
 fastq_file = arvados.getjobparam('input_fastq_file') 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path] 
 subprocess.check_call(cmd) 

 # set_output is a method of the task object 
 arvados.current_task().set_output(outdir.manifest_text()) 
 </pre> 

 

 h3. When TaskOutputDir is not the correct choice 

 * The tool writes symbolic links or named pipes, which are not supported by FUSE 
 * The tool's I/O access pattern performs poorly over FUSE 
 ** For example, Tophat opens 20 file handles on multiple files that it writes out 
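One way to find out whether a tool falls into the first category is to run it once against an ordinary local directory and scan what it produced. A minimal sketch; @find_unsupported_entries@ is a hypothetical helper, not an Arvados function:

```python
import os
import stat


def find_unsupported_entries(root):
    """Return the sorted paths under root that are symlinks or named pipes."""
    unsupported = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Symlinks to directories are listed in dirnames; os.walk does not
        # follow them by default.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode  # lstat does not follow symlinks
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                unsupported.append(path)
    return sorted(unsupported)
```

An empty result suggests the tool's output is made of plain files and directories only; it says nothing about the second category, which needs to be judged from the tool's run time.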

 When TaskOutputDir is not suitable, have the tool write to local scratch space, then open a collection writer and add the files and/or directory trees to it: 

 <pre> 
 import arvados 

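 collection_writer = arvados.collection.CollectionWriter() 
 # 'foo.txt' and bar_directory_path are placeholders for your own output paths 
 collection_writer.write_file('foo.txt') 
 collection_writer.write_directory_tree(bar_directory_path) 
 # set_output is a method of the task object 
 arvados.current_task().set_output(collection_writer.finish()) 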
 </pre> 

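 Applied to the FastQC script, with the reports written to a subdirectory of the task's temporary directory and uploaded afterwards: 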

 <pre> 
 import subprocess 
 import arvados 
 import os 

 task = arvados.current_task() 
 tmpdir = task.tmpdir 

 outdir_path = os.path.join(tmpdir, 'out') 
 os.mkdir(outdir_path) 

 # Get the path of the file to run FastQC on 
 fastq_file = arvados.getjobparam('input_fastq_file') 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path] 
 subprocess.check_call(cmd) 

 # Upload everything FastQC wrote and record it as the task output 
 collection_writer = arvados.collection.CollectionWriter() 
 collection_writer.write_directory_tree(outdir_path) 
 arvados.current_task().set_output(collection_writer.finish()) 

 </pre> 

 

 h3. Putting it all together 

 The complete script uses TaskOutputDir and passes FastQC the number of available cores: 

 <pre> 
 import multiprocessing 
 import subprocess 
 import arvados 
 import arvados.crunch 

 outdir = arvados.crunch.TaskOutputDir() 

 # Get the path of the file to run FastQC on 
 fastq_file = arvados.getjobparam('input_fastq_file') 

 # Get the number of threads available 
 num_threads = multiprocessing.cpu_count() 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)] 
 subprocess.check_call(cmd) 

 # set_output is a method of the task object 
 arvados.current_task().set_output(outdir.manifest_text()) 
 </pre>
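Keeping the command construction in a small function makes it possible to check the argument list without running FastQC or Crunch. A sketch using only the flags shown above (@-o@ and @-t@); @build_fastqc_cmd@ and the @FASTQC@ constant are hypothetical names:

```python
import multiprocessing

# Install path created by the Dockerfile in this page.
FASTQC = '/home/crunch/fastqc/FastQC/fastqc'


def build_fastqc_cmd(fastq_file, outdir, threads=None):
    """Return the argument list for running FastQC on one file.

    threads defaults to the number of CPUs reported by the machine.
    """
    if threads is None:
        threads = multiprocessing.cpu_count()
    return ['perl', FASTQC, fastq_file, '-o', outdir, '-t', str(threads)]
```

In the script above, @cmd = build_fastqc_cmd(fastq_file, outdir.path)@ would replace the hand-written list.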