Writing a Script Calling a Third Party Tool » History » Revision 16

Revision 15 (Sarah Guthrie, 04/06/2016 08:49 PM) → Revision 16/22 (Sarah Guthrie, 04/07/2016 10:22 PM)

{{>toc}} 

 h1. Writing a Script Calling a Third Party Tool 

 h2. Case study: FastQC 

 # Building an environment able to run FastQC
 ## Writing a Dockerfile
 ## Building a docker image from a Dockerfile
 ## Uploading the docker image to an Arvados instance
 # Writing a crunch script that runs FastQC (in the docker image)
 ## Calling FastQC
 ## Where to place temporary files
 ## Writing output data
 # Writing a pipeline template to run the crunch script

 h3. Writing a Dockerfile

 We strongly recommend keeping your Dockerfiles in the git repository with the crunch scripts that run inside the docker images created by them.

 Explanation about Dockerfiles from docker:

 > Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.
 >
 > This page (https://docs.docker.com/engine/reference/builder/) describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.

 Docker has some wonderful documentation for building Dockerfiles, which we recommend you consult for instructions on getting to the finished product below:
 * A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
 * Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/

 Dockerfile that installs FastQC: 
 

 <pre> 
 FROM arvados/jobs 

 USER root 

 RUN apt-get -q update && apt-get -qy install \ 
   fontconfig \ 
   openjdk-6-jre-headless \ 
   perl \ 
   unzip \ 
   wget 

 USER crunch 

 RUN mkdir /home/crunch/fastqc 
 RUN cd /home/crunch/fastqc && \ 
     wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \ 
     unzip /home/crunch/fastqc/fastqc_v0.11.4.zip 
 

 </pre> 

 

 h3. How to build a docker image from a Dockerfile 

 Once you have a Dockerfile, you can use the @docker build@ command to build the image from its instructions. The @-t@ flag names the resulting image, and the final argument is the directory that contains the Dockerfile.

 <pre> 
 docker build -t username/imagename path/to/Dockerfile/ 
 </pre> 
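A common slip is passing the path of the Dockerfile itself rather than its directory. The sketch below is one way to guard against that from a Python wrapper; @docker_build_cmd@ is a hypothetical helper introduced here for illustration, not part of Docker or Arvados.

```python
import os


def docker_build_cmd(tag, context_dir):
    """Return the argument list for `docker build -t TAG CONTEXT_DIR`.

    Raises ValueError if context_dir does not contain a file named
    Dockerfile, which is what `docker build` looks for by default.
    """
    dockerfile = os.path.join(context_dir, 'Dockerfile')
    if not os.path.isfile(dockerfile):
        raise ValueError('no Dockerfile found in %s' % context_dir)
    return ['docker', 'build', '-t', tag, context_dir]
```

The returned list can be passed to @subprocess.check_call@, for example @subprocess.check_call(docker_build_cmd('username/imagename', 'path/to/Dockerfile/'))@.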

 h3. How to upload a docker image to Arvados 

 Once the docker image is built, you can use the Arvados CLI (http://doc.arvados.org/sdk/cli/index.html) command @arv keep docker@ to upload the image to an Arvados cluster.

 <pre> 
 arv keep docker username/imagename
 </pre> 

 h3. How to call an external tool from a crunch script 

 We strongly recommend using the @subprocess@ module for calling external tools. If the output is small and written to standard output, @subprocess.check_output@ returns that output and raises @subprocess.CalledProcessError@ if the tool exits with a non-zero status, so a failed run cannot pass unnoticed.

 <pre> 
 import subprocess 
 foo = subprocess.check_output(['echo','foo']) 
 </pre> 
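To make the failure behaviour concrete, here is a minimal sketch that runs one command that succeeds and one that exits with status 3. @run_and_capture@ is a hypothetical helper name, and the Python interpreter itself stands in for the external tool so the example does not depend on any particular program being installed.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (succeeded, captured standard output)."""
    try:
        return True, subprocess.check_output(cmd)
    except subprocess.CalledProcessError as err:
        # err.returncode holds the exit status and err.output holds
        # whatever the tool wrote to standard output before failing.
        return False, err.output


ok, out = run_and_capture([sys.executable, '-c', 'print("foo")'])
failed_ok, _ = run_and_capture([sys.executable, '-c', 'import sys; sys.exit(3)'])
```

In a crunch script it is usually better to let the exception propagate so the task is marked as failed; catching it as above is only useful when you want to log or clean up first.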

 If the output is large, @subprocess.check_call@ can redirect it to a file while still ensuring the tool completed successfully.

 <pre> 
 import subprocess 
 with open('foo', 'w') as outfile: 
     subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile) 
 </pre> 

 FastQC writes its reports to files rather than to standard output: by default next to the input file, or in the directory given by the @-o@ flag. There is no output to capture, so @subprocess.check_call@ is sufficient:

 <pre> 
 import subprocess 
 import arvados 

 #Grab the file path pointing to the file to run fastqc on  
 fastq_file = arvados.getjobparam('input_fastq_file') 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file] 
 subprocess.check_call(cmd) 
 </pre> 
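The FastQC command line grows over the rest of this page as the @-o@ and @-t@ flags are added. A small sketch of one way to build it in a single place follows; @build_fastqc_cmd@ and @FASTQC_PATH@ are hypothetical names introduced for illustration, and the path assumes the install location used by the Dockerfile above.

```python
# Assumed install location, matching the Dockerfile on this page.
FASTQC_PATH = '/home/crunch/fastqc/FastQC/fastqc'


def build_fastqc_cmd(fastq_file, outdir=None, threads=None):
    """Return the FastQC argument list; -o and -t are added only when given."""
    cmd = ['perl', FASTQC_PATH, fastq_file]
    if outdir is not None:
        cmd += ['-o', outdir]
    if threads is not None:
        cmd += ['-t', str(threads)]
    return cmd
```

Keeping the command as a list, rather than a single string passed through a shell, means file names containing spaces or shell metacharacters are handed to FastQC unchanged.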

 h3. Where to put temporary files

 Each task has a temporary directory, available as the @tmpdir@ attribute of the current task:

 <pre> 
 import arvados 

 task = arvados.current_task() 
 tmpdir = task.tmpdir 
 </pre> 

 Applied to the FastQC script, with the reports written to the temporary directory:

 <pre> 
 import subprocess 
 import arvados 

 task = arvados.current_task() 
 tmpdir = task.tmpdir 

 #Grab the file path pointing to the file to run fastqc on  
 fastq_file = arvados.getjobparam('input_fastq_file') 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir] 
 subprocess.check_call(cmd) 

 </pre> 
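If a script runs several tools that each need scratch space, giving each one its own subdirectory avoids file name clashes. The following is a minimal sketch using only the standard library; @scratch_subdir@ is a hypothetical helper, and the parent directory is whatever you pass in (inside a crunch script that would typically be @task.tmpdir@).

```python
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_subdir(parent_dir):
    """Create a uniquely named directory under parent_dir; remove it on exit."""
    path = tempfile.mkdtemp(dir=parent_dir)
    try:
        yield path
    finally:
        # Clean up even if the tool raised, so scratch space is not leaked.
        shutil.rmtree(path, ignore_errors=True)
```

Usage would look like @with scratch_subdir(tmpdir) as workdir:@ followed by the tool invocation. Anything that must survive as task output has to be saved before the @with@ block ends, since the directory is deleted at that point.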

 h3. How to write data directly to Keep (Using TaskOutputDir) 

 <pre> 
 import arvados 
 import arvados.crunch 

 outdir = arvados.crunch.TaskOutputDir() 

 # Write to outdir.path 

 arvados.task_set_output(outdir.manifest_text()) 
 </pre> 

 Applied to the FastQC script, with the reports written directly to Keep:

 <pre> 
 import subprocess 
 import arvados 
 import arvados.crunch 

 outdir = arvados.crunch.TaskOutputDir() 

 #Grab the file path pointing to the file to run fastqc on  
 fastq_file = arvados.getjobparam('input_fastq_file') 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path] 
 subprocess.check_call(cmd) 

 arvados.task_set_output(outdir.manifest_text()) 
 </pre> 

 h3. When TaskOutputDir is not the correct choice 

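 TaskOutputDir writes through a FUSE mount, so avoid it in the following cases:

 * If the tool writes symbolic links or named pipes, which the FUSE mount does not support
 * If the tool's I/O access patterns perform poorly over FUSE
 ** This occurs with Tophat, which opens 20 file handles on multiple files that it writes out

 In these cases, have the tool write to local temporary storage, then open a collection writer and write the files and/or directory trees to Keep: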

 <pre> 
 import arvados 

 collection_writer = arvados.collection.CollectionWriter() 
 collection_writer.write_file('foo.txt') 
 collection_writer.write_directory_tree(bar_directory_path) 
 arvados.task_set_output(collection_writer.finish()) 
 </pre> 

 Applied to the FastQC script, writing to a local directory first and then uploading it:

 <pre> 
 import subprocess 
 import arvados 
 import os 

 task = arvados.current_task() 
 tmpdir = task.tmpdir 

 outdir_path = os.path.join(tmpdir, 'out') 
 os.mkdir(outdir_path) 

 #Grab the file path pointing to the file to run fastqc on  
 fastq_file = arvados.getjobparam('input_fastq_file') 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path] 
 subprocess.check_call(cmd) 

 # Upload everything FastQC wrote to the local output directory
 collection_writer = arvados.collection.CollectionWriter()
 collection_writer.write_directory_tree(outdir_path)
 arvados.task_set_output(collection_writer.finish())

 </pre> 
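When it is unclear whether a tool produces symbolic links or named pipes, one option is to run it once into a local directory and inspect the result. The sketch below does that with the standard library only; @find_unsupported_entries@ is a hypothetical helper, and it checks only the two file types named in the list above, not I/O performance.

```python
import os
import stat


def find_unsupported_entries(root):
    """Return sorted paths under root that are symbolic links or named pipes."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects the entry itself instead of following a symlink.
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                found.append(path)
    return sorted(found)
```

An empty result suggests the first bullet does not apply to that tool's output; a non-empty one points toward the collection writer approach.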

 h3. Putting it all together

 <pre> 
 import multiprocessing
 import subprocess
 import arvados
 import arvados.crunch

 outdir = arvados.crunch.TaskOutputDir() 

 #Grab the file path pointing to the file to run fastqc on  
 fastq_file = arvados.getjobparam('input_fastq_file') 

 #Grab the number of threads available 
 num_threads = multiprocessing.cpu_count() 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)] 
 subprocess.check_call(cmd) 

 arvados.task_set_output(outdir.manifest_text()) 
 </pre> 
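Passing every core to @-t@ is not always the best choice. As far as FastQC's help text for @-t@ goes, each thread processes one input file and is allocated about 250 MB of memory; treat both points as assumptions to check against the FastQC version in your image. Under those assumptions, a thread count could be chosen as sketched below, where @choose_thread_count@ and @FASTQC_MB_PER_THREAD@ are hypothetical names.

```python
import multiprocessing

# Assumption: memory FastQC allocates per thread, per its -t help text.
FASTQC_MB_PER_THREAD = 250


def choose_thread_count(num_files, available_mb, cpu_count=None):
    """Pick a -t value bounded by cores, input files, and memory (at least 1)."""
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    by_memory = available_mb // FASTQC_MB_PER_THREAD
    return max(1, min(cpu_count, num_files, by_memory))
```

With a single FASTQ file, as in the script above, this returns 1 regardless of the core count, which under the stated assumption costs nothing in speed and avoids reserving memory that will not be used.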

 h3. Writing a pipeline template to run the crunch script 

 ...