



Revision 7 (Sarah Guthrie, 04/06/2016 08:17 PM) → Revision 8/22 (Sarah Guthrie, 04/06/2016 08:22 PM)


 h1. Writing a Script Calling a Third Party Tool 

 Case study: FastQC 

 Good tips include: 
 * Keep the Dockerfile in the git repository 

 h3. Writing a Dockerfile 

 Docker has some wonderful documentation for building Dockerfiles: 
 * A reference for Dockerfiles: 
 * Dockerfile best practices: 

 From Docker: 

 Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession. 

 This page describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices for a tip-oriented guide. 

 FROM arvados/jobs 

 USER root 

 RUN apt-get -q update && apt-get -qy install \ 
   fontconfig \ 
   openjdk-6-jre-headless \ 
   perl \ 
   unzip 

 USER crunch 

 RUN mkdir /home/crunch/fastqc 
 RUN cd /home/crunch/fastqc && \ 
     wget --quiet && \ 
     unzip /home/crunch/fastqc/ 


 h3. How to build a docker image from a Dockerfile 

 docker build -t username/imagename path/to/Dockerfile/ 

 h3. How to upload a docker image to Arvados 

 arv keep docker username/imagename 

 h3. How to call an external tool from a crunch script 

 We strongly recommend using the @subprocess@ module for calling external tools. If the output is small and written to standard output, @subprocess.check_output@ returns that output and raises an exception if the tool exits with a non-zero status. 

 import subprocess 
 foo = subprocess.check_output(['echo','foo']) 
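The failure behaviour can be seen in a small sketch. The helper name @run_and_capture@ is hypothetical, and @sys.executable@ stands in for a real external tool so the example runs anywhere Python does.

```python
import subprocess
import sys

def run_and_capture(cmd):
    """Return the tool's standard output as text; raise if it exits non-zero."""
    return subprocess.check_output(cmd).decode()

# A child process that succeeds and prints to standard output
ok_output = run_and_capture([sys.executable, '-c', 'print("foo")'])

# A child process that fails: check_output raises CalledProcessError
try:
    run_and_capture([sys.executable, '-c', 'import sys; sys.exit(3)'])
    failed_code = None
except subprocess.CalledProcessError as err:
    failed_code = err.returncode
```

In a crunch script it is usually fine to let the exception propagate, so that the task is reported as failed instead of continuing with missing output.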

 If the output is big, @subprocess.check_call@ can redirect it to a file while ensuring the tool completed successfully. 

 import subprocess 
 with open('foo', 'w') as outfile: 
     subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile) 
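A portable variant of the same pattern, as a sketch: the helper name @run_to_file@ and the file name @big_output.bin@ are made up for illustration, and a Python child process replaces @head@ so the example does not depend on @/dev/urandom@.

```python
import os
import subprocess
import sys
import tempfile

def run_to_file(cmd, out_path):
    """Run cmd, streaming its standard output into out_path; return the file size."""
    with open(out_path, 'wb') as outfile:
        # The child writes straight to the file; the parent never buffers the data
        subprocess.check_call(cmd, stdout=outfile)
    return os.path.getsize(out_path)

workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, 'big_output.bin')

# Child process that emits 1234567 bytes on standard output
child = 'import sys; sys.stdout.write("x" * 1234567)'
size = run_to_file([sys.executable, '-c', child], out_path)
```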

 FastQC writes its reports to files rather than to standard output: by default next to each input file, or in the directory given by the @-o@ flag. There is no output to capture, so @subprocess.check_call@ is sufficient: 

 import subprocess 
 import arvados 

 # Grab the file path pointing to the file to run fastqc on 
 fastq_file = arvados.getjobparam('input_fastq_file') 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file] 
 # Run FastQC and raise an exception if it exits with a non-zero status 
 subprocess.check_call(cmd) 

 h3. Where to put temporary files 

 import arvados 

 task = arvados.current_task() 
 tmpdir = task.tmpdir 

 Applied to the FastQC script: 

 import subprocess 
 import arvados 

 task = arvados.current_task() 
 tmpdir = task.tmpdir 

 # Grab the file path pointing to the file to run fastqc on 
 fastq_file = arvados.getjobparam('input_fastq_file') 

 # Direct the FastQC reports into the task's temporary directory 
 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir] 
 subprocess.check_call(cmd) 
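Since the command list is rebuilt several times on this page, it may be convenient to construct it in one place. The following is only a sketch: @build_fastqc_cmd@ is a hypothetical helper, @sample.fastq@ and @/tmp/out@ are placeholder values, and only the @-o@ and @-t@ flags used elsewhere on this page are handled.

```python
# Install location of FastQC, as used in the examples on this page
FASTQC_PATH = '/home/crunch/fastqc/FastQC/fastqc'

def build_fastqc_cmd(fastq_files, outdir=None, threads=None):
    """Build the argument list for FastQC without running it."""
    cmd = ['perl', FASTQC_PATH]
    cmd.extend(fastq_files)
    if outdir is not None:
        cmd.extend(['-o', outdir])        # where FastQC writes its reports
    if threads is not None:
        cmd.extend(['-t', str(threads)])  # arguments must be strings
    return cmd

cmd = build_fastqc_cmd(['sample.fastq'], outdir='/tmp/out', threads=2)
```

The resulting list can be passed directly to @subprocess.check_call@. Passing a list rather than a single string avoids shell quoting problems with unusual file names.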


 h3. How to write data directly to Keep (Using TaskOutputDir) 

 import arvados 
 import arvados.crunch 

 outdir = arvados.crunch.TaskOutputDir() 

 # Write to outdir.path 


 Applied to the FastQC script: 

 import subprocess 
 import arvados 
 import arvados.crunch 

 outdir = arvados.crunch.TaskOutputDir() 

 # Grab the file path pointing to the file to run fastqc on 
 fastq_file = arvados.getjobparam('input_fastq_file') 

 # Direct the FastQC reports straight into Keep 
 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path] 
 subprocess.check_call(cmd) 


 h3. When TaskOutputDir is not the correct choice 

 * If the tool writes symbolic links or named pipes, which are not supported by FUSE 
 * If the I/O access patterns perform poorly over FUSE 
 ** This occurs in Tophat, which opens 20 file handles on multiple files that it writes out 

 In these cases, write to the task's temporary directory and upload the directory afterwards: 

 import os 
 import subprocess 
 import arvados 
 import arvados.collection 

 task = arvados.current_task() 
 tmpdir = task.tmpdir 

 os.mkdir(os.path.join(tmpdir, 'out')) 

 with open(os.path.join(tmpdir, 'out', 'foo.txt'), 'w') as outfile: 
     subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile) 

 collection_writer = arvados.collection.CollectionWriter() 
 collection_writer.write_directory_tree(os.path.join(tmpdir, 'out')) 
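To find out whether a tool falls under the first case, one option is to run it once into a local directory and look for symbolic links and named pipes in the output. This is a sketch, not part of the Arvados SDK: @find_unsupported_entries@ is a hypothetical helper, and the demo directory is created only to exercise it (it assumes a POSIX system, where @os.symlink@ and @os.mkfifo@ are available).

```python
import os
import stat
import tempfile

def find_unsupported_entries(root):
    """Return sorted relative paths of symlinks and named pipes under root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            mode = os.lstat(full).st_mode  # lstat does not follow symlinks
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                found.append(os.path.relpath(full, root))
    return sorted(found)

# Demo output directory: one regular file, one symlink, one named pipe
demo = tempfile.mkdtemp()
with open(os.path.join(demo, 'report.txt'), 'w') as f:
    f.write('ok\n')
os.symlink('report.txt', os.path.join(demo, 'latest'))
os.mkfifo(os.path.join(demo, 'pipe'))

unsupported = find_unsupported_entries(demo)
```

An empty result suggests the output directory contains only regular files and directories; it says nothing about the second case, the I/O access pattern, which has to be judged by measuring run time.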


 h3. Putting it all together 

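 import glob 
 import os 
 import subprocess 
 import arvados 

 this_task = arvados.current_task() 
 outdirpath = os.path.join(this_task.tmpdir, 'out') 
 os.mkdir(outdirpath) 
 num_threads = 1  # placeholder: set to the number of cores available to the task 

 cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc'] 
 fq_files = sorted(glob.glob('*.fq*')) 
 fastq_files = sorted(glob.glob('*.fastq*')) 
 cmd.extend(fq_files + fastq_files) 
 cmd.extend(['-o', outdirpath, '-t', str(num_threads)]) 
 fastqc_pipe = subprocess.Popen(cmd) 
 fastqc_pipe.wait()  # returncode is None until the process has exited 

 coll_writer = arvados.CollectionWriter() 
 coll_writer.write_directory_tree(outdirpath) 
 pdh = coll_writer.finish() 

 body = {'output':pdh, 'success':fastqc_pipe.returncode==0, 'progress':1.0} 
 arvados.api('v1').job_tasks().update(uuid=this_task['uuid'], body=body).execute() 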
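One detail of @subprocess.Popen@ matters when reporting success this way: @returncode@ stays @None@ until the process has been waited on (or polled after it exits), so comparing it to 0 before that always yields false. A minimal sketch, with @sys.executable@ standing in for FastQC:

```python
import subprocess
import sys

# Start a child process that exits successfully
proc = subprocess.Popen([sys.executable, '-c', 'import sys; sys.exit(0)'])

# Popen only records the exit status in wait(), poll() or communicate()
before_wait = proc.returncode

proc.wait()
after_wait = proc.returncode

# This is the value that belongs in the task's 'success' field
success = (after_wait == 0)
```

When nothing needs to happen while the tool is running, @subprocess.call@ or @subprocess.check_call@ does the waiting for you.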