Revision 3 (Sarah Guthrie, 04/06/2016 07:53 PM) → Revision 4/22 (Sarah Guthrie, 04/06/2016 08:09 PM)
{{>toc}}

h1. Writing a Script Calling a Third Party Tool

Case study: FastQC

Good tips include:

* Keep the Dockerfile in the git repository

h3. Writing a Dockerfile

Docker has some wonderful documentation for building Dockerfiles:

* A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
* Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/

From Docker:

"""
Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.

This page describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.
"""

The Dockerfile for the FastQC case study:

<pre>
FROM arvados/jobs
USER root
RUN apt-get -q update && apt-get -qy install \
  fontconfig \
  openjdk-6-jre-headless \
  perl \
  unzip \
  wget
USER crunch
RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
  wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
  unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
</pre>

h3. How to build a Docker image from a Dockerfile

The final argument is the directory that contains the Dockerfile (the build context), not the Dockerfile itself.

<pre>
docker build -t username/imagename path/to/Dockerfile/
</pre>

h3. How to upload a Docker image to Arvados

<pre>
arv keep docker username/imagename
</pre>

h3. How to call an external tool from a crunch script

We strongly recommend using the @subprocess@ module for calling external tools. If the output is small and written to standard output, @subprocess.check_output@ returns that output and raises an exception if the tool exits with a non-zero status, so a failed run cannot go unnoticed.
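To make the success check concrete, here is a minimal sketch of handling a failing tool. The helper name @run_and_capture@ is hypothetical, and the Python interpreter itself stands in for the external tool so the example runs anywhere.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return its standard output as text.

    Hypothetical helper: converts a non-zero exit status into a
    RuntimeError that names the tool and its exit status.
    """
    try:
        out = subprocess.check_output(cmd)
    except subprocess.CalledProcessError as err:
        raise RuntimeError('%s exited with status %d' % (cmd[0], err.returncode))
    return out.decode('utf-8')


# Stand-in for a real tool: a child Python process that prints one line
ok = run_and_capture([sys.executable, '-c', 'print("foo")'])
```

A crunch script that lets the exception propagate stops at the failing tool instead of continuing with missing output.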
<pre>
import subprocess

foo = subprocess.check_output(['echo', 'foo'])
</pre>

If the output is big, @subprocess.check_call@ can redirect it to a file while ensuring the tool completed successfully.

<pre>
import subprocess

with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
</pre>

FastQC writes to the current output directory or the output directory specified by the @-o@ flag, so we can use @subprocess.check_call@:

<pre>
import subprocess
import arvados

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
subprocess.check_call(cmd)
</pre>

h3. Where to put temporary files

Each task has a scratch directory, available as @tmpdir@ on the current task. Write temporary files there.

<pre>
import subprocess
import arvados
import os

task = arvados.current_task()
tmpdir = task.tmpdir

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
# FastQC writes its reports to tmpdir; its standard output goes to tmpdir/foo
with open(os.path.join(tmpdir, 'foo'), 'w') as out:
    subprocess.check_call(cmd, stdout=out)
</pre>

h3. How to write data directly to Keep (Using TaskOutputDir)

<pre>
import arvados
import arvados.crunch
import os
import subprocess

outdir = arvados.crunch.TaskOutputDir()
with open(os.path.join(outdir.path, 'foo'), 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
arvados.task_set_output(outdir.manifest_text())
</pre>

h3. When TaskOutputDir is not the correct choice

* If the tool writes symbolic links or named pipes, which are not supported by FUSE
* If the I/O access patterns do not perform well with FUSE
** This occurs in Tophat, which opens 20 file handles on multiple files that it writes out

In these cases, write to the task's temporary directory and upload the results afterwards:

<pre>
import arvados
import arvados.collection
import os
import subprocess

task = arvados.current_task()
tmpdir = task.tmpdir

os.mkdir(os.path.join(tmpdir, 'out'))
with open(os.path.join(tmpdir, 'out', 'foo.txt'), 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)

collection_writer = arvados.collection.CollectionWriter()
# Add a single file (random_file.txt must already exist in the current directory)
collection_writer.write_file('random_file.txt')
# Add everything under tmpdir/out
collection_writer.write_directory_tree(os.path.join(tmpdir, 'out'))
arvados.task_set_output(collection_writer.finish())
</pre>

h3. Putting it all together

<pre>
import glob
import os
import subprocess
import arvados

this_task = arvados.current_task()
# Assumption: reports are staged in a scratch directory under the task's tmpdir
outdirpath = os.path.join(this_task.tmpdir, 'out')
os.mkdir(outdirpath)
# Assumption: a fixed thread count; adjust to the resources of the node
num_threads = 1

# Run FastQC on every FASTQ file in the current directory
cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc']
fq_files = sorted(glob.glob('*.fq*'))
fastq_files = sorted(glob.glob('*.fastq*'))
cmd.extend(fq_files + fastq_files)
cmd.extend(['-o', outdirpath, '-t', str(num_threads)])
fastqc_pipe = subprocess.Popen(cmd)
fastqc_pipe.wait()

# Upload the reports to Keep and record the outcome on the task
coll_writer = arvados.CollectionWriter()
coll_writer.write_directory_tree(outdirpath)
pdh = coll_writer.finish()
body = {'output': pdh, 'success': fastqc_pipe.returncode == 0, 'progress': 1.0}
arvados.api('v1').job_tasks().update(uuid=this_task['uuid'], body=body).execute()
</pre>
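The command assembly in the script above can be pulled into a small function and checked without an Arvados cluster. This is only a sketch: @build_fastqc_command@ and its parameters are hypothetical names, and raising an error when no FASTQ files are found is an added assumption, not behaviour of the original script. The FastQC path and the @-o@ and @-t@ flags are taken from the script above.

```python
import glob
import os

# Path to the FastQC launcher inside the Docker image, as used above
FASTQC = '/home/crunch/fastqc/FastQC/fastqc'


def build_fastqc_command(input_dir, outdir, num_threads=1):
    """Return the argument list for running FastQC on the FASTQ files in input_dir."""
    fq_files = sorted(glob.glob(os.path.join(input_dir, '*.fq*')))
    fastq_files = sorted(glob.glob(os.path.join(input_dir, '*.fastq*')))
    inputs = fq_files + fastq_files
    if not inputs:
        # Assumption: fail early rather than invoke FastQC with no input files
        raise ValueError('no FASTQ files found in %s' % input_dir)
    return ['perl', FASTQC] + inputs + ['-o', outdir, '-t', str(num_threads)]
```

The returned list can be passed directly to @subprocess.check_call@, which raises an exception on a non-zero exit status instead of requiring a manual @returncode@ check.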