Sarah Guthrie, 04/07/2016 10:22 PM


Writing a Script Calling a Third Party Tool

Case study: FastQC

  1. Building an environment able to run FastQC
    1. Writing a Dockerfile
    2. Building a docker image from the Dockerfile
    3. Uploading the docker image to an Arvados instance
  2. Writing a crunch script that runs FastQC (in the docker image)
    1. Calling FastQC
    2. Where to place temporary files
    3. Writing output data
  3. Writing a pipeline template to run the crunch script

Writing a Dockerfile

Dockerfiles, as explained by docker:

Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.

This page (https://docs.docker.com/engine/reference/builder/) describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide.

Docker's documentation on writing Dockerfiles is thorough, and we recommend reading it to understand how the finished Dockerfile below was put together.

We strongly recommend keeping your Dockerfiles in the git repository with the crunch scripts that run inside the docker images created by them.

Dockerfile that installs FastQC:

FROM arvados/jobs

USER root

RUN apt-get -q update && apt-get -qy install \
  fontconfig \
  openjdk-6-jre-headless \
  perl \
  unzip \
  wget

USER crunch

RUN mkdir /home/crunch/fastqc
RUN cd /home/crunch/fastqc && \
    wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
    unzip /home/crunch/fastqc/fastqc_v0.11.4.zip

How to build a docker image from a Dockerfile

Once you have a Dockerfile, you can use the docker build command to build the image using the Dockerfile instructions.

docker build -t username/imagename path/to/Dockerfile/

How to upload a docker image to Arvados

Once the docker image is built, you can use the arv keep docker command from the Arvados CLI (http://doc.arvados.org/sdk/cli/index.html) to upload the image to an Arvados cluster.

arv keep docker username/imagename

How to call an external tool from a crunch script

We strongly recommend using Python's subprocess module to call external tools. If the output is small and written to standard output, subprocess.check_output returns that output and raises an exception if the tool did not complete successfully.

import subprocess
foo = subprocess.check_output(['echo','foo'])

If the output is big, subprocess.check_call can redirect it to a file while ensuring the tool completed successfully.

import subprocess
with open('foo', 'w') as outfile:
    subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
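
Both functions signal failure by raising subprocess.CalledProcessError when the tool exits with a non-zero status. The sketch below illustrates this; run_tool is a hypothetical helper written for this example, and the Python interpreter stands in for a real tool. In a crunch script you would usually let the exception propagate so that the task fails, rather than catching it as done here.

```python
import subprocess
import sys

def run_tool(cmd):
    """Run cmd and return its exit status (hypothetical helper for illustration)."""
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as err:
        # err.returncode holds the non-zero exit status of the tool
        return err.returncode
    return 0

# Stand-ins for a failing and a succeeding external tool
failing_cmd = [sys.executable, '-c', 'import sys; sys.exit(3)']
passing_cmd = [sys.executable, '-c', 'pass']

print(run_tool(failing_cmd))
print(run_tool(passing_cmd))
```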

FastQC writes its reports to files rather than to standard output: either to the directory specified by the -o flag or, if that flag is not given, alongside the input file. We can therefore use subprocess.check_call:

import subprocess
import arvados

#Grab the file path pointing to the file to run fastqc on 
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
subprocess.check_call(cmd)

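Where to put temporary files

Each crunch task has a temporary directory for scratch files. Its path is available as tmpdir on the current task:
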
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

Applied to the FastQC script, with the reports written to the temporary directory:

import subprocess
import arvados

task = arvados.current_task()
tmpdir = task.tmpdir

#Grab the file path pointing to the file to run fastqc on 
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
subprocess.check_call(cmd)
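
If a script runs several tools, it can help to give each one its own subdirectory under tmpdir so their files do not collide. The following is a minimal sketch of that idea; make_scratch_dir and the 'fastqc-' prefix are hypothetical names chosen for this example, and a directory from tempfile stands in for task.tmpdir so the code runs outside a crunch job.

```python
import os
import tempfile

def make_scratch_dir(tmpdir, prefix='fastqc-'):
    """Create and return a uniquely named subdirectory of tmpdir (hypothetical helper)."""
    return tempfile.mkdtemp(prefix=prefix, dir=tmpdir)

# Stand-in for task.tmpdir; inside a crunch script, pass task.tmpdir instead
tmpdir = tempfile.mkdtemp()
scratch = make_scratch_dir(tmpdir)
print(scratch)
```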

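How to write data directly to Keep (Using TaskOutputDir)

TaskOutputDir provides a directory whose contents are stored in Keep. Write the output under outdir.path, then record the resulting manifest as the task output:
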
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Write to outdir.path

arvados.task_set_output(outdir.manifest_text())

Applied to the FastQC script:

import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

#Grab the file path pointing to the file to run fastqc on 
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path]
subprocess.check_call(cmd)

arvados.task_set_output(outdir.manifest_text())

When TaskOutputDir is not the correct choice

  • If the tool writes symbolic links or named pipes, which are not supported by FUSE
  • If the tool's I/O access patterns perform poorly over FUSE
    • This occurs with Tophat, which opens 20 file handles on multiple files that it writes out
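
One way to check for the first case is to run the tool once into a local scratch directory and inspect what it wrote. The sketch below walks a directory tree and reports whether it contains any symbolic links or named pipes; has_unsupported_entries is a hypothetical helper written for this example and is not part of the Arvados SDK.

```python
import os
import stat

def has_unsupported_entries(root):
    """Return True if the tree under root contains a symlink or a named pipe."""
    # os.walk does not follow symlinks by default, so links to directories
    # show up in dirnames and can be inspected with lstat like any other entry.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            mode = os.lstat(os.path.join(dirpath, name)).st_mode
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                return True
    return False
```

If the function returns True for a tool's output, writing to local disk and uploading afterwards is likely the safer option.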

In these cases, have the tool write to local disk, then open a collection writer and use it to upload the files and/or directory trees:

import arvados

collection_writer = arvados.collection.CollectionWriter()
collection_writer.write_file('foo.txt')
collection_writer.write_directory_tree(bar_directory_path)
arvados.task_set_output(collection_writer.finish())

Applied to the FastQC script, with the reports written to a directory under tmpdir and then uploaded:

import subprocess
import arvados
import os

task = arvados.current_task()
tmpdir = task.tmpdir

outdir_path = os.path.join(tmpdir, 'out')
os.mkdir(outdir_path)

#Grab the file path pointing to the file to run fastqc on 
fastq_file = arvados.getjobparam('input_fastq_file')

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path]
subprocess.check_call(cmd)

# Upload the FastQC output directory to Keep and record it as the task output
collection_writer = arvados.collection.CollectionWriter()
collection_writer.write_directory_tree(outdir_path)
arvados.task_set_output(collection_writer.finish())

The final crunch script

import multiprocessing
import subprocess
import arvados
import arvados.crunch

outdir = arvados.crunch.TaskOutputDir()

# Grab the file path pointing to the file to run fastqc on
fastq_file = arvados.getjobparam('input_fastq_file')

# Grab the number of threads available
num_threads = multiprocessing.cpu_count()

cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)]
subprocess.check_call(cmd)

arvados.task_set_output(outdir.manifest_text())
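
As a design note, the command construction can be pulled into a small function so it can be checked without FastQC or an Arvados cluster. This is only a sketch: build_fastqc_cmd is a hypothetical helper, and the example arguments are made up. The install path and the -o and -t flags are the ones used in the script above.

```python
def build_fastqc_cmd(fastq_file, outdir, num_threads):
    """Return the FastQC argument list for subprocess.check_call (hypothetical helper)."""
    return [
        'perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file,
        '-o', outdir,
        # subprocess expects every argument to be a string
        '-t', str(num_threads),
    ]

# Example arguments are placeholders, not real paths
cmd = build_fastqc_cmd('sample.fastq', '/tmp/out', 4)
print(cmd)
```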

Writing a pipeline template to run the crunch script

...

Updated by Sarah Guthrie almost 8 years ago · 16 revisions