- Table of contents
- Writing a Script Calling a Third Party Tool
- Case study: FastQC
- Writing a Dockerfile
- How to build a docker image from a Dockerfile
- How to upload a docker image to Arvados
- How to call an external tool from a crunch script
- Where to put temporary files
- How to write data directly to Keep (Using TaskOutputDir)
- When TaskOutputDir is not the correct choice
- The final crunch script
- Writing a pipeline template to run the crunch script
Writing a Script Calling a Third Party Tool¶
Case study: FastQC¶
- Building an environment able to run FastQC
  - Writing a Dockerfile
  - Building a docker image from the Dockerfile
  - Uploading the docker image to an Arvados instance
- Writing a crunch script that runs FastQC (in the docker image)
  - Calling FastQC
  - Where to place temporary files
  - Writing output data
- Writing a pipeline template to run the crunch script
Writing a Dockerfile¶
Docker has excellent documentation on writing Dockerfiles, which we recommend reading to understand how the finished Dockerfile below was put together. Dockerfiles, as explained by Docker:
"Docker can build images automatically by reading the instructions from a Dockerfile. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build users can create an automated build that executes several command-line instructions in succession.
This page (https://docs.docker.com/engine/reference/builder/) describes the commands you can use in a Dockerfile. When you are done reading this page, refer to the Dockerfile Best Practices (https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/) for a tip-oriented guide."
- A reference for Dockerfiles: https://docs.docker.com/engine/reference/builder/
- Dockerfile best practices: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/
We strongly recommend keeping your Dockerfiles in the git repository with the crunch scripts that run inside the docker images created by them.
Dockerfile that installs FastQC:
    FROM arvados/jobs
    USER root
    RUN apt-get -q update && apt-get -qy install \
        fontconfig \
        openjdk-6-jre-headless \
        perl \
        unzip \
        wget
    USER crunch
    RUN mkdir /home/crunch/fastqc
    RUN cd /home/crunch/fastqc && \
        wget --quiet http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.4.zip && \
        unzip /home/crunch/fastqc/fastqc_v0.11.4.zip
How to build a docker image from a Dockerfile¶
Once you have a Dockerfile, you can use the docker build command to build the image from the Dockerfile instructions. The final argument is the directory that contains the Dockerfile.
    docker build -t username/imagename path/to/Dockerfile/
How to upload a docker image to Arvados¶
Once the docker image is built, you can use the arvados cli (http://doc.arvados.org/sdk/cli/index.html) command arv keep docker to upload the image to an Arvados cluster.
    arv keep docker username/imagename
How to call an external tool from a crunch script¶
We strongly recommend using the subprocess module for calling external tools. If the output is small and written to standard out, using subprocess.check_output will ensure the tool completed successfully and return the standard output.
    import subprocess
    foo = subprocess.check_output(['echo', 'foo'])
If the output is big, subprocess.check_call can redirect it to a file while ensuring the tool completed successfully.
    import subprocess
    with open('foo', 'w') as outfile:
        subprocess.check_call(['head', '-c', '1234567', '/dev/urandom'], stdout=outfile)
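Both calls raise subprocess.CalledProcessError when the tool exits with a non-zero status, so a failing tool stops the crunch script instead of silently producing partial output. The following is a minimal sketch of catching that error to log the exit status before re-raising. The run_tool helper is hypothetical, and the Python interpreter stands in for the external tool so the example runs anywhere:

```python
import subprocess
import sys


def run_tool(cmd):
    """Run cmd and return its standard output as text.

    Hypothetical helper: reports the exit status on standard error
    and re-raises if the tool fails.
    """
    try:
        return subprocess.check_output(cmd, universal_newlines=True)
    except subprocess.CalledProcessError as err:
        sys.stderr.write('%s exited with status %d\n' % (cmd[0], err.returncode))
        raise


# The Python interpreter stands in for a third-party tool here.
ok_cmd = [sys.executable, '-c', 'print("foo")']
fail_cmd = [sys.executable, '-c', 'import sys; sys.exit(3)']

output = run_tool(ok_cmd)

try:
    run_tool(fail_cmd)
    failed_status = None
except subprocess.CalledProcessError as err:
    failed_status = err.returncode
```

Re-raising keeps the behaviour of the plain subprocess calls: the script still stops on failure, but the log now records which command failed and with what status.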
FastQC writes its reports to files rather than to standard output, either next to the input file or in the directory specified by the -o flag, so we can use subprocess.check_call:
    import subprocess
    import arvados

    # Grab the path of the file to run FastQC on
    fastq_file = arvados.getjobparam('input_fastq_file')
    cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file]
    subprocess.check_call(cmd)
Where to put temporary files¶
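Each task has a temporary working directory, available as the tmpdir attribute of the current task:
    import arvados
    task = arvados.current_task()
    tmpdir = task.tmpdir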
Inside the code:
    import subprocess
    import arvados

    task = arvados.current_task()
    tmpdir = task.tmpdir

    # Grab the path of the file to run FastQC on
    fastq_file = arvados.getjobparam('input_fastq_file')
    cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', tmpdir]
    subprocess.check_call(cmd)
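When trying a script outside Crunch, there is no current task to ask for a temporary directory. The sketch below shows one possible fallback. It assumes that Crunch exports the task's scratch directory in the TASK_WORK environment variable, and get_work_dir is a hypothetical helper, not part of the Arvados SDK:

```python
import os
import tempfile


def get_work_dir(environ=None):
    """Return a scratch directory for the tool's temporary files.

    Assumption: Crunch exports the task's scratch directory in the
    TASK_WORK environment variable. Outside Crunch, fall back to a
    fresh temporary directory so the script can be tried locally.
    """
    if environ is None:
        environ = os.environ
    work_dir = environ.get('TASK_WORK')
    if work_dir:
        return work_dir
    return tempfile.mkdtemp(prefix='fastqc-')
```

Passing the environment as an argument keeps the helper easy to test without modifying os.environ.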
How to write data directly to Keep (Using TaskOutputDir)¶
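TaskOutputDir provides a directory (outdir.path) whose contents are stored in Keep. Once the tool has finished writing, set the task output to the directory's manifest:
    import arvados
    import arvados.crunch

    outdir = arvados.crunch.TaskOutputDir()
    # Write to outdir.path
    arvados.task_set_output(outdir.manifest_text())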
Inside the code:
    import subprocess
    import arvados
    import arvados.crunch

    outdir = arvados.crunch.TaskOutputDir()

    # Grab the path of the file to run FastQC on
    fastq_file = arvados.getjobparam('input_fastq_file')
    cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path]
    subprocess.check_call(cmd)
    arvados.task_set_output(outdir.manifest_text())
When TaskOutputDir is not the correct choice¶
- If the tool writes symbolic links or named pipes, which are not supported by FUSE
- If the I/O access patterns do not perform well over FUSE
  - This occurs in TopHat, which opens 20 file handles on multiple files that it writes out
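One way to find out whether a tool falls into the first case is to run it once against a local scratch directory and inspect what it produced. The following is a minimal sketch using only the standard library; find_unsupported_entries is a hypothetical helper name:

```python
import os
import stat


def find_unsupported_entries(root):
    """Return sorted relative paths of symlinks and named pipes under root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
            if stat.S_ISLNK(mode) or stat.S_ISFIFO(mode):
                found.append(os.path.relpath(path, root))
    return sorted(found)
```

An empty result only shows that this particular run created no symbolic links or named pipes. It says nothing about the tool's I/O access patterns, which have to be judged separately.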
When TaskOutputDir is not suitable, have the tool write to local storage, then open a collection writer and write the files and/or directory trees to Keep:
    import arvados

    collection_writer = arvados.collection.CollectionWriter()
    collection_writer.write_file('foo.txt')
    collection_writer.write_directory_tree(bar_directory_path)
    arvados.task_set_output(collection_writer.finish())
Inside the code:
    import subprocess
    import arvados
    import os

    task = arvados.current_task()
    tmpdir = task.tmpdir
    outdir_path = os.path.join(tmpdir, 'out')
    os.mkdir(outdir_path)

    # Grab the path of the file to run FastQC on
    fastq_file = arvados.getjobparam('input_fastq_file')
    cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir_path]
    subprocess.check_call(cmd)

    # Upload everything FastQC wrote to the local output directory
    collection_writer = arvados.collection.CollectionWriter()
    collection_writer.write_directory_tree(outdir_path)
    arvados.task_set_output(collection_writer.finish())
The final crunch script¶
fastqc.py
    import multiprocessing
    import subprocess
    import arvados
    import arvados.crunch

    outdir = arvados.crunch.TaskOutputDir()

    # Grab the path of the file to run FastQC on
    fastq_file = arvados.getjobparam('input_fastq_file')

    # Grab the number of threads available
    num_threads = multiprocessing.cpu_count()

    cmd = ['perl', '/home/crunch/fastqc/FastQC/fastqc', fastq_file, '-o', outdir.path, '-t', str(num_threads)]
    subprocess.check_call(cmd)
    arvados.task_set_output(outdir.manifest_text())
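Because the script above needs a running Crunch task, it is hard to test on its own. One option, sketched below, is to move the command construction into a plain function that can be checked without Arvados. The function name build_fastqc_command and the FASTQC_PATH constant are hypothetical; the path is the install location used in the Dockerfile:

```python
import multiprocessing

# Install location of the FastQC launcher inside the docker image.
FASTQC_PATH = '/home/crunch/fastqc/FastQC/fastqc'


def build_fastqc_command(fastq_file, outdir, num_threads=None):
    """Return the argument list used to run FastQC on one file.

    If num_threads is not given, use the number of CPUs on the node.
    """
    if num_threads is None:
        num_threads = multiprocessing.cpu_count()
    return ['perl', FASTQC_PATH, fastq_file,
            '-o', outdir,
            '-t', str(num_threads)]
```

The crunch script would then pass the result straight to subprocess.check_call, exactly as it does with cmd above.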
Writing a pipeline template to run the crunch script¶
Now we need to write a pipeline template that specifies this crunch script and the docker image we created earlier. Arvados reads the pipeline template from the API server, but, as with the Dockerfile, keeping a copy in the same git repository as the crunch script makes the code easier to maintain and helps ensure that changes to the code are reflected in the pipeline template.
Using the command arv create pipeline_template, we can create the following pipeline template.
    {
      "name": "FastQC Pipeline",
      "components": {
        "Run-FastQC": {
          "repository": "repository/name",
          "script": "fastqc.py",
          "script_version": "master",
          "script_parameters": {
            "input_fastq_file": {
              "dataclass": "Collection",
              "required": true,
              "title": "Input Paired FASTQ RNA-Seq files"
            }
          },
          "runtime_constraints": {
            "docker_image": "username/imagename",
            "max_tasks_per_node": 1
          }
        }
      }
    }
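The parameter names declared under script_parameters must match the names the crunch script asks for with arvados.getjobparam. Below is a minimal sketch of checking this before uploading a template. The helper missing_script_parameters and the trimmed EXAMPLE_TEMPLATE are illustrative assumptions, not part of the Arvados SDK:

```python
import json


def missing_script_parameters(template_text, component, expected):
    """Return the expected parameter names the component does not declare."""
    template = json.loads(template_text)
    declared = template['components'][component].get('script_parameters', {})
    return sorted(name for name in expected if name not in declared)


# Trimmed-down template used only to exercise the check.
EXAMPLE_TEMPLATE = '''
{
  "name": "FastQC Pipeline",
  "components": {
    "Run-FastQC": {
      "script": "fastqc.py",
      "script_parameters": {
        "input_fastq_file": {"dataclass": "Collection", "required": true}
      }
    }
  }
}
'''

# fastqc.py reads 'input_fastq_file', so nothing should be missing.
missing = missing_script_parameters(
    EXAMPLE_TEMPLATE, 'Run-FastQC', ['input_fastq_file'])
```

Running a check like this from the repository that holds both the template and the script catches a renamed parameter before a job fails on the cluster.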
For further information about managing a pipeline template, see Git_strategy_for_pipeline_development.
Updated by Sarah Guthrie over 8 years ago · 22 revisions