Port a Pipeline

Like any other tool, Arvados takes time to learn. We therefore don't recommend Arvados for the initial development of analysis pipelines or for exploratory research on small subsets of data, where each quick-and-dirty iteration takes minutes on a single machine. For computationally intensive work, however, Arvados offers a lot of benefits.

Okay, cool, provenance, reproducibility, easily scaling to gigabytes of data and mucho RAM, evaluating existing pipelines like lobSTR quickly.

But what if you want these benefits when running your own pipelines?
In other words, how do you port a pipeline to Arvados?

1. Quick Way

First, do you just want to parallelize a single bash script?

Check if you can use arv-run. Take this arv-run example, which searches multiple FASTA files in parallel and saves the results to Keep through shell redirection:

$ arv-run grep -H -n GCTACCAAGTTT \< *.fa \> output.txt

Or this example, which runs a shell script:

$ echo 'echo hello world' > hello.sh
$ arv-run /bin/sh hello.sh

(Lost? Check out http://doc.arvados.org/user/topics/arv-run.html)

1.1 Install arv-run

(You can skip this step if you're working on an Arvados shell node. arv-run is already installed and configured for you there.)

See: http://doc.arvados.org/sdk/python/sdk-python.html and http://doc.arvados.org/user/reference/api-tokens.html, or in short below:

$ sudo apt-get install python-pip python-dev python-yaml
$ sudo pip install --pre arvados-python-client

(Lost? See http://doc.arvados.org/sdk/python/sdk-python.html )

If you try to use arv-run right now, it will complain about some settings you're missing. To fix that,

  1. Go to http://cloud.curoverse.com
  2. Log in with any Google account (you may need to click login a few times if you hit multiple redirects from Google)
  3. Click in the upper right on your account name -> Manage Account
    ...
  4. Optional: While you're here, click "send request for shell access", since that will give you shell access to a VM with all of the Arvados tools pre-installed.
    1) 2) 3)
  5. Copy the lines of text into your terminal, something like
    HISTIGNORE=$HISTIGNORE:'export ARVADOS_API_TOKEN=*'
    export ARVADOS_API_TOKEN=sekritlongthing
    export ARVADOS_API_HOST=qr1hi.arvadosapi.com
    unset ARVADOS_API_HOST_INSECURE
    
    ...
  6. If you want this to persist across reboot, add the above lines to ~/.bashrc or your ~/.bash_profile

(Lost? See http://doc.arvados.org/user/reference/api-tokens.html )
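
Before going further, it can help to confirm that both settings are actually present in your current shell. A minimal sketch; the function name check_arvados_env is made up for this example and is not part of the Arvados tools:

```shell
# Report which of the two required Arvados settings are missing from the environment.
check_arvados_env() {
    missing=0
    for name in ARVADOS_API_HOST ARVADOS_API_TOKEN; do
        # Indirectly read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_arvados_env && echo ok prints ok only when both variables are set. It does not check that the token is valid, only that it is there.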

1.2 Submit job to Arvados

First, check: Does your command work locally?

$ grep -H -n TGGAAGT *.fa

...

(If you want to follow along and don't have fasta files -- use the ones here: https://workbench.qr1hi.arvadosapi.com/collections/qr1hi-4zz18-0o2bt8216d7trrw)

If so, then submit it to Arvados using arv-run

$ arv-run grep -H -n TGGAAGT \< *.fa \> output.txt
  • This bash command stores the results as output.txt
  • Note that grep exits with a non-zero status when it finds no matches, so Arvados will report this pipeline as failed if grep does not find anything, and no output will appear on Arvados
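
If an empty result is acceptable for your search, one way around the failed-pipeline report is to wrap grep so that exit status 1 (no lines matched) counts as success, while real errors (status 2 or higher) still fail. A minimal sketch; the function name grep_allow_empty is made up for this example, and it has not been tested on an Arvados cluster:

```shell
# Run grep, but treat "no match" (exit status 1) as success.
# Genuine errors such as a missing file (status 2 or higher) are passed through.
grep_allow_empty() {
    status=0
    grep "$@" || status=$?
    if [ "$status" -eq 1 ]; then
        return 0
    fi
    return "$status"
}
```

You could put this function and a call such as grep_allow_empty -H -n TGGAAGT *.fa into a shell script and submit that script with arv-run /bin/sh, as in the hello.sh example above.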

Your dataset is uploaded to Arvados if it wasn't there already (which may take a while for a large dataset), your grep job is submitted to run on the Arvados cluster, and you get the results in a few minutes (stored in output.txt in Arvados). If you go to Workbench at http://cloud.curoverse.com, you will see the pipeline running. It may take a few minutes for Arvados to spin up a node, provision it, and run your job. The robots are working hard for you, so grab a cup of coffee.

(Lost? See http://doc.arvados.org/user/topics/arv-run.html )

1.3 However

If your pipeline looks more like

...
... yes, that is a screenshot of an actual pipeline graph auto-generated by Arvados

arv-run is not powerful enough. Here we gooooo.

2. In Short

Estimated reading time: 1 hour.

You must write a pipeline template that describes your pipeline to Arvados.

2.1 VM (Virtual Machine) Access

Note: You'll need the Arvados set of command-line tools to follow along. The easiest way to get started is to get access to a Virtual Machine (VM) with all the tools pre-installed.

  1. Go to http://cloud.curoverse.com
  2. Log in with a Google account (you may need to click login a few times, our redirects are not working well)
  3. Click in the upper right on your account name -> Manage Account
  4. Hit the "Request shell access" button under Manage Account in workbench.

2.2 Pipeline Template Example

Here is what a simple pipeline template looks like, where the output of program A is provided as input to program B. We'll explain what it all means shortly, but first, don't worry -- you'll never be creating a pipeline template from scratch. You'll always copy and modify an existing boilerplate one (yes, a template for the pipeline template! :])

pipelinetemplate.json
{
"name": "Tiny Bash Script",
"components": {
"Create Two Files": {
"script": "run-command",
"script_version": "master",
"repository": "nancy",
"script_parameters": {
"command": [
"$(job.srcdir)/crunch_scripts/createtwofiles.sh"
]
},
"runtime_constraints": {
"docker_image": "nancy/cgatools-wormtable"
}
},
"Merge Files": {
"script": "run-command",
"script_version": "master",
"repository": "nancy",
"script_parameters": {
"command": [
"$(job.srcdir)/crunch_scripts/mergefiles.sh",
"$(input)"
],
"input": {
"output_of": "Create Two Files"
}
},
"runtime_constraints": {
"docker_image": "nancy/cgatools-wormtable"
}
}
}
}

3. simple and sweet port-a-pipeline example

Okay, let's dig into what's going on.

3.1 the setup

We'll port an artificially simple pipeline which involves just two short bash scripts, described as "A" and "B" below:

Script A. Create two files
Input: nothing
Output: two files (out1.txt and out2.txt)

Script B. Merge two files into a single file
Input: output of step A
Output: a single file (output.txt)

Or visually (ignore the long strings of gibberish in the rectangles for now):

...

Here's what we've explained so far in the pipeline template:

pipelinetemplate.json
{
"name": "Tiny Bash Script",
"components": {
"Create Two Files": {
"script": "run-command",
"script_version": "master",
"repository": "nancy",
"script_parameters": {
"command": [
"$(job.srcdir)/crunch_scripts/createtwofiles.sh"
]
},
"runtime_constraints": {
"docker_image": "nancy/cgatools-wormtable"
}
},
"Merge Files": {
"script": "run-command",
"script_version": "master",
"repository": "nancy",
"script_parameters": {
"command": [
"$(job.srcdir)/crunch_scripts/mergefiles.sh",
"$(input)"
],
"input": {
"output_of": "Create Two Files"
}
},
"runtime_constraints": {
"docker_image": "nancy/cgatools-wormtable"
}
}
}
}

3.2 arv-what?

Before we go further, let's take a quick step back. Arvados consists of two parts

Part 1. Keep - I have all your files in the cloud!

You can access your files through your browser, using Workbench, or using the Arvados command line (CLI) tools (link: http://doc.arvados.org/sdk/cli/index.html ).

Visually, in Workbench, the built-in Arvados web interface, this looks like
...

Or via the command-line interface
...

Part 2. Crunch - I run all your scripts in the cloud!

Crunch both dispatches jobs and provides version control for your pipelines.

You describe your workflow to Crunch using pipeline templates. Pipeline templates describe a pipeline ("workflow") by defining a set of pipeline components that represent each step in the workflow. The definition of each component includes the job script to run, the environment (e.g. docker image) in which to run it, its configurable parameters, and the input data that it requires. Input data can be hard coded in a pipeline template to a specific keep content address, can be left to be configured at pipeline instantiation, or can be referenced as the "output_of" another component within the pipeline template. By referencing the input data for one component as the output of another component in the pipeline, a high-level workflow graph is formed which implicitly tells Arvados in which order the components should be run.

...
... Each task starts when all its inputs have been created
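
To make the ordering idea concrete, the standard tsort utility can derive a run order from "output_of" relationships written as producer/consumer pairs. This only illustrates the concept with the two components from the template above (spaces replaced by underscores); it is not how Arvados itself schedules jobs, and run_order is a name invented for this sketch.

```shell
# Each input line is "producer consumer": the consumer needs the producer's output.
run_order() {
    printf '%s\n' 'Create_Two_Files Merge_Files' | tsort
}

# Prints the components in an order that respects the dependency.
run_order
```

With a single dependency, Create_Two_Files is listed before Merge_Files. Adding more "producer consumer" lines extends the graph in the same way.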

Once you save a pipeline template in Arvados, you run it by creating a pipeline instance that lists the specific inputs you’d like to use. The pipeline’s final output(s) will be saved in a project you specify.

Concretely, a pipeline template describes

  • data inputs - specified as Keep content addresses
  • job scripts - stored in a Git version control repository and referenced by a commit hash
  • parameters - which, along with the data inputs, can have default values or can be filled in later when the pipeline is actually run
  • the execution environment - stored in Docker images and referenced by Docker image name

What is Docker? Docker allows Arvados to replicate the execution environment your tools need. You install whatever bioinformatics tools (bwa-mem, vcftools, etc.) you are using inside a Docker image, upload it to Arvados, and Arvados will recreate your environment for computers in the cloud.

Protip: Install stable external tools in Docker. Put your own scripts in a Git repository. Each Docker image is about 1-5 GB, so a new image takes a while to upload (about 30 minutes) unless you are using Arvados on a local cluster. In the future, we hope to use small diff files describing just the changes made to a Docker image instead of the full image. [Last updated 19 Feb 2015]

3.3 In More Detail

Alright, let's put that all together.

pipelinetemplate.json
{
"name": "Tiny Bash Script",
"components": {
"Create Two Files": {
"script": "run-command",
"script_version": "master",
"repository": "nancy",
"script_parameters": {
"command": [
"$(job.srcdir)/crunch_scripts/createtwofiles.sh" #[1]
]
},
"runtime_constraints": {
"docker_image": "nancy/cgatools-wormtable"
}
},
"Merge Files": {
"script": "run-command",
"script_version": "master",
"repository": "nancy",
"script_parameters": {
"command": [
"$(job.srcdir)/crunch_scripts/mergefiles.sh", #[2]
"$(input)"
],
"input": {
"output_of": "Create Two Files" #[3]
}
},
"runtime_constraints": {
"docker_image": "nancy/cgatools-wormtable"
}
}
}
}

Explanation

[1] $(job.srcdir) references the git repository "in the cloud". Even though run-command is in nancy/crunch_scripts/ and is "magically found" by Arvados, INSIDE run-command you can't reference other files in the same repo as run-command without this magic variable.

Any output files produced by this run-command step will be automagically stored in Keep as an auto-named collection (which you can think of as a folder for now).

[2] Okay, so how does the next script know where to find the output of the previous job? run-command keeps track of the collections it has created, so we can feed that in as an argument to our next script. In this "command" section under "run-command", you can think of the commas as spaces. Thus, what this line is saying is "run mergefiles.sh on the previous output", or

$ mergefiles.sh [directory with output of previous command]

[3] Here we set the variable "input" to point to the directory with the output of the previous command "Create Two Files".

(Lost? Try the hands-on example in the next section, or read the more detailed documentation on the Arvados website.)

3.4 All hands on deck!

Okay, now that we know the rough shape of what's going on, let's get our hands dirty.

From your local machine, log in to the Arvados virtual machine

Single step:

nrw@ nrw-local $ ssh 

(Lost? See "SSH access to machine with Arvados commandline tools installed" http://doc.arvados.org/user/getting_started/ssh-access-unix.html )

In VM, create pipeline template

A few steps:

nancy@ lightning-dev4.qr1hi :~$ arv create pipeline_template
Created object qr1hi-p5p6p-3p6uweo7omeq9e7
$ arv edit qr1hi-p5p6p-3p6uweo7omeq9e7 #Create the pipeline template as described above! [[Todo: concrete thing]]

(Lost? See "Writing a pipeline template" http://doc.arvados.org/user/tutorials/running-external-program.html )
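
A stray comma or a leftover annotation makes a template invalid JSON, so it may be worth checking the file before saving it with arv edit. A minimal sketch, assuming python3 is available on your machine and that you saved the template as pipelinetemplate.json; the function name validate_json is made up for this example:

```shell
# Exit status 0 if the file parses as JSON, non-zero (with an error message) otherwise.
validate_json() {
    python3 -m json.tool "$1" > /dev/null
}
```

For example, validate_json pipelinetemplate.json && echo "valid JSON". Note that the #[1]-style markers used in the explanations on this page are not valid JSON, so remove them before running the check.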

In VM, set up git repository with run-command and our scripts

A few steps:

$ mkdir ~/projects
$ cd ~/projects
~/projects $ git clone :nancy.git

(Lost? Find your own git URL by going to https://workbench.qr1hi.arvadosapi.com/manage_account )

⤷ Copy run-command and its dependencies into this repository's crunch_scripts directory
$ git clone https://github.com/curoverse/arvados.git

(Lost? Visit https://github.com/curoverse/arvados )

$ cd ./nancy
~/projects/nancy$ mkdir crunch_scripts
~/projects/nancy$ cd crunch_scripts
~/projects/nancy/crunch_scripts$ cp ~/projects/arvados/crunch_scripts/run-command . #trailing dot!
~/projects/nancy/crunch_scripts$ cp -r ~/projects/arvados/crunch_scripts/crunchutil . #trailing dot!
$ cd ~/projects/nancy/crunch_scripts
$ vi createtwofiles.sh
⤷ $cat createtwofiles.sh
#!/bin/bash
echo "Hello " > out1.txt
echo "Arvados!" > out2.txt
$ vi mergefiles.sh
⤷$cat mergefiles.sh
#!/bin/bash #[1]
PREVOUTDIR=$1 #[2]
echo "$TASK_KEEPMOUNT/$PREVOUTDIR" #[3]
cat "$TASK_KEEPMOUNT/$PREVOUTDIR"/*.txt > output.txt

Explanations
[1] We use the #! syntax to tell the system which program should execute this file. This is called a shebang.

⤷To find the location of any particular tool, try using which
$ which python
/usr/bin/python
$ which bash
/bin/bash

[2] Here we give a human-readable name, PREVOUTDIR, to the first argument given to mergefiles.sh (referenced using the dollar-sign syntax, à la $1). Referring back to the pipeline template, we defined that argument as the directory containing the output of the previous job (which ran createtwofiles.sh).

(Lost about $1? Google "passing arguments to the bash script").

[3] Using the environment variable TASK_KEEPMOUNT means we don't have to make assumptions about where Keep is mounted. Arvados automatically sets TASK_KEEPMOUNT to the location where Keep is mounted on each worker node. (Lost? Visit http://doc.arvados.org/user/tutorials/tutorial-keep-mount.html )

$ chmod +x createtwofiles.sh mergefiles.sh # make these files executable
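
Before committing, you may want to try both scripts on your own machine. The sketch below recreates them in a temporary directory and uses a local directory in place of the Keep mount. The directory names (keep, prevout, out) are invented for this example; on Arvados the real values are supplied by run-command.

```shell
# Local dry run of the two scripts, with a temporary directory standing in for Keep.
work=$(mktemp -d)
mkdir -p "$work/keep/prevout" "$work/out"

cat > "$work/createtwofiles.sh" <<'EOF'
#!/bin/bash
echo "Hello " > out1.txt
echo "Arvados!" > out2.txt
EOF

cat > "$work/mergefiles.sh" <<'EOF'
#!/bin/bash
PREVOUTDIR=$1
cat "$TASK_KEEPMOUNT/$PREVOUTDIR"/*.txt > output.txt
EOF

# Step A: write out1.txt and out2.txt into the pretend Keep collection.
(cd "$work/keep/prevout" && sh "$work/createtwofiles.sh")

# Step B: merge them, passing the collection name as the first argument.
(cd "$work/out" && TASK_KEEPMOUNT="$work/keep" sh "$work/mergefiles.sh" prevout)

cat "$work/out/output.txt"
```

The final cat shows the merged file: a line with "Hello " followed by a line with "Arvados!". If that works locally, any later failure is more likely to be in the template or the environment than in the scripts.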

Commit changes and push to remote

A few steps:

$ git status #check that everything looks ok
$ git add *
$ git commit -m 'hello world-of-arvados scripts!'
$ git push

Create Docker image with Arvados command-line tools and other tools we want

Note: This section assumes that you have Docker installed and usable under your user account. However, because users with Docker access can defeat a lot of system security, it's not available on all Arvados shells. If your Arvados VM doesn't provide you access to Docker, you have two options: you can ask the site administrator to grant you access, or you can install Docker on your own GNU/Linux workstation and upload the image to Arvados from there. To learn how to do that, see the installation guides for Docker Engine and the Arvados Python SDK, which includes the arv-keepdocker tool to upload an image.

A few steps:

$ docker pull arvados/jobs
$ docker run -ti -u root arvados/jobs /bin/bash

Now we are in the docker image.

root@4fa648c759f3:/# apt-get update
  ⤷In the docker image, install external tools that you don't expect to need to update often.
For instance, we can install the wormtable python tool in this docker image
# apt-get install libdb-dev
# pip install wormtable
    ⤷ Note: If you're installing from binaries, you should either
1) Install them in one of the existing places where bash looks for programs (e.g. install in /usr/local/bin/cgatools).
To see where bash looks, inspect the PATH variable.
# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2) Or, if you put them in a custom directory, remember to reference them by their full path in your scripts
(e.g. spell out /home/nrw/local/bin/cgatools).
Arvados will not respect changes to the $PATH variable made in the ~/.bashrc configuration file in the Docker image.

(Lost? See http://doc.arvados.org/user/topics/arv-docker.html )
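
Since PATH changes made in ~/.bashrc are not picked up, a job script can fail fast when a tool is not where it expects. A minimal sketch to place at the top of a script; require_tools is a hypothetical helper name, and the tool names passed to it are just examples:

```shell
# Return non-zero and name the first tool that cannot be found,
# either on PATH or at the full path given.
require_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            return 1
        fi
    done
}
```

For instance, require_tools cat /home/nrw/local/bin/cgatools || exit 1 stops the script early with a readable message instead of failing halfway through.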

root@4fa648c759f3:/# exit

Commit Docker image

$ docker commit 4fa648c759f3 nancy/cgatools-wormtable #Label the image thoughtfully
$ #For instance here I used the name of key tools I installed: cgatools & wormtable

Upload Docker image from your VM to Keep

Note: arv-keepdocker saves the Docker image in ~/.cache/arvados/docker before uploading, so it can resume in case of interruption. If the /home partition is not big enough to hold the Docker image, you may get strange I/O errors about pipe closed or stdin full. You can prevent this by making ~/.cache/arvados/docker a symlink to another directory you control where enough space is available. An example command for that might look like: ln -s /scratch/MYNAME/docker ~/.cache/arvados/docker
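
The ln -s command above only behaves as intended if ~/.cache/arvados/docker does not exist yet. A slightly more defensive sketch, where relocate_docker_cache is a made-up helper name and the target directory is whatever location you control:

```shell
# Point ~/.cache/arvados/docker at a directory with more free space.
# Refuses to touch an existing real directory so no cached data is hidden.
relocate_docker_cache() {
    target=$1
    cache="$HOME/.cache/arvados/docker"
    mkdir -p "$target" "$(dirname "$cache")"
    if [ -e "$cache" ] && [ ! -L "$cache" ]; then
        echo "$cache already exists; move its contents first" >&2
        return 1
    fi
    # -n replaces an existing symlink instead of descending into it.
    ln -sfn "$target" "$cache"
}
```

Usage would look like relocate_docker_cache /scratch/MYNAME/docker, run once before arv-keepdocker.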

$ arv-keepdocker nancy/cgatools-wormtable #this takes a few minutes
$ arv-keepdocker #lists docker images in the cloud, so you can double-check what was uploaded 

Run this pipeline!
Go to Workbench and hit Run.

$ firefox http://qr1hi.arvadosapi.com

[!image: workbench with 'tiny bash script']

Note: If no worker nodes are already provisioned, your job may take up to 10 minutes to queue up and start. Behind-the-scenes, Arvados is requesting compute nodes for you and installing your Docker image and otherwise setting up the environment on those nodes. Then Arvados will be ready to run your job. Be patient -- the wait time may seem frustrating for a trivial pipeline like this, but Arvados really excels at handling long and complicated pipelines with built-in data provenance and pipeline reproducibility.

3.5 Celebrate

Whew! Congratulations on porting your first pipeline to Arvados! Check out http://doc.arvados.org/user/topics/crunch-tools-overview.html to learn more about the different ways to port pipelines to Arvados and how to take full advantage of Arvados's features, like restarting pipelines from where they failed instead of from the beginning.

4. Debugging Tips and Pro-Tips

4.1 Pro-tips

Keep mounts are read-only right now. [19 March 2015]
Need to 1) make some temporary directories or 2) change directories away from wherever you started out in but still upload the results to keep?

For 1, explicitly use the $HOME directory and make the temporary files there.
For 2, record the present working directory, $(pwd), at the beginning of your script. That is the directory where run-command will look for files to upload to Keep.

Here's an example:

$ cat mergefiles.sh
  TMPDIR=$HOME #directory to make temporary files in
  OUTDIR=$(pwd) #directory to put output files in
  mkdir -p "$TMPDIR" #-p: no error if the directory already exists
  touch "$TMPDIR/sometemporaryfile.txt" #this file is deleted when the worker node is stopped
  touch "$OUTDIR/someoutputfile.txt" #this file will be uploaded to keep by run-command

  • Make sure you point to the right repository, your own or arvados.
  • Make sure you pin the versions of your Python SDK, Docker image, and scripts, or you will not get reproducibility.
  • If you have a file you want to use as a crunch script, make sure it's in a crunch_scripts directory, i.e. ~/path/to/git/repo/crunch_scripts/foo.py. Otherwise, Arvados will not find it.
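
For the script version, one option is to replace a branch name such as master in "script_version" with the exact commit you tested. A small sketch that prints that commit for a checkout; current_commit is a made-up helper name:

```shell
# Print the full commit hash of the checkout at the given path
# (defaults to the current directory).
current_commit() {
    git -C "${1:-.}" rev-parse HEAD
}
```

Running current_commit ~/projects/nancy after git push gives the hash to paste into the template, so a later push to master does not silently change what the pipeline runs.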

4.2 Common log errors and reasons for pipelines to fail

Todo.

4.3 Miscellaneous Notes

Another way to avoid the read-only Keep mount problem is to use task.vwd, which creates symlinks from the writable output directory to the collection in Keep. If you can change your working directory to the output directory and do all your work there, you'll avoid the read-only Keep issue. (Lost? See http://doc.arvados.org/user/topics/run-command.html )

When indexing (e.g. tabix, bwa index), the index file tends to be created in the same directory as your FASTQ file. To avoid this, use the task.vwd approach described above. There is no way to send the index file to another directory. If you figure out a way, please tell me.

"bash" "-c" could be your friend: it works sometimes, and sometimes it doesn't. I don't have a good handle on why.

If you're trying to iterate over more than one file using task.foreach, it's important to know that run-command uses an m x n method of making groups. I don't think I can explain it right now, but it may not be exactly what you want, and you can trip over it. (Lost? See http://doc.arvados.org/user/topics/run-command.html )

When trying to pair up reads, it's hard to use run-command. You have to manipulate the basename and hope your file names are foo.1 and foo.2. basename will treat the group as foo (because you'll regex the groups as foo), and you can glob for foo.1 and foo.2. But if the file names are foo_1 and foo_2, you can't regex search them for foo, because you'll get both names into a group and you'll be iterating through both of them twice, because of m x n.
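
If run-command's grouping does not fit, one alternative is to do the pairing inside your own script. A minimal sketch in plain shell, assuming (hypothetically) that paired files are named NAME_1.fastq and NAME_2.fastq and sit in one directory; list_pairs is a made-up helper, not part of run-command:

```shell
# Print one tab-separated line per read pair found in the given directory.
# Files without a matching _2 mate are skipped.
list_pairs() {
    dir=$1
    for r1 in "$dir"/*_1.fastq; do
        # If the glob matched nothing, the pattern itself is returned; skip it.
        [ -e "$r1" ] || continue
        base=${r1%_1.fastq}
        r2="${base}_2.fastq"
        if [ -e "$r2" ]; then
            printf '%s\t%s\n' "$r1" "$r2"
        fi
    done
}
```

Each output line can then be fed to your aligner, so every pair is visited exactly once regardless of how the files would have been grouped.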

Your scripts need to point to the right place where the file is. It's currently hard to figure out how to grep the file names; you have to do some magic through the collection API.

5. Learn More

To learn more, head over to the Arvados User Guide documentation online: http://doc.arvados.org/user/

Updated by Brett Smith over 8 years ago · 43 revisions