h1. Port a Pipeline

Nancy Ouyang, 03/26/2015 08:10 PM

Like any other tool, Arvados requires time to learn. Thus, we don't encourage using Arvados for initial development of analysis pipelines or exploratory research on small subsets of data, when each quick-and-dirty iteration takes minutes on a single machine. But for any computationally intensive work, Arvados offers a lot of benefits.

Okay, sweet: provenance, reproducibility, easy scaling to gigabytes of data and mucho RAM, quick evaluation of existing pipelines like lobSTR.

But what if you want these sweet benefits when running your own pipelines?

In other words, how do you **port a pipeline** to Arvados?

h2. 1. Quick Way

First, do you just want to parallelize a single bash script?

Check if you can use arv-run. Take this arv-run example, which searches multiple FASTQ files in parallel and saves the results to Keep through shell redirection:

    [[TODO]]
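
(A sketch of what this could look like, mirroring the grep example in section 1.2 below; the search pattern and file names are illustrative:)

<pre>
$ arv-run grep -H -n GCTACCAAGTTT \< *.fa \> output.txt
</pre>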

h3. 1.1 Install arv-run

See: http://doc.arvados.org/sdk/python/sdk-python.html and http://doc.arvados.org/user/reference/api-tokens.html, or in short below:

<pre>
$ sudo apt-get install python-pip python-dev libattr1-dev libfuse-dev pkg-config python-yaml
$ sudo pip install --pre arvados-python-client
</pre>

(Lost? See http://doc.arvados.org/sdk/python/sdk-python.html )

If you try to use arv-run right now, it will complain about some settings you're missing. To fix that:

# Go to su92l.arvadosapi.com
# Log in with any Google account (you may need to click login a few times if you hit multiple redirects from Google)
# Click your account name in the upper right -> Manage Account
# Optional: While you're here, click "send request for shell access", since that will give you shell access to a VM with all of the Arvados tools pre-installed.
# Copy the lines of text, something like
<pre>
HISTIGNORE=$HISTIGNORE:'export ARVADOS_API_TOKEN=*'
export ARVADOS_API_TOKEN=sekritlongthing
export ARVADOS_API_HOST=su92l.arvadosapi.com
unset ARVADOS_API_HOST_INSECURE
</pre>
# If you want this to persist across reboots, add these lines to your ~/.bashrc or ~/.bash_profile
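
To double-check that your token works, you could ask Arvados who you are; @arv user current@ should print your user record rather than an authentication error:

<pre>
$ arv user current
</pre>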

(Lost? See http://doc.arvados.org/user/reference/api-tokens.html )

h3. 1.2 Submit job to Arvados

First, check: does your command work locally?

    $ grep -H -n GCTACCAAGTTT *.fa

If so, submit it to Arvados using @arv-run@:

    $ arv-run grep -H -n GCTACCAAGTTT \< *.fa \> output.txt

* This bash command stores the results as output.txt
* Note that due to the particulars of grep, Arvados will report this pipeline as failed if grep does not find anything, and no output will appear on Arvados

Your dataset is uploaded to Arvados if it wasn't there already (which may take a while if it's a large dataset), your "grep" job is submitted to run on the Arvados cluster, and you get the results in a few minutes (stored inside output.txt in Arvados). If you go to Workbench at su92l, you will see the pipeline running. It may take a few minutes for Arvados to spool up a node, provision it, and run your job. The robots are working hard for you, so grab a cup of coffee.

(Lost? See http://doc.arvados.org/user/topics/arv-run.html )

h3. 1.3 However

If your pipeline looks more like [!image crazy graph], arv-run is not powerful enough. Here we gooooo.

h2. 2. In Short

**Estimated reading time: 1 hour.**

You must write a **pipeline template** that describes your pipeline to Arvados.

h3. 2.1 VM (Virtual Machine) Access

Note: You'll need the Arvados set of command-line tools to follow along. The easiest way to get started is to get access to a Virtual Machine (VM) with all the tools pre-installed.

# Go to su92l.arvadosapi.com
# Log in with your Google account (you may need to click login a few times, our redirects are not working well)
# Click your account name in the upper right -> Manage Account
# Hit the "Request shell access" button under Manage Account in Workbench.

h3. 2.2 Pipeline Template Example

Here is what a simple pipeline template looks like, where the output of program A is provided as input to program B. We'll explain what it all means shortly, but first, don't worry -- you'll never be creating a pipeline template from scratch. You'll always copy and modify an existing boilerplate one (yes, a template for the pipeline template! :])

    **pipelinetemplate.json**
    {
      "name": "Tiny Bash Script",
      "components": {
        "Create Two Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/createtwofiles.sh"
            ]
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        },
        "Merge Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/mergefiles.sh",
              "$(input)"
            ],
            "input": {
              "output_of": "Create Two Files"
            }
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        }
      }
    }

h2. 3. simple and sweet port-a-pipeline example

Okay, let's dig into what's going on.

h3. 3.1 the setup
We'll port an artificially simple pipeline which involves just two short bash scripts, described as "A" and "B" below:
121
122
**Script A. Create two files**
123
Input: nothing
124
Output: two files (out1.txt and out2.txt)
125
126
**Script B. Merge two files into a single file**
127
Input: output of step A
128
Output: a single file (output.txt)
129
130
Visually, this looks like [!graph image] (ignore the long strings of gibberish in the rectangles for now).
131
132
Here's what we've explained so far in the pipeline template:
133
134

    **pipelinetemplate.json**
    {
      **"name": "Tiny Bash Script",**
      "components": {
        **"Create Two Files": {**
          "script": "run-command",
          "script_version": "master",
          "repository": "arvados",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/*createtwofiles.sh*"
            ]
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        },
        **"Merge Files": {**
          "script": "run-command",
          "script_version": "master",
          "repository": "arvados",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/*mergefiles.sh*",
              "$(input)"
            ],
            **"input": {**
              **"output_of": "Create Two Files"**
            }
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        }
      }
    }

h3. **3.2 arv-what?**

Before we go further, let's take a quick step back. Arvados consists of two parts:

**Part 1. Keep** - I have all your files in the cloud!

You can access your files through your browser, using **Workbench**, or using the Arvados command line (CLI) tools (link: http://doc.arvados.org/sdk/cli/index.html ).

Visually, this looks like
[!image 1: workbench]
[!image 2: shell session, arv mount]

**Part 2. Crunch** - I run all your scripts in the cloud!

Crunch both dispatches jobs and provides version control for your pipelines.

You describe your workflow to Crunch using **pipeline templates**. Pipeline templates describe a pipeline ("workflow") and the type of inputs each step in the pipeline requires. You provide a high-level description of how data flows through the pipeline (for example, the outputs of programs A and B are provided as input to program C) and let Crunch take care of the details of starting the individual programs at the right time with the inputs you specified.

[!image 2: complex pipeline]

Once you save a pipeline template in Arvados, you run it by creating a pipeline instance that lists the specific inputs you'd like to use. The pipeline's final output(s) will be saved in a project you specify.

Concretely, a pipeline template describes

* **data inputs** - specified as Keep content addresses
* **job scripts** - stored in a Git version control repository and referenced by a commit hash
* **parameters** - which, along with the data inputs, can have default values or can be filled in later when the pipeline is actually run
* **the execution environment** - stored in Docker images and referenced by Docker image name

**What is Docker?** Docker allows Arvados to replicate the execution environment your tools need. You install whatever bioinformatics tools (bwa-mem, vcftools, etc.) you are using inside a Docker image, upload it to Arvados, and Arvados will recreate your environment on computers in the cloud.

**Protip:** Install stable external tools in Docker. Put your own scripts in a Git repository. This is because each Docker image is about 500 MB, so each new Docker image takes a while to upload (about 30 minutes) if you are not using Arvados on a local cluster. In the future, we hope to use small diff files describing just the changes made to a Docker image instead of the full Docker image. [Last updated 19 Feb 2015]
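
To make the cost difference concrete, here is an illustrative sketch using the same commands this guide uses later on: updating one of your own scripts is a quick git push, while changing a tool baked into the Docker image means re-uploading the whole image.

<pre>
# cheap and fast: your script lives in git
$ git commit -am "tweak mergefiles.sh" && git push

# slow: a changed tool means re-uploading the Docker image
$ arv keep docker nancy/cgatools-wormtable
</pre>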

h3. 3.3 In More Detail

Alright, let's put that all together.

    **pipelinetemplate.json**
    {
      "name": "Tiny Bash Script",
      "components": {
        "Create Two Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/createtwofiles.sh" **#[1]**
            ]
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        },
        "Merge Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/mergefiles.sh", **#[2]**
              "$(input)"
            ],
            "input": {
              "output_of": "Create Two Files" **#[3]**
            }
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        }
      }
    }

**Explanation**

[1] **$(job.srcdir)** references the git repository "in the cloud". Even though **run-command** is in nancy/crunch_scripts/ and is "magically found" by Arvados, INSIDE run-command you can't reference other files in the same repo as run-command without this magic variable.

Any output files resulting from this run-command will be automagically stored to Keep as an auto-named collection (which you can think of as a folder for now).

[2] Okay, so how does the next script know where to find the output of the previous job? run-command keeps track of the collections it has created, so we can feed that in as an argument to our next script. In this "command" section under "run-command", you can think of the commas as spaces. Thus, what this line is saying is "run mergefiles.sh on the previous output", or

  $ mergefiles.sh [directory with output of previous command]

[3] Here we set the variable "input" to point to the directory with the output of the previous command, "Create Two Files".
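
Concretely, when the pipeline runs, "$(input)" is replaced by the identifier of the collection produced by "Create Two Files", so the invocation looks conceptually like this (the hash is purely illustrative):

<pre>
$ mergefiles.sh c1bd4d5a55d5ba58f34f07a0f79a67e7+83
</pre>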

(Lost? Try the hands-on example in the next section, or read more detailed documentation on the Arvados website:

* http://doc.arvados.org/user/tutorials/running-external-program.html
* http://doc.arvados.org/user/topics/run-command.html
* http://doc.arvados.org/api/schema/PipelineTemplate.html )

h3. 3.4 All hands on deck!

Okay, now that we know the rough shape of what's going on, let's get our hands dirty.

**From your local machine, log in to the Arvados virtual machine**

    nrw@ **nrw-local** $ ssh nancy@lightning-dev4.shell.arvados

(Lost? See "SSH access to machine with Arvados commandline tools installed" http://doc.arvados.org/user/getting_started/ssh-access-unix.html )

**In VM, create pipeline template**

A few steps:

    nancy@ *@lightning-dev4.su92l@* :~$ arv create pipeline_template
    Created object qr1hi-p5p6p-3p6uweo7omeq9e7
    $ arv edit qr1hi-p5p6p-3p6uweo7omeq9e7 # Create the pipeline template as described above! [[Todo: concrete thing]]

(Lost? See "Writing a pipeline template" http://doc.arvados.org/user/tutorials/running-external-program.html )

**In VM, set up a git repository with run-command and our scripts**

A few steps:

    $ mkdir ~/projects
    $ cd ~/projects
    *@~/projects@* $ git clone git@git.qr1hi.arvadosapi.com:*@nancy@*.git
    $ cd nancy
    $ mkdir crunch_scripts
    $ cd crunch_scripts
    $ vi createtwofiles.sh
      ⤷$ cat createtwofiles.sh
      #!/bin/bash
      echo "Hello " > out1.txt
      echo "Arvados!" > out2.txt
    $ vi mergefiles.sh
      ⤷$ cat mergefiles.sh
      #!/bin/bash #[1]
      PREVOUTDIR=$1 #[2]
      echo $TASK_KEEPMOUNT/by_id/$PREVOUTDIR #[3]
      cat $TASK_KEEPMOUNT/by_id/$PREVOUTDIR/*.txt > output.txt

    ⤷ **Explanations**
    [1] We use the #! syntax to let bash know what to execute this file with.
        ⤷To find the location of any particular tool, try using **which**:

        $ which python
        /usr/bin/python
        $ which bash
        /bin/bash

    [2] [[TODO: $1]] Here we give a human-readable name, PREVOUTDIR, to the first argument given to mergefiles.sh, which (referring back to the pipeline template) we defined as the directory containing the output of the previous job (which ran createtwofiles.sh).

    [3] Using the environment variable TASK_KEEPMOUNT lets us avoid making assumptions about where **keep** is mounted. Arvados automatically replaces TASK_KEEPMOUNT with the location where **keep** is mounted on each worker node. (Lost? Visit http://doc.arvados.org/user/tutorials/tutorial-keep-mount.html )

    $ chmod +x createtwofiles.sh mergefiles.sh # make these files executable

**Commit changes and push to remote**

    $ git status # check that everything looks ok
    $ git add *
    $ git commit -m "hello world-of-arvados scripts!"
    $ git push

**Install Docker**

    $ sudo apt-get install docker.io
    $ sudo groupadd docker
    $ sudo gpasswd -a $USER docker # in my case, I replace $USER with "nancy"
    $ sudo service docker restart
    $ exec su -l $USER # if you don't want to log in and out or spawn a new shell, this will restart your shell

**Make docker less sad about running out of space on the VM**

    $ sudo mkdir /data/docker
    $ sudo vi /etc/default/docker
         ⤷$ cat /etc/default/docker
         DOCKER_OPTS="--graph='/data/docker'"
         export TMPDIR="/data/docker"

**Make Arvados less sad about running out of space on the VM**

    $ sudo mkdir /data/docker-cache
    $ sudo chown nancy:nancy /data/docker-cache
    $ ln -s /data/docker-cache docker

**Create Docker image with Arvados command-line tools and other tools we want**

    $ docker pull arvados/jobs
    $ docker run -ti arvados/jobs /bin/bash
    root@4fa648c759f3:/# apt-get update

    ⤷In the docker image, install external tools that you don't expect to need to update often.
    For instance, we can install the wormtable python tool in this docker image:

    # apt-get install libdb-dev
    # pip install wormtable

        ⤷ Note: If you're installing from binaries, you should either
        1) Install in existing places where bash looks for programs (e.g. install in /usr/local/bin/cgatools). To see where bash looks, inspect the PATH variable:
        # echo $PATH
        /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

        2) If you put them in a custom directory, remember to reference them as such in your scripts (e.g. spell out /home/nrw/local/bin/cgatools). Arvados will not respect modifying the $PATH variable via the ~/.bashrc configuration file in the docker image.

    (Lost? See http://doc.arvados.org/user/topics/arv-docker.html )

    root@4fa648c759f3:/# exit

**Commit Docker image**

    $ docker commit 4fa648c759f3 nancy/cgatools-wormtable # label the image thoughtfully; for instance, here I used the names of key tools I installed
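
To double-check that the commit worked, you can list your local images; the new nancy/cgatools-wormtable image should appear:

<pre>
$ docker images
</pre>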

**Upload Docker image from your VM to Keep**

    $ arv keep docker nancy/cgatools-wormtable #this takes a few minutes
    $ arv keep docker #lists docker images in the cloud, so you can double-check what was uploaded

**Run this pipeline!**

Go to Workbench and hit **Run**.

    $ firefox http://su92l.arvadosapi.com

[!image: workbench with 'tiny bash script']

**Note: If no worker nodes are already provisioned, your job may take up to 10 minutes to queue up and start.** Behind the scenes, Arvados is requesting compute nodes for you, installing your Docker image, and otherwise setting up the environment on those nodes. Then Arvados will be ready to run your job. Be patient -- the wait time may seem frustrating for a trivial pipeline like this, but Arvados really excels at handling long and complicated pipelines with built-in data provenance and pipeline reproducibility.

h3. 3.5 Celebrate

Whew! Congratulations on porting your first pipeline to Arvados! Check out http://doc.arvados.org/user/topics/crunch-tools-overview.html to learn more about the different ways to port pipelines to Arvados and how to take full advantage of Arvados's features, like restarting pipelines from where they failed instead of from the beginning.

h2. 4. Debugging Tips and Pro-Tips

h3. **4.1 Pro-tips**

**Keep mounts are read-only right now. [19 March 2015]**
Need to 1) make some temporary directories or 2) change directories away from wherever you started out in, but still upload the results to Keep?

For 1, explicitly use the $HOME directory and make the temporary files there.
For 2, use the present working directory, $(pwd), at the beginning of your script to record the directory where run-command will look for files to upload to Keep.

Here's an example:
<pre>
$ cat mergefiles.sh
  TMPDIR=$HOME # directory to make temporary files in
  OUTDIR=$(pwd) # directory to put output files in
  mkdir -p $TMPDIR # -p: don't complain if the directory already exists
  touch $TMPDIR/sometemporaryfile.txt # this file is deleted when the worker node is stopped
  touch $OUTDIR/someoutputfile.txt # this file will be uploaded to keep by run-command
</pre>

* Make sure you point to the right repository, your own or arvados.
* Make sure you pin the versions of your Python SDK, Docker image, and script, or you will not get reproducibility.
* If you have a file you want to use as a crunch script, make sure it's in a crunch_scripts directory; otherwise, Arvados will not find it. I.e. ~/path/to/git/repo/crunch_scripts/foo.py
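
For example, the repository layout used in this guide looks like the following (foo.py is the hypothetical script from the last bullet, shown only for illustration):

<pre>
~/projects/nancy/            # your git repository
└── crunch_scripts/          # Arvados looks for crunch scripts here
    ├── createtwofiles.sh
    ├── mergefiles.sh
    └── foo.py
</pre>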

h3. 4.2 Common log errors and reasons for pipelines to fail

Todo.

h3. 4.3 Miscellaneous Notes

Another way to avoid the read-only Keep mount problem is to use task.vwd, which uses symlinks from the output directory (which is writable) to the collection in Keep. If you can change your working directory to the output directory and do all your work there, you'll avoid the Keep read-only issue. (Lost? See http://doc.arvados.org/user/topics/run-command.html )
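
A sketch of what that might look like in the script_parameters of a run-command component (this assumes, per the run-command documentation, that task.vwd takes a collection reference such as an input parameter; treat the exact shape as an assumption and check the docs linked above):

<pre>
"script_parameters": {
  "command": [
    "$(job.srcdir)/crunch_scripts/mergefiles.sh"
  ],
  "task.vwd": "$(input)",
  "input": {
    "output_of": "Create Two Files"
  }
}
</pre>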

When indexing (i.e. tabix, bwa index, etc.), the index file tends to be created in the same directory as your fastq file. In order to avoid this, use ^. There is no way to send the index file to another directory; if you figure out a way, please tell me.

"bash" "-c" could be your friend; it works sometimes and sometimes it doesn't, and I don't have a good handle on why.

If you're trying to iterate over more than one file using task.foreach, it's important to know that run-command uses an m x n method of making groups. I don't think I can explain it right now, but it may not be exactly what you want and you can trip over it. (Lost? See http://doc.arvados.org/user/topics/run-command.html )
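
A sketch of the cross-product behavior (the parameter names and values are illustrative, and the claim that iterating over two list parameters yields one task per combination is my reading of the m x n note above; check the run-command docs before relying on it). With two lists of length 2, this would create 2 x 2 = 4 tasks:

<pre>
"script_parameters": {
  "command": ["echo", "$(a)", "$(b)"],
  "a": ["0", "1"],
  "b": ["x", "y"],
  "task.foreach": ["a", "b"]
}
</pre>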

When trying to pair up reads, it's hard to use run-command. You have to manipulate basename and hope your file names are foo.1 and foo.2: basename will treat the group as foo (because you'll regex the groups as foo) and you can glob for foo.1 and foo.2. But if the file names are foo_1 and foo_2, you can't regex search them for foo, because you'll get both names into a group and you'll be iterating through both of them twice, because of m x n.

Your scripts need to point to the right place where the file is. It's currently hard to figure out how to grep the file names; you have to do some magic through the collection API.

h2. 5. Learn More

To learn more, head over to the Arvados User Guide documentation online: http://doc.arvados.org/user/