h1. Port a Pipeline

Like any other tool, Arvados requires time to learn. Thus, we don't encourage using Arvados for initial development of analysis pipelines or exploratory research on small subsets of data, when each quick-and-dirty iteration takes minutes on a single machine. But for any computationally-intense work, Arvados offers a lot of benefits.

Okay, cool: provenance, reproducibility, easily scaling to gigabytes of data and mucho RAM, evaluating existing pipelines like lobSTR quickly.

But what if you want these benefits when running your own pipelines?

In other words, how do you **port a pipeline** to Arvados?

{{toc}}

h2. 1. Quick Way

First, do you just want to parallelize a single bash script?

Check if you can use @arv-run@. Take this @arv-run@ example, which searches multiple FASTA files in parallel, and saves the results to Keep through shell redirection:

    $ arv-run grep -H -n GCTACCAAGTTT \< *.fa \> output.txt

Or this example, which runs a shell script:

    $ echo 'echo hello world' > hello.sh
    $ arv-run /bin/sh hello.sh

(Lost? Check out http://doc.arvados.org/user/topics/arv-run.html )

h3. 1.1 Install arv-run

(You can skip this step if you're working on an Arvados shell node. @arv-run@ is already installed and configured for you there.)

See http://doc.arvados.org/sdk/python/sdk-python.html and http://doc.arvados.org/user/reference/api-tokens.html, or in short below:
<pre>
$ sudo apt-get install python-pip python-dev python-yaml
$ sudo pip install --pre arvados-python-client
</pre>
(Lost? See http://doc.arvados.org/sdk/python/sdk-python.html )

If you try to use @arv-run@ right now, it will complain about some settings you're missing. To fix that:

# Go to http://cloud.curoverse.com
# Login with any Google account (you may need to click login a few times if you hit multiple redirects from Google)
# Click in the upper right on your account name -> Manage Account
... !{height:10em}manage_account.png!:manage_account.png
# Optional: While you're here, click "send request for shell access", since that will give you shell access to a VM with all of the Arvados tools pre-installed.
1) !{height:10em}send_request.png!:send_request.png 2) !{height:10em}request_sent.png!:request_sent.png 3) !{height:10em}access_granted.png!:access_granted.png
# Copy the lines of text into your terminal, something like
<pre>
HISTIGNORE=$HISTIGNORE:'export ARVADOS_API_TOKEN=*'
export ARVADOS_API_TOKEN=sekritlongthing
export ARVADOS_API_HOST=qr1hi.arvadosapi.com
unset ARVADOS_API_HOST_INSECURE
</pre> ... !{height:5em}terminal_ssh.png!:terminal_ssh.png
# If you want this to persist across reboots, add the above lines to @~/.bashrc@ or your @~/.bash_profile@

(Lost? See http://doc.arvados.org/user/reference/api-tokens.html )

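To double-check that the settings took effect, you can ask the API server who it thinks you are. A quick sanity check using the Python SDK you just installed (it should print your user UUID rather than an authentication error):
<pre>
$ python -c 'import arvados; print(arvados.api("v1").users().current().execute()["uuid"])'
</pre>
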
h3. 1.2 Submit job to Arvados

First, check: Does your command work locally?

    $ grep -H -n TGGAAGT *.fa

... !{width:20em}grep-fasta.png!:grep-fasta.png

(If you want to follow along and don't have FASTA files, use the ones here: https://workbench.qr1hi.arvadosapi.com/collections/qr1hi-4zz18-0o2bt8216d7trrw )

If so, then submit it to Arvados using @arv-run@:

    $ arv-run grep -H -n TGGAAGT \< *.fa \> output.txt

* This bash command stores the results as @output.txt@
* Note that because @grep@ exits with an error status when it finds no matches, Arvados will report this pipeline as **failed** if grep does not find anything, and no output will appear on Arvados

Your dataset is uploaded to Arvados if it wasn't there already (which may take a while if it's a large dataset), your @grep@ job is submitted to run on the Arvados cluster, and you get the results in a few minutes (stored inside @output.txt@ in Arvados). If you go to Workbench at http://cloud.curoverse.com, you will see the pipeline running. It may take a few minutes for Arvados to spool up a node, provision it, and run your job. The robots are working hard for you, grab a cup of coffee.

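Once the pipeline finishes, the output collection shows up in Workbench, and you can also pull it down to your machine with the Python SDK tools from section 1.1. A sketch (the collection UUID here is a placeholder; use the one shown for your job's output):
<pre>
$ arv-get qr1hi-4zz18-xxxxxxxxxxxxxxx/output.txt .   #copy output.txt into the current directory
$ cat output.txt
</pre>
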
(Lost? See http://doc.arvados.org/user/topics/arv-run.html )

h3. 1.3 However

If your pipeline looks more like this:

... !{width: 50%}https://arvados.org/attachments/download/428/provenance_graph_full.png!:https://arvados.org/attachments/download/428/provenance_graph_full.png
... _yes, that is a screenshot of an actual pipeline graph auto-generated by Arvados_

then @arv-run@ is not powerful enough. Here we gooooo.

h2. 2. In Short

**Estimated reading time: 1 hour.**

You must write a **pipeline template** that describes your pipeline to Arvados.

h3. 2.1 VM (Virtual Machine) Access

Note: You'll need the Arvados set of command-line tools to follow along. The easiest way to get started is to get access to a Virtual Machine (VM) with all the tools pre-installed.

# Go to http://cloud.curoverse.com
# Login with a Google account (you may need to click login a few times if you hit multiple redirects)
# Click in the upper right on your account name -> Manage Account
# Hit the "Request shell access" button under Manage Account in Workbench.

h3. 2.2 Pipeline Template Example

Here is what a simple pipeline template looks like, where the output of program A is provided as input to program B. We'll explain what it all means shortly, but first, don't worry -- you'll never be creating a pipeline template from scratch. You'll always copy and modify an existing boilerplate one (yes, a template for the pipeline template! :])

    **pipelinetemplate.json**
    {
      "name": "Tiny Bash Script",
      "components": {
        "Create Two Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/createtwofiles.sh"
            ]
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        },
        "Merge Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/mergefiles.sh",
              "$(input)"
            ],
            "input": {
              "output_of": "Create Two Files"
            }
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        }
      }
    }

h2. 3. simple and sweet port-a-pipeline example

Okay, let's dig into what's going on.

h3. 3.1 the setup

We'll port an artificially simple pipeline which involves just two short bash scripts, described as "A" and "B" below:

**Script A. Create two files**
Input: nothing
Output: two files (@out1.txt@ and @out2.txt@)

**Script B. Merge two files into a single file**
Input: output of step A
Output: a single file (@output.txt@)

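Run by hand on a single machine, the whole toy pipeline boils down to something like the sketch below (the real scripts are written out in section 3.4; here the merge step just concatenates A's two output files):
<pre>
$ bash createtwofiles.sh              #step A: writes out1.txt and out2.txt
$ cat out1.txt out2.txt > output.txt  #step B: merge into a single file
</pre>
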
Or visually (ignore the long strings of gibberish in the rectangles for now):

... !{height:30em}choose_inputs-small.png!:choose_inputs-small.png

Here's what we've explained so far in the pipeline template:

    **pipelinetemplate.json**
    {
      **"name": "Tiny Bash Script",**
      "components": {
        **"Create Two Files": {**
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              **"$(job.srcdir)/crunch_scripts/createtwofiles.sh"**
            ]
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        },
        **"Merge Files": {**
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              **"$(job.srcdir)/crunch_scripts/mergefiles.sh",**
              "$(input)"
            ],
            **"input": {**
              **"output_of": "Create Two Files"**
            }
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        }
      }
    }

h3. **3.2 arv-what?**

Before we go further, let's take a quick step back. Arvados consists of two parts:

**Part 1. Keep** - I have all your files in the cloud!

You can access your files through your browser, using **Workbench**, or using the Arvados command-line (CLI) tools (link: http://doc.arvados.org/sdk/cli/index.html ).

Visually, in Workbench, the built-in Arvados web interface, this looks like:
... !{height:15em}port-a-pipeline-workbench-collection.png!:port-a-pipeline-workbench-collection.png

Or via the command-line interface:
... !{height:10em}CLI-keep.png!:CLI-keep.png

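For instance, uploading a directory of FASTA files, listing the resulting collection, and pulling one file back down might look roughly like this with the Python SDK tools (the collection UUID and file name below are placeholders):
<pre>
$ arv-put my_fasta_dir/                              #uploads to Keep and prints a collection UUID
$ arv-ls qr1hi-4zz18-xxxxxxxxxxxxxxx                 #list the files in that collection
$ arv-get qr1hi-4zz18-xxxxxxxxxxxxxxx/sample1.fa .   #copy one file back to the current directory
</pre>
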

**Part 2. Crunch** - I run all your scripts in the cloud!

Crunch both dispatches jobs and provides version control for your pipelines.

You describe your workflow to Crunch using **pipeline templates**. Pipeline templates describe a pipeline ("workflow") by defining a set of pipeline components that represent each step in the workflow. The definition of each component includes the job script to run, the environment (e.g. Docker image) in which to run it, its configurable parameters, and the input data that it requires. Input data can be hard-coded in a pipeline template to a specific Keep content address, can be left to be configured at pipeline instantiation, or can be referenced as the "output_of" another component within the pipeline template. By referencing the input data for one component as the output of another component in the pipeline, a high-level workflow graph is formed, which implicitly tells Arvados in which order the components should be run.

... !{width:20em}provenance_graph_detail.png!:provenance_graph_detail.png
... _Each task starts when all its inputs have been created_

Once you save a pipeline template in Arvados, you run it by creating a pipeline instance that lists the specific inputs you'd like to use. The pipeline's final output(s) will be saved in a project you specify.

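Workbench's **Run** button (shown in section 3.4) is the easiest way to do this. The CLI of this era could also kick off an instance from a saved template; the flags have varied between releases, so treat this as a rough sketch and check @arv pipeline run --help@ on your shell node first:
<pre>
$ arv pipeline run --template qr1hi-p5p6p-xxxxxxxxxxxxxxx   #placeholder template UUID
</pre>
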
Concretely, a pipeline template describes:

* **data inputs** - specified as Keep content addresses
* **job scripts** - stored in a Git version control repository and referenced by a commit hash
* **parameters** - which, along with the data inputs, can have default values or can be filled in later when the pipeline is actually run
* **the execution environment** - stored in Docker images and referenced by Docker image name

**What is Docker?** Docker allows Arvados to replicate the execution environment your tools need. You install whatever bioinformatics tools (bwa-mem, vcftools, etc.) you are using inside a Docker image, upload it to Arvados, and Arvados will recreate your environment on the compute nodes in the cloud.

**Protip:** Install stable external tools in Docker. Put your own scripts in a Git repository. This is because each Docker image is about 1-5 GB, so each new Docker image takes a while to upload (30 minutes) if you are not using Arvados on a local cluster. In the future, we hope to use small diff files describing just the changes made to a Docker image instead of the full Docker image. [Last updated 19 Feb 2015]

h3. 3.3 In More Detail

Alright, let's put that all together.

    **pipelinetemplate.json**
    {
      "name": "Tiny Bash Script",
      "components": {
        "Create Two Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/createtwofiles.sh" **#[1]**
            ]
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        },
        "Merge Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/mergefiles.sh", **#[2]**
              "$(input)"
            ],
            "input": {
              "output_of": "Create Two Files" **#[3]**
            }
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        }
      }
    }

**Explanation**

[1] **$(job.srcdir)** references the Git repository "in the cloud". Even though **run-command** is in nancy/crunch_scripts/ and is "magically found" by Arvados, INSIDE run-command you can't reference other files in the same repo as run-command without this magic variable.

Any output files created by this run-command will be automagically stored to Keep as an auto-named collection (which you can think of as a folder for now).

[2] Okay, so how does the next script know where to find the output of the previous job? run-command keeps track of the collections it has created, so we can feed that in as an argument to our next script. In this "command" section under "run-command", you can think of the commas as spaces. Thus, what this line is saying is "run mergefiles.sh on the previous output", or

  $ mergefiles.sh [directory with output of previous command]

[3] Here we set the script parameter "input" to point to the directory containing the output of the previous component, "Create Two Files".

(Lost? Try the hands-on example in the next section, or read more detailed documentation on the Arvados website:

* http://doc.arvados.org/user/tutorials/running-external-program.html
* http://doc.arvados.org/user/topics/run-command.html
* http://doc.arvados.org/api/schema/PipelineTemplate.html )

h3. 3.4 All hands on deck!

Okay, now that we know the rough shape of what's going on, let's get our hands dirty.

*From your local machine, log in to the Arvados virtual machine*

Single step:

  nrw@ *@nrw-local@* $ ssh nancy@lightning-dev4.shell.arvados

(Lost? See "SSH access to machine with Arvados commandline tools installed" http://doc.arvados.org/user/getting_started/ssh-access-unix.html )

**In VM, create pipeline template**

A few steps:

  nancy@ *@lightning-dev4.qr1hi@* :~$ arv create pipeline_template
Created object qr1hi-p5p6p-3p6uweo7omeq9e7
$ arv edit qr1hi-p5p6p-3p6uweo7omeq9e7 #Create the pipeline template as described above!

(Lost? See "Writing a pipeline template" http://doc.arvados.org/user/tutorials/running-external-program.html )

*In VM, set up git repository with run-command and our scripts*

A few steps:

  $ mkdir @~@/projects
$ cd @~@/projects
~/projects $ git clone git@git.qr1hi.arvadosapi.com:nancy.git

(Lost? Find your own git URL by going to https://workbench.qr1hi.arvadosapi.com/manage_account )

    ⤷Copy run-command & its dependencies into this repository's crunch_scripts directory
  $ git clone https://github.com/curoverse/arvados.git

(Lost? Visit https://github.com/curoverse/arvados )

  $ cd ./nancy
@~@/projects/nancy$ mkdir crunch_scripts
@~@/projects/nancy$ cd crunch_scripts
@~@/projects/nancy/crunch_scripts$ cp @~@/projects/arvados/crunch_scripts/run-command . #trailing dot!
@~@/projects/nancy/crunch_scripts$ cp -r @~@/projects/arvados/crunch_scripts/crunchutil . #trailing dot!

  $ cd @~@/projects/nancy/crunch_scripts

  $ vi createtwofiles.sh
⤷ $ cat createtwofiles.sh
#!/bin/bash
echo "Hello " > out1.txt
echo "Arvados!" > out2.txt

  $ vi mergefiles.sh
⤷ $ cat mergefiles.sh
#!/bin/bash *#[1]*
PREVOUTDIR=$1 *#[2]*
echo $TASK_KEEPMOUNT/$PREVOUTDIR *#[3]*
cat $TASK_KEEPMOUNT/$PREVOUTDIR/*.txt > output.txt

⤷ *Explanations*
*[1]* We use the @#!@ syntax to tell the shell what to execute this file with. This is called a "Shebang":https://en.wikipedia.org/wiki/Shebang_%28Unix%29

  ⤷To find the location of any particular tool, try using **which**
$ which python
/usr/bin/python
$ which bash
/bin/bash

*[2]* Here we give a human-readable name, @PREVOUTDIR@, to the first argument passed to @mergefiles.sh@ (referenced using the dollar-sign syntax, @$1@), which (referring back to the pipeline template) we defined as the directory containing the output of the previous job (the one that ran @createtwofiles.sh@).

(Lost about @$1@? Google "passing arguments to a bash script".)

*[3]* Using the environment variable @TASK_KEEPMOUNT@ lets us avoid making assumptions about where **Keep** is mounted. Arvados automatically sets @TASK_KEEPMOUNT@ on each worker node to the path where **Keep** is mounted there. (Lost? Visit http://doc.arvados.org/user/tutorials/tutorial-keep-mount.html )

<pre>$ chmod +x createtwofiles.sh mergefiles.sh # make these files executable</pre>

**Commit changes and push to remote**

A few steps:

  $ git status #check that everything looks ok
$ git add *
$ git commit -m "hello world-of-arvados scripts!"
$ git push

**Create Docker image with Arvados command-line tools and other tools we want**

> *Note:* This section assumes that you have Docker installed and usable under your user account.  However, because users with Docker access can defeat a lot of system security, it's not available on all Arvados shells.  If your Arvados VM doesn't provide you access to Docker, you have two options.  You can ask the site administrator to grant you access; or you can install Docker on your own GNU/Linux workstation, and upload the image to Arvados from there.  To learn how to do that, see the installation guides for "Docker Engine":https://docs.docker.com/ and the "Arvados Python SDK":http://doc.arvados.org/sdk/python/sdk-python.html, which includes the @arv-keepdocker@ tool to upload an image.

A few steps:

  $ docker pull arvados/jobs
$ docker run -ti -u root arvados/jobs /bin/bash

Now we are inside the Docker container.

    root@4fa648c759f3:/# apt-get update

    @  @⤷In the Docker image, install external tools that you don't expect to need to update often.
    For instance, we can install the wormtable python tool in this Docker image
    @  @# apt-get install libdb-dev
    @  @# pip install wormtable

    @  @  ⤷ Note: If you're installing from binaries, you should either
        1) Install in existing places where bash looks for programs (e.g. install in /usr/local/bin/cgatools); see the sketch just below this note.
        To see where bash looks, inspect the PATH variable.
          #echo $PATH
          /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        2) If you put them in a custom directory, remember to reference them by that full path in your scripts
        (e.g. spell out /home/nrw/local/bin/cgatools).
        Arvados will not pick up changes to @$PATH@ made in the @~/.bashrc@ configuration file inside the Docker image.

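For instance, option 1 might look something like this inside the container (the @/tmp/cgatools@ path is made up purely for illustration; substitute wherever you unpacked your tool):
<pre>
root@4fa648c759f3:/# cp /tmp/cgatools /usr/local/bin/   #put the binary somewhere already on $PATH
root@4fa648c759f3:/# chmod +x /usr/local/bin/cgatools
root@4fa648c759f3:/# which cgatools
/usr/local/bin/cgatools
</pre>
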
(Lost? See http://doc.arvados.org/user/topics/arv-docker.html )

  root@4fa648c759f3:/# exit

*Commit Docker image*
<pre>
$ docker commit 4fa648c759f3 nancy/cgatools-wormtable #Label the image thoughtfully
$ #For instance here I used the name of key tools I installed: cgatools & wormtable
</pre>

*Upload Docker image from your VM to Keep*

> *Note:* @arv-keepdocker@ saves the Docker image in @~/.cache/arvados/docker@ before uploading, so it can resume in case of interruption.  If the @/home@ partition is not big enough to hold the Docker image, you may get strange I/O errors about pipe closed or stdin full.  You can prevent this by making @~/.cache/arvados/docker@ a symlink to another directory you control where enough space is available.  An example command for that might look like: @ln -s /scratch/MYNAME/docker ~/.cache/arvados/docker@

<pre>
$ arv-keepdocker nancy/cgatools-wormtable #this takes a few minutes
$ arv-keepdocker #lists docker images in the cloud, so you can double-check what was uploaded
</pre>

**Run this pipeline!**
Go to Workbench and hit **Run**.
<pre>$ firefox http://qr1hi.arvadosapi.com</pre>
[!image: workbench with 'tiny bash script']

*Note: If no worker nodes are already provisioned, your job may take up to 10 minutes to queue up and start.* Behind the scenes, Arvados is requesting compute nodes for you, installing your Docker image, and otherwise setting up the environment on those nodes. Then Arvados will be ready to run your job. Be patient -- the wait time may seem frustrating for a trivial pipeline like this, but Arvados really excels at handling long and complicated pipelines with built-in data provenance and pipeline reproducibility.

h3. 3.5 Celebrate

Whew! Congratulations on porting your first pipeline to Arvados! Check out http://doc.arvados.org/user/topics/crunch-tools-overview.html to learn more about the different ways to port pipelines to Arvados and how to take full advantage of Arvados's features, like restarting pipelines from where they failed instead of from the beginning.

h2. 4. Debugging Tips and Pro-Tips

h3. **4.1 Pro-tips**

**Keep mounts are read-only right now. [19 March 2015]**
Need to 1) make some temporary directories, or 2) change directories away from wherever you started out in, but still upload the results to Keep?

For 1, explicitly use the @$HOME@ directory and make the temporary files there.
For 2, capture the present working directory with @$(pwd)@ at the beginning of your script, to record the directory where run-command will look for files to upload to Keep.

Here's an example:
<pre>
$ cat mergefiles.sh
  TMPDIR=$HOME #directory to make temporary files in
  OUTDIR=$(pwd) #directory to put output files in
  mkdir -p $TMPDIR #-p: no error if the directory already exists
  touch $TMPDIR/sometemporaryfile.txt #this file is deleted when the worker node is stopped
  touch $OUTDIR/someoutputfile.txt #this file will be uploaded to keep by run-command
</pre>

* make sure you point to the right repository, your own or arvados.
* make sure you pin the versions of your Python SDK, your Docker image, and your script (@script_version@), or you will not get reproducibility (see the sketch just below this list for grabbing a commit hash to pin).
* if you have a file you want to use as a crunch script, make sure it's in a crunch_scripts directory. Otherwise, Arvados will not find it, i.e. ~/path/to/git/repo/crunch_scripts/foo.py

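For example, to pin your script version, look up the exact commit you just pushed and paste that hash into @script_version@ in place of @master@ (a small sketch, run inside your repository checkout):
<pre>
$ cd ~/projects/nancy
$ git rev-parse HEAD   #prints the full commit hash; use it as the "script_version"
</pre>
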
h3. 4.2 Common log errors and reasons for pipelines to fail

Todo.

h3. 4.3 Miscellaneous Notes

Another way to avoid the read-only Keep mount problem is to use @task.vwd@, which uses symlinks from the output directory (which is writable) to the collection in Keep. If you can change your working directory to the output directory and do all your work there, you'll avoid the Keep read-only issue. (Lost? See http://doc.arvados.org/user/topics/run-command.html )

When indexing, i.e. tabix, bwa index, etc. The index file tends to be created in the same directory as your fastq file. In order to avoid this, use ^. There is no way to send the index file to another directory. If you figure out a way, please tell me.
478
479
"bash" "-c" could be your friend, it works sometimes, sometimes it doesnt. I don't have a good handle on why this works sometimes.
480
481
if you're trying to iterate over >1 files using the task.foreach, its important to know that run-command uses a m x n method of making groups. I dont think I can explain it right now, but it may not be exactly what you want and you can trip over it. (lost? see http://doc.arvados.org/user/topics/run-command.html )
482
483 42 Brett Smith
When trying to pair up reads, its hard to use run-command. You have to manipulate basename and hope your file names are foo.1 foo.2. base name will treat the group as foo (because you'll regex the groups as foo) and you can glob for foo.1 and foo.2. but if the file names are foo_1 and foo_2, you cant regex search them for foo becuase you'll get both names into a group and you'll be iterating through both of them twice, because of m x n.
484
485 1 Nancy Ouyang
Your scripts need to point to the right place where the file is. Its currently hard to figure out how to grep the file names, you have to do some magic through the collection api.
486
487
h2. 5. Learn More
488
489
To learn more, head over to the Arvados User Guide documentation online: http://doc.arvados.org/user/