{{>toc}}

h1. Port a Pipeline

Like any other tool, Arvados requires time to learn. Thus, we don't encourage using Arvados for initial development of analysis pipelines or exploratory research on small subsets of data, when each quick-and-dirty iteration takes minutes on a single machine. But for any computationally intensive work, Arvados offers a lot of benefits.

Okay, cool: provenance, reproducibility, easy scaling to gigabytes of data and lots of RAM, and quick evaluation of existing pipelines like lobSTR.

But what if you want these benefits when running your own pipelines?

In other words, how do you **port a pipeline** to Arvados?

h2. 1. Quick Way

First, do you just want to parallelize a single bash script?

Check if you can use @arv-run@. Take this @arv-run@ example, which searches multiple FASTA files in parallel and saves the results to Keep through shell redirection:

    $ arv-run grep -H -n GCTACCAAGTTT \< *.fa \> output.txt

Or this example, which runs a shell script:

    $ echo 'echo hello world' > hello.sh
    $ arv-run /bin/sh hello.sh

(Lost? Check out http://doc.arvados.org/user/topics/arv-run.html)

h3. 1.1 Install arv-run

See: http://doc.arvados.org/sdk/python/sdk-python.html and http://doc.arvados.org/user/reference/api-tokens.html, or in short below:

<pre>
$ sudo apt-get install python-pip python-dev libattr1-dev libfuse-dev pkg-config python-yaml
$ sudo pip install --pre arvados-python-client
</pre>

(Lost? See http://doc.arvados.org/sdk/python/sdk-python.html )
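
A quick sanity check that the install worked (the exact path may differ on your system):

<pre>
$ which arv-run
/usr/local/bin/arv-run
</pre>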

If you try to use @arv-run@ right now, it will complain about some settings you're missing. To fix that:

# Go to http://cloud.curoverse.com
# Log in with any Google account (you may need to click "log in" a few times if you hit multiple redirects from Google)
# Click in the upper right on your account name -> Manage Account
... !{height:10em}manage_account.png!:manage_account.png
# Optional: While you're here, click "send request for shell access", since that will give you shell access to a VM with all of the Arvados tools pre-installed.
1) !{height:10em}send_request.png!:send_request.png 2) !{height:10em}request_sent.png!:request_sent.png 3) !{height:10em}access_granted.png!:access_granted.png
# Copy the lines of text into your terminal, something like
<pre>
HISTIGNORE=$HISTIGNORE:'export ARVADOS_API_TOKEN=*'
export ARVADOS_API_TOKEN=sekritlongthing
export ARVADOS_API_HOST=qr1hi.arvadosapi.com
unset ARVADOS_API_HOST_INSECURE
</pre> ... !{height:5em}terminal_ssh.png!:terminal_ssh.png
# If you want this to persist across reboots, add the above lines to @~/.bashrc@ or your @~/.bash_profile@ (see the example below)

(Lost? See http://doc.arvados.org/user/reference/api-tokens.html )

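For example, here's one minimal way to append those settings to your @~/.bashrc@ (a sketch -- the token below is a placeholder; paste the actual lines Workbench gives you):

<pre>
$ cat >> ~/.bashrc <<'EOF'
HISTIGNORE=$HISTIGNORE:'export ARVADOS_API_TOKEN=*'
export ARVADOS_API_TOKEN=your-token-from-workbench
export ARVADOS_API_HOST=qr1hi.arvadosapi.com
unset ARVADOS_API_HOST_INSECURE
EOF
$ source ~/.bashrc   # pick up the new settings in the current shell
</pre>
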
h3. 1.2 Submit job to Arvados

First, check: Does your command work locally?

    $ grep -H -n TGGAAGT *.fa

... !{width:20em}grep-fasta.png!:grep-fasta.png

(If you want to follow along and don't have FASTA files -- use the ones here: https://workbench.qr1hi.arvadosapi.com/collections/qr1hi-4zz18-0o2bt8216d7trrw)

If so, then submit it to Arvados using @arv-run@:

    $ arv-run grep -H -n TGGAAGT \< *.fa \> output.txt

* This bash command stores the results as @output.txt@
* Note that due to the particulars of grep, Arvados will report this pipeline as **failed** if grep does not find anything, and no output will appear on Arvados

Your dataset is uploaded to Arvados if it wasn't there already (which may take a while if it's a large dataset), your @grep@ job is submitted to run on the Arvados cluster, and you get the results in a few minutes (stored inside @output.txt@ in Arvados). If you go to Workbench at http://cloud.curoverse.com, you will see the pipeline running. It may take a few minutes for Arvados to spool up a node, provision it, and run your job. The robots are working hard for you, grab a cup of coffee.
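
Once the run finishes, you can grab the result from the command line as well as from Workbench. A sketch (the collection locator is a placeholder; use the one reported for your run):

<pre>
$ arv-get qr1hi-4zz18-xxxxxxxxxxxxxxx/output.txt .   # copy output.txt from Keep into the current directory
$ cat output.txt
</pre>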

(Lost? See http://doc.arvados.org/user/topics/arv-run.html )

h3. 1.3 However

If your pipeline looks more like this...

... !{width: 50%}https://arvados.org/attachments/download/428/provenance_graph_full.png!:https://arvados.org/attachments/download/428/provenance_graph_full.png
... _yes, that is a screenshot of an actual pipeline graph auto-generated by Arvados_

...then @arv-run@ is not powerful enough. Here we gooooo.

h2. 2. In Short

**Estimated reading time: 1 hour.**

You must write a **pipeline template** that describes your pipeline to Arvados.

h3. 2.1 VM (Virtual Machine) Access

Note: You'll need the Arvados set of command-line tools to follow along. The easiest way to get started is to get access to a Virtual Machine (VM) with all the tools pre-installed.

# Go to http://cloud.curoverse.com
# Log in with a Google account (you may need to click "log in" a few times if you hit multiple redirects)
# Click in the upper right on your account name -> Manage Account
# Hit the "Request shell access" button under Manage Account in Workbench.

h3. 2.2 Pipeline Template Example

Here is what a simple pipeline template looks like, where the output of program A is provided as input to program B. We'll explain what it all means shortly, but first, don't worry -- you'll never be creating a pipeline template from scratch. You'll always copy and modify an existing boilerplate one (yes, a template for the pipeline template! :])

    **pipelinetemplate.json**
    {
      "name": "Tiny Bash Script",
      "components": {
        "Create Two Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/createtwofiles.sh"
            ]
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        },
        "Merge Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/mergefiles.sh",
              "$(input)"
            ],
            "input": {
              "output_of": "Create Two Files"
            }
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        }
      }
    }

h2. 3. simple and sweet port-a-pipeline example

Okay, let's dig into what's going on.

h3. 3.1 The setup

We'll port an artificially simple pipeline which involves just two short bash scripts, described as "A" and "B" below:

**Script A. Create two files**
Input: nothing
Output: two files (@out1.txt@ and @out2.txt@)

**Script B. Merge two files into a single file**
Input: output of step A
Output: a single file (@output.txt@)

Or visually (ignore the long strings of gibberish in the rectangles for now):

... !{height:30em}choose_inputs-small.png!:choose_inputs-small.png

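Before involving Arvados at all, the whole pipeline boils down to something you could run locally in a couple of lines (a sketch; the actual scripts are written out in section 3.4):

<pre>
$ ./createtwofiles.sh                  # Script A: writes out1.txt and out2.txt
$ cat out1.txt out2.txt > output.txt   # Script B, conceptually: merge the two files
</pre>
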
Here's what we've explained so far in the pipeline template:

    **pipelinetemplate.json**
    {
      **"name": "Tiny Bash Script",**
      "components": {
        **"Create Two Files": {**
          "script": "run-command",
          "script_version": "master",
          "repository": "arvados",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/ *createtwofiles.sh* "
            ]
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        },
        **"Merge Files": {**
          "script": "run-command",
          "script_version": "master",
          "repository": "arvados",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/ *mergefiles.sh* ",
              "$(input)"
            ],
            **"input": {**
              **"output_of": "Create Two Files"**
            }
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        }
      }
    }

h3. **3.2 arv-what?**

Before we go further, let's take a quick step back. Arvados consists of two parts:

**Part 1. Keep** - I have all your files in the cloud!

You can access your files through your browser, using **Workbench**, or using the Arvados command-line (CLI) tools (link: http://doc.arvados.org/sdk/cli/index.html ).

Visually, in Workbench, the built-in Arvados web interface, this looks like:
... !{height:15em}port-a-pipeline-workbench-collection.png!:port-a-pipeline-workbench-collection.png

Or via the command-line interface:
... !{height:10em}CLI-keep.png!:CLI-keep.png

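For a concrete feel, here's roughly what a Keep CLI session looks like (a sketch; the collection locator below is a placeholder):

<pre>
$ arv-put reads.fa                                 # upload a file; prints the locator of the new collection
$ arv-ls qr1hi-4zz18-xxxxxxxxxxxxxxx               # list the files inside a collection
$ arv-get qr1hi-4zz18-xxxxxxxxxxxxxxx/reads.fa .   # download a file from Keep to the current directory
</pre>
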
**Part 2. Crunch** - I run all your scripts in the cloud!

Crunch both dispatches jobs and provides version control for your pipelines.

You describe your workflow to Crunch using **pipeline templates**. Pipeline templates describe a pipeline ("workflow"), the type of inputs each step in the pipeline requires, and how data flows from step to step. You provide a high-level description of how data flows through the pipeline -- for example, the outputs of programs A and B are provided as input to program C -- and let Crunch take care of the details of starting the individual programs at the right time with the inputs you specified.

... !{width:20em}provenance_graph_detail.png!:provenance_graph_detail.png
... _Each task starts when all its inputs have been created_

Once you save a pipeline template in Arvados, you run it by creating a pipeline instance that lists the specific inputs you'd like to use. The pipeline's final output(s) will be saved in a project you specify.

Concretely, a pipeline template describes:

* **data inputs** - specified as Keep content addresses
* **job scripts** - stored in a Git version control repository and referenced by a commit hash
* **parameters** - which, along with the data inputs, can have default values or can be filled in later when the pipeline is actually run
* **the execution environment** - stored in Docker images and referenced by Docker image name

**What is Docker?** Docker allows Arvados to replicate the execution environment your tools need. You install whatever bioinformatics tools (bwa-mem, vcftools, etc.) you are using inside a Docker image, upload it to Arvados, and Arvados will recreate your environment on computers in the cloud.

**Protip:** Install stable external tools in Docker. Put your own scripts in a Git repository. This is because each Docker image is about 500 MB, so a new Docker image takes a while (roughly 30 minutes) to upload if you are not using Arvados on a local cluster. In the future, we hope to use small diff files describing just the changes made to a Docker image instead of the full Docker image. [Last updated 19 Feb 2015]

h3. 3.3 In More Detail

Alright, let's put that all together.

    **pipelinetemplate.json**
    {
      "name": "Tiny Bash Script",
      "components": {
        "Create Two Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/createtwofiles.sh" **#[1]**
            ]
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        },
        "Merge Files": {
          "script": "run-command",
          "script_version": "master",
          "repository": "nancy",
          "script_parameters": {
            "command": [
              "$(job.srcdir)/crunch_scripts/mergefiles.sh", **#[2]**
              "$(input)"
            ],
            "input": {
              "output_of": "Create Two Files" **#[3]**
            }
          },
          "runtime_constraints": {
            "docker_image": "nancy/cgatools-wormtable"
          }
        }
      }
    }

**Explanation**

[1] **$(job.srcdir)** references the git repository "in the cloud". Even though **run-command** is in nancy/crunch_scripts/ and is "magically found" by Arvados, INSIDE run-command you can't reference other files in the same repo as run-command without this magic variable.

Any output files produced by this run-command will automagically be stored in Keep as an auto-named collection (which you can think of as a folder for now).

[2] Okay, so how does the next script know where to find the output of the previous job? run-command keeps track of the collections it has created, so we can feed that in as an argument to our next script. In this "command" section under "run-command", you can think of the commas as spaces. Thus, what this line is saying is "run mergefiles.sh on the previous input", or

  $ mergefiles.sh [directory with output of previous command]

[3] Here we set the variable "input" to point to the directory with the output of the previous command, "Create Two Files".

(Lost? Try the hands-on example in the next section, or read more detailed documentation on the Arvados website:

* http://doc.arvados.org/user/tutorials/running-external-program.html
* http://doc.arvados.org/user/topics/run-command.html
* http://doc.arvados.org/api/schema/PipelineTemplate.html )

h3. 3.4 All hands on deck!

Okay, now that we know the rough shape of what's going on, let's get our hands dirty.

**From your local machine, log in to the Arvados virtual machine**

Single step:

  nrw@ *@nrw-local@* $ ssh nancy@lightning-dev4.shell.arvados

(Lost? See "SSH access to machine with Arvados commandline tools installed" http://doc.arvados.org/user/getting_started/ssh-access-unix.html )

**In VM, create pipeline template**

A few steps:

  nancy@ *@lightning-dev4.qr1hi@* :~$ arv create pipeline_template
  Created object qr1hi-p5p6p-3p6uweo7omeq9e7
  $ arv edit qr1hi-p5p6p-3p6uweo7omeq9e7 #Create the pipeline template as described above (paste in the JSON from section 3.3, using your own repository name)

(Lost? See "Writing a pipeline template" http://doc.arvados.org/user/tutorials/running-external-program.html )

**In VM, set up git repository with run-command and our scripts**

A few steps:

  $ mkdir @~@/projects
  $ cd @~@/projects
  *@~/projects@* $ git clone git@git.qr1hi.arvadosapi.com:nancy.git

(Lost? Find your own git URL by going to https://workbench.qr1hi.arvadosapi.com/manage_account )

    ⤷ Copy run-command & its dependencies into this repository's crunch_scripts directory
  $ git clone https://github.com/curoverse/arvados.git

(Lost? Visit https://github.com/curoverse/arvados )

  @  @$ cd ./nancy
  *@~/projects/nancy@* $ mkdir crunch_scripts
  *@~/projects/nancy@* $ cd crunch_scripts
  *@~/projects/nancy/crunch_scripts@* $ cp @~@/projects/arvados/crunch_scripts/run-command . #trailing dot!
  *@~/projects/nancy/crunch_scripts@* $ cp -r @~@/projects/arvados/crunch_scripts/crunchutil . #trailing dot!

  @  @$ cd ~/projects/nancy/crunch_scripts

  @  @$ vi createtwofiles.sh
    ⤷ $ cat createtwofiles.sh
    #!/bin/bash
    echo "Hello " > out1.txt
    echo "Arvados!" > out2.txt

  @  @$ vi mergefiles.sh
    ⤷ $ cat mergefiles.sh
      #!/bin/bash *#[1]*
      PREVOUTDIR=$1 *#[2]*
      echo $TASK_KEEPMOUNT/by_id/$PREVOUTDIR *#[3]*
      cat $TASK_KEEPMOUNT/by_id/$PREVOUTDIR/*.txt > output.txt

⤷ *Explanations*
*[1]* We use the @#!@ syntax to let bash know what to execute this file with

  ⤷ To find the location of any particular tool, try using **which**
    $ which python
    /usr/bin/python
    $ which bash
    /bin/bash

*[2]* Here we give a human-readable name, @PREVOUTDIR@, to the first argument (referenced using the dollar-sign syntax, i.e. @$1@) given to @mergefiles.sh@, which (referring back to the pipeline template) we defined as the directory containing the output of the previous job (the one that ran @createtwofiles.sh@).

(Lost about @$1@? Google "passing arguments to the bash script".)

*[3]* Using the environment variable @TASK_KEEPMOUNT@ allows us to not make assumptions about where **Keep** is mounted. Arvados automatically replaces @TASK_KEEPMOUNT@ with the location where **Keep** is mounted on each worker node. (Lost? Visit http://doc.arvados.org/user/tutorials/tutorial-keep-mount.html )

<pre>$ chmod +x createtwofiles.sh mergefiles.sh # make these files executable</pre>


**Commit changes and push to remote**

A few steps:

  $ git status #check that everything looks ok
  $ git add *
  $ git commit -m "hello world-of-arvados scripts!"
  $ git push

**Install Docker**

A few steps:

  $ sudo apt-get install docker.io
  $ sudo groupadd docker
  $ sudo gpasswd -a $USER docker #in my case, I replace $USER with "nancy"
  $ sudo service docker restart
  $ exec su -l $USER   #if you don't want to log out and back in or spawn a new shell, this will restart your shell

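A quick, optional sanity check that Docker now works without sudo after the group change:

<pre>
$ docker ps   # should print an empty table of running containers, with no permission error
</pre>
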
**Make Docker less sad about running out of space on the VM**

A few steps:

  $ sudo mkdir /data/docker
  $ sudo vi /etc/default/docker
  @  @⤷ $ cat /etc/default/docker
      DOCKER_OPTS="--graph='/data/docker'"
      export TMPDIR="/data/docker"

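After editing that file, restart Docker so it picks up the new storage location (reusing the restart command from the install step):

<pre>
$ sudo service docker restart
</pre>
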
**Make Arvados less sad about running out of space on the VM**

A few steps:

  $ sudo mkdir /data/docker-cache
  $ sudo chown nancy:nancy /data/docker-cache
  $ ln -s /data/docker-cache docker

**Create Docker image with Arvados command-line tools and other tools we want**

A few steps:

  $ docker pull arvados/jobs
  $ docker run -ti arvados/jobs /bin/bash

Now we are in the docker image.

    root@4fa648c759f3:/# apt-get update

    @  @⤷ In the docker image, install external tools that you don't expect to need to update often.
    For instance, we can install the wormtable python tool in this docker image:
    @  @# apt-get install libdb-dev
    @  @# pip install wormtable

    @  @  ⤷ Note: If you're installing from binaries, you should either
        1) Install in existing places where bash looks for programs (e.g. install in /usr/local/bin/cgatools).
        To see where bash looks, inspect the PATH variable.
          # echo $PATH
          /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
        2) If you put them in a custom directory, remember to reference them as such in your scripts
        (e.g. spell out /home/nrw/local/bin/cgatools).
        Arvados will not respect modifying the $PATH variable by using the ~/.bashrc configuration file in the docker image.

(Lost? See http://doc.arvados.org/user/topics/arv-docker.html )

  root@4fa648c759f3:/# exit


**Commit Docker image**
<pre>
$ docker commit 4fa648c759f3 nancy/cgatools-wormtable #Label the image thoughtfully
$ #For instance, here I used the names of the key tools I installed: cgatools & wormtable
</pre>

**Upload Docker image from your VM to Keep**
<pre>
$ arv keep docker nancy/cgatools-wormtable #this takes a few minutes
$ arv keep docker #lists docker images in the cloud, so you can double-check what was uploaded
</pre>


**Run this pipeline!**

Go to Workbench and hit **Run**.
<pre>$ firefox https://workbench.qr1hi.arvadosapi.com</pre>
[!image: workbench with 'tiny bash script']

*Note: If no worker nodes are already provisioned, your job may take up to 10 minutes to queue up and start.* Behind the scenes, Arvados is requesting compute nodes for you, installing your Docker image, and otherwise setting up the environment on those nodes. Then Arvados will be ready to run your job. Be patient -- the wait time may seem frustrating for a trivial pipeline like this, but Arvados really excels at handling long and complicated pipelines with built-in data provenance and pipeline reproducibility.

h3. 3.5 Celebrate

Whew! Congratulations on porting your first pipeline to Arvados! Check out http://doc.arvados.org/user/topics/crunch-tools-overview.html to learn more about the different ways to port pipelines to Arvados and how to take full advantage of Arvados's features, like restarting pipelines from where they failed instead of from the beginning.

h2. 4. Debugging Tips and Pro-Tips

h3. **4.1 Pro-tips**

**Keep mounts are read-only right now. [19 March 2015]**
Need to 1) make some temporary directories or 2) change directories away from wherever you started out in, but still upload the results to Keep?

For 1, explicitly use the $HOME directory and make the temporary files there.
For 2, capture the present working directory, @$(pwd)@, at the beginning of your script to record the directory where run-command will look for files to upload to Keep.

Here's an example:
<pre>
$ cat mergefiles.sh
  TMPDIR=$HOME #directory to make temporary files in
  OUTDIR=$(pwd) #directory to put output files in
  mkdir -p $TMPDIR
  touch $TMPDIR/sometemporaryfile.txt #this file is deleted when the worker node is stopped
  touch $OUTDIR/someoutputfile.txt #this file will be uploaded to keep by run-command
</pre>

* Make sure you point to the right repository: your own or arvados.
* Make sure you pin the versions of your Python SDK, Docker image, and script, or you will not get reproducibility.
* If you have a file you want to use as a crunch script, make sure it's in a @crunch_scripts@ directory; otherwise, Arvados will not find it. I.e. @~/path/to/git/repo/crunch_scripts/foo.py@

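For reference, a sketch of a repository layout that satisfies the last point (matching what we built in section 3.4):

<pre>
~/projects/nancy/                 # your Arvados git repository
└── crunch_scripts/
    ├── run-command               # copied from the arvados repo
    ├── crunchutil/               # run-command's helper module, also copied
    ├── createtwofiles.sh
    └── mergefiles.sh             # your own crunch scripts live here
</pre>
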
h3. 4.2 Common log errors and reasons for pipelines to fail

Todo.

h3. 4.3 Miscellaneous Notes

Another way to avoid the read-only Keep mount problem is to use @task.vwd@, which uses symlinks from the (writable) output directory to the collection in Keep. If you can change your working directory to the output directory and do all your work there, you'll avoid the Keep read-only issue. (Lost? See http://doc.arvados.org/user/topics/run-command.html )

When indexing (e.g. tabix, bwa index, etc.), the index file tends to be created in the same directory as your FASTQ file. In order to avoid this, use ^. There is no way to send the index file to another directory. If you figure out a way, please tell me.

@"bash" "-c"@ could be your friend; it works sometimes and sometimes it doesn't. I don't have a good handle on why it works only sometimes.

If you're trying to iterate over more than one file using @task.foreach@, it's important to know that run-command uses an m x n method of making groups. I don't think I can explain it right now, but it may not be exactly what you want, and you can trip over it. (Lost? See http://doc.arvados.org/user/topics/run-command.html )

When trying to pair up reads, it's hard to use run-command. You have to manipulate @basename@ and hope your file names are @foo.1@ and @foo.2@. basename will treat the group as @foo@ (because you'll regex the groups as @foo@), and you can glob for @foo.1@ and @foo.2@. But if the file names are @foo_1@ and @foo_2@, you can't regex search them for @foo@, because you'll get both names into a group and you'll be iterating through both of them twice, because of m x n.

Your scripts need to point to the right place where the file is. It's currently hard to figure out how to grep the file names; you have to do some magic through the collection API.

h2. 5. Learn More

To learn more, head over to the Arvados User Guide documentation online: http://doc.arvados.org/user/