Pipeline Optimization » History » Version 5

Bryan Cosca, 04/14/2016 07:10 PM

h1. Pipeline Optimization

h2. Crunchstat Summary

crunchstat-summary is an Arvados tool that helps you choose optimal configurations for Arvados jobs and pipeline instances. It helps you choose the "runtime_constraints":http://doc.arvados.org/api/schema/Job.html specified under each job in the pipeline template, and it graphs general statistics for a job, for example CPU usage, RAM usage, and Keep network traffic over the duration of the job.

h3. How to install crunchstat-summary

<pre>
$ git clone https://github.com/curoverse/arvados.git
$ cd arvados/tools/crunchstat-summary/
$ python setup.py build
$ python setup.py install --user
</pre>

h3. How to use crunchstat-summary

<pre>
$ ./bin/crunchstat-summary --help
usage: crunchstat-summary [-h]
                          [--job UUID | --pipeline-instance UUID | --log-file LOG_FILE]
                          [--skip-child-jobs] [--format {html,text}]
                          [--verbose]

Summarize resource usage of an Arvados Crunch job

optional arguments:
  -h, --help            show this help message and exit
  --job UUID            Look up the specified job and read its log data from
                        Keep (or from the Arvados event log, if the job is
                        still running)
  --pipeline-instance UUID
                        Summarize each component of the given pipeline
                        instance
  --log-file LOG_FILE   Read log data from a regular file
  --skip-child-jobs     Do not include stats from child jobs
  --format {html,text}  Report format
  --verbose, -v         Log more information (once for progress, twice for
                        debug)
</pre>
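
As a rough sketch of scripting the options above, the following Python helper builds a crunchstat-summary command line and saves the report to a file. It assumes crunchstat-summary is on your PATH and writes its report to standard output. The helper names and the job UUID are placeholders for illustration, not part of the tool.

```python
import subprocess


def build_summary_command(job_uuid, report_format="html"):
    """Return an argv list that uses only flags shown in the --help output."""
    if report_format not in ("html", "text"):
        raise ValueError("report_format must be 'html' or 'text'")
    return ["crunchstat-summary", "--job", job_uuid, "--format", report_format]


def save_report(job_uuid, output_path, report_format="html"):
    """Run crunchstat-summary and write its report to output_path.

    Assumption: the report is written to standard output.
    """
    with open(output_path, "wb") as report_file:
        subprocess.check_call(build_summary_command(job_uuid, report_format),
                              stdout=report_file)


# Example (placeholder UUID, replace it with a real job UUID):
# save_report("zzzzz-8i9sb-xxxxxxxxxxxxxxx", "report.html")
```

Building the argument list separately from running it makes it easy to check the command before launching it against a real job.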

In text mode (@--format text@), use the node recommendations and the Keep cache size it reports.

In HTML mode (@--format html@):

* Check whether you are CPU bound or I/O bound.
* Check whether tasks are behaving unexpectedly, as in the GATK Queue case.

When to pipe and when to write to Keep:
In general, writing straight to Keep pays off. If you run crunchstat-summary with @--format html@ and see Keep I/O stopping once in a while, you are CPU bound. If you see CPU usage level off while keep-read or keep-write takes too long, you are I/O bound.

h3. Choosing the right number of jobs

Each job must output a collection, so if you don't want to output a file, then

h2. Job Optimization

h3. How to optimize the number of tasks when you don't have native multithreading

Tools like GATK have native multithreading, which you enable by passing a thread-count option (for example, @-t@).
Tools like VarScan and FreeBayes don't have native multithreading, so you need a workaround. Some tools accept a @-L@ / @--intervals@ option that restricts the work to certain loci. If you have a BED file you can split reads on, you can create a new task per interval.
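
A minimal sketch of the per-interval approach, assuming a BED file with at least three columns and a tool that accepts @-L@ with chrom:start-end intervals. The function names and the some-caller command are hypothetical, and queueing each command as a separate task is left out.

```python
def bed_to_intervals(bed_lines):
    """Convert BED records to chrom:start-end interval strings."""
    intervals = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines, comments, and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        # BED starts are 0-based and ends are exclusive. This assumes the
        # tool expects 1-based inclusive coordinates.
        intervals.append("{}:{}-{}".format(chrom, int(start) + 1, int(end)))
    return intervals


def commands_per_interval(base_command, intervals):
    """Return one command (argv list) per interval, each with its own -L."""
    return [list(base_command) + ["-L", interval] for interval in intervals]


if __name__ == "__main__":
    bed = ["track name=targets", "chr1\t0\t1000", "chr2\t499\t2000\tgeneA"]
    # some-caller and input.bam are stand-ins for a real tool and input.
    for command in commands_per_interval(["some-caller", "input.bam"],
                                         bed_to_intervals(bed)):
        print(" ".join(command))
```

Each resulting command covers one interval, so each can run as its own task and the per-interval outputs can be merged afterwards.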

h3. Piping between tools or writing to a tmpdir

Piping between tools has sometimes been shown to be faster than writing to and reading from disk. Feel free to pipe your tools together, for example using subprocess.PIPE in the "Python subprocess module":https://docs.python.org/2/library/subprocess.html
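
A minimal sketch of connecting two tools with subprocess.PIPE. The pipe_commands helper is hypothetical, and the two Python one-liners in the usage section are stand-ins for real tools; substitute your own commands.

```python
import subprocess
import sys


def pipe_commands(producer_cmd, consumer_cmd):
    """Stream producer_cmd's stdout into consumer_cmd's stdin.

    Returns the consumer's stdout as bytes. Raises CalledProcessError if
    either command exits with a non-zero status.
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout,
                                stdout=subprocess.PIPE)
    # Close our copy of the pipe so the producer sees a broken pipe if the
    # consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer_status = producer.wait()
    if producer_status != 0:
        raise subprocess.CalledProcessError(producer_status, producer_cmd)
    if consumer.returncode != 0:
        raise subprocess.CalledProcessError(consumer.returncode, consumer_cmd)
    return output


if __name__ == "__main__":
    # Stand-ins for real tools: one emits lines, the other sorts them.
    producer = [sys.executable, "-c", "print('chr2'); print('chr1')"]
    consumer = [sys.executable, "-c",
                "import sys; sys.stdout.writelines(sorted(sys.stdin))"]
    print(pipe_commands(producer, consumer).decode())
```

Data flows directly between the two processes, so no intermediate file is written to disk. Checking both exit codes matters because a failure in the first tool would otherwise go unnoticed.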