Pipeline Optimization » History » Version 3

Bryan Cosca, 04/14/2016 06:03 PM

h1. Pipeline Optimization

h2. Crunchstat Summary

Crunchstat-summary is an Arvados tool that helps you choose optimal configurations for Arvados jobs and pipeline instances. It helps you choose the "runtime_constraints":http://doc.arvados.org/api/schema/Job.html specified in the pipeline template under each job, and it graphs general statistics for the job, such as CPU usage and Keep network traffic, over the duration of the job.

h3. How to install crunchstat-summary

<pre>
$ git clone https://github.com/curoverse/arvados.git
$ cd arvados/tools/crunchstat-summary/
$ python setup.py build
$ python setup.py install --user
</pre>

h3. How to use crunchstat-summary

<pre>
$ ./bin/crunchstat-summary --help
usage: crunchstat-summary [-h]
                          [--job UUID | --pipeline-instance UUID | --log-file LOG_FILE]
                          [--skip-child-jobs] [--format {html,text}]
                          [--verbose]

Summarize resource usage of an Arvados Crunch job

optional arguments:
  -h, --help            show this help message and exit
  --job UUID            Look up the specified job and read its log data from
                        Keep (or from the Arvados event log, if the job is
                        still running)
  --pipeline-instance UUID
                        Summarize each component of the given pipeline
                        instance
  --log-file LOG_FILE   Read log data from a regular file
  --skip-child-jobs     Do not include stats from child jobs
  --format {html,text}  Report format
  --verbose, -v         Log more information (once for progress, twice for
                        debug)
</pre>

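As a rough illustration of how the options in the help text combine, the sketch below assembles a command line in Python. The helper name @build_crunchstat_command@ and the job UUID are hypothetical placeholders; only the flags themselves come from the help output above.

```python
import shlex


def build_crunchstat_command(job=None, pipeline_instance=None, log_file=None,
                             report_format="text", skip_child_jobs=False):
    """Build an argv list for crunchstat-summary from the flags shown in --help."""
    # The help text lists --job, --pipeline-instance and --log-file as
    # alternatives, so accept at most one of them.
    sources = [("--job", job),
               ("--pipeline-instance", pipeline_instance),
               ("--log-file", log_file)]
    chosen = [(flag, value) for flag, value in sources if value is not None]
    if len(chosen) > 1:
        raise ValueError("pass at most one of job, pipeline_instance, log_file")
    if report_format not in ("html", "text"):
        raise ValueError("report_format must be 'html' or 'text'")

    argv = ["crunchstat-summary"]
    for flag, value in chosen:
        argv.extend([flag, value])
    argv.extend(["--format", report_format])
    if skip_child_jobs:
        argv.append("--skip-child-jobs")
    return argv


# Placeholder UUID for illustration; substitute one from your own cluster.
argv = build_crunchstat_command(job="zzzzz-8i9sb-0123456789abcde",
                                report_format="html")
print(" ".join(shlex.quote(arg) for arg in argv))
```

Once the tool is installed, a list like @argv@ could be handed to @subprocess.run@, or the printed line could be pasted into a shell.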
Text mode (--format text):
Use it for node recommendations and Keep cache size.

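To show where such recommendations would end up, here is a minimal sketch of a pipeline template component with @runtime_constraints@ filled in. The numbers, the component name, and the script and repository names are all made-up placeholders; real values should come from your own report.

```python
import json

# Hypothetical values for illustration only; take real ones from your report.
runtime_constraints = {
    "min_cores_per_node": 4,
    "min_ram_mb_per_node": 8192,
    "keep_cache_mb_per_task": 512,
}

# One component as it might appear in a pipeline template. The script,
# repository and component names below are placeholders.
component = {
    "script": "run-my-tool.py",
    "script_version": "master",
    "repository": "my-repository",
    "script_parameters": {},
    "runtime_constraints": runtime_constraints,
}

template = {"components": {"my-job": component}}
print(json.dumps(template, indent=2, sort_keys=True))
```

Keeping the constraints in their own dictionary makes it easy to adjust them after each run without touching the rest of the component.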
HTML mode (--format html):
Check whether you're CPU bound or I/O bound.
Check whether tasks are behaving oddly, e.g. the GATK Queue case.

When to pipe and when to write to Keep:
In general, writing straight to Keep pays off. If you run crunchstat-summary --format html and see Keep I/O stopping once in a while, then you're CPU bound. If you see CPU usage level off while keep-read or keep-write takes too long, then you're I/O bound.

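The rule of thumb above can be written down as a small heuristic. This is only a sketch: the function name, the sample format, and both thresholds are assumptions made for illustration, and none of them come from crunchstat-summary itself.

```python
def classify_bottleneck(cpu_fraction, keep_bytes_per_sec,
                        busy_cpu=0.9, idle_io_share=0.2):
    """Rough CPU-bound vs I/O-bound guess from equally spaced samples.

    cpu_fraction: CPU use per sample, as a fraction of the allotted cores.
    keep_bytes_per_sec: Keep read plus write throughput per sample.
    busy_cpu and idle_io_share are illustrative thresholds, not tool values.
    """
    if not cpu_fraction or len(cpu_fraction) != len(keep_bytes_per_sec):
        raise ValueError("need two non-empty sample lists of equal length")

    mean_cpu = sum(cpu_fraction) / len(cpu_fraction)
    # Share of samples in which Keep I/O had stopped entirely.
    stalled = sum(1 for rate in keep_bytes_per_sec if rate == 0)
    idle_io = stalled / len(keep_bytes_per_sec)

    if mean_cpu >= busy_cpu and idle_io >= idle_io_share:
        return "cpu-bound"  # CPU is saturated while Keep I/O pauses
    if mean_cpu < busy_cpu and idle_io < idle_io_share:
        return "io-bound"   # CPU has headroom while Keep I/O never lets up
    return "unclear"


print(classify_bottleneck([0.95, 1.0, 0.98, 0.97], [5e6, 0, 4e6, 0]))
print(classify_bottleneck([0.30, 0.35, 0.30, 0.32], [9e6, 8e6, 9e6, 9e6]))
```

The "unclear" result is deliberate: when the samples fit neither pattern, looking at the HTML graph is likely more informative than any fixed threshold.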
h3. Choosing the right number of jobs

Each job must output a collection, so if you don't want to output a file, then 

h2. Job Optimization

h3. How to optimize the number of tasks when you don't have native multithreading

Tools like GATK have native multithreading, which you enable by passing a thread-count flag such as -t.

Tools like VarScan and FreeBayes don't have native multithreading, so you need to find a workaround. Some tools accept a -L (--intervals) option that restricts work to certain loci. If you have a BED file you can split on, you can create a new task per interval.

Example here.

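A minimal sketch of the splitting step, assuming a standard tab- or space-separated BED file with 0-based, half-open coordinates. The helper names are hypothetical, and the loop only prints the regions; in a real crunch script each region would be handed to its own task.

```python
def read_bed_intervals(lines):
    """Parse BED lines into (chrom, start, end) tuples, skipping header lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def to_region_strings(intervals):
    """Convert 0-based half-open BED intervals to 1-based chrom:start-end."""
    return ["{}:{}-{}".format(chrom, start + 1, end)
            for chrom, start, end in intervals]


# Inline sample data so the sketch runs without an input file.
bed_text = "chr1\t0\t1000\nchr1\t5000\t6000\nchr2\t100\t200\n"
regions = to_region_strings(read_bed_intervals(bed_text.splitlines()))

for task_number, region in enumerate(regions):
    # Each region would become the interval argument of one task.
    print(task_number, region)
```

One task per interval is the simplest split; if the BED file has many small intervals, grouping several regions per task may balance the work better.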
h3. Piping between tools or writing to a tmpdir