Pipeline Optimization » History » Version 3
Bryan Cosca, 04/14/2016 06:03 PM
h1. Pipeline Optimization

h2. Crunchstat Summary

Crunchstat-summary is an Arvados tool that helps you choose optimal configurations for Arvados jobs and pipeline instances. It helps you choose the "runtime_constraints":http://doc.arvados.org/api/schema/Job.html specified in the pipeline template under each job, and it graphs general statistics for the job, such as CPU usage and Keep network traffic over the duration of the job.
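
To make "specified in the pipeline template under each job" concrete, here is a minimal sketch of one template component with a @runtime_constraints@ block, built as a Python dict and printed as JSON. The template name, component name, script name, and every numeric value are placeholders chosen for illustration, not recommendations. The constraint keys are taken from the Job schema linked above, so check them against that page for your Arvados version.

```python
import json

# Hypothetical component entry for a pipeline template.
# "example-script", "example-job" and all numbers below are placeholders.
component = {
    "script": "example-script",
    "script_version": "master",
    "repository": "example-repository",
    "script_parameters": {},
    "runtime_constraints": {
        # Values you would tune after reading a crunchstat-summary report.
        "min_cores_per_node": 4,
        "min_ram_mb_per_node": 8192,
        "keep_cache_mb_per_task": 512,
        "max_tasks_per_node": 2,
    },
}

template = {
    "name": "example-template",
    "components": {"example-job": component},
}

# Print the template as JSON, the form in which templates are usually edited.
print(json.dumps(template, indent=2, sort_keys=True))
```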

h3. How to install crunchstat-summary

<pre>
$ git clone https://github.com/curoverse/arvados.git
$ cd arvados/tools/crunchstat-summary/
$ python setup.py build
$ python setup.py install --user
</pre>

h3. How to use crunchstat-summary

<pre>
$ ./bin/crunchstat-summary --help
usage: crunchstat-summary [-h]
                          [--job UUID | --pipeline-instance UUID | --log-file LOG_FILE]
                          [--skip-child-jobs] [--format {html,text}]
                          [--verbose]

Summarize resource usage of an Arvados Crunch job

optional arguments:
  -h, --help            show this help message and exit
  --job UUID            Look up the specified job and read its log data from
                        Keep (or from the Arvados event log, if the job is
                        still running)
  --pipeline-instance UUID
                        Summarize each component of the given pipeline
                        instance
  --log-file LOG_FILE   Read log data from a regular file
  --skip-child-jobs     Do not include stats from child jobs
  --format {html,text}  Report format
  --verbose, -v         Log more information (once for progress, twice for
                        debug)
</pre>
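
If you want to drive the tool from a script, the options in the help text can be assembled like this. This is a sketch, not part of crunchstat-summary: @build_command@ and @write_report@ are hypothetical helper names, the helper insists on exactly one input source for simplicity (the usage line marks them as optional), and it assumes the report is written to standard output.

```python
import subprocess


def build_command(job=None, pipeline_instance=None, log_file=None,
                  report_format="text", skip_child_jobs=False):
    """Build an argv list using only flags listed in the --help output."""
    sources = [
        ("--job", job),
        ("--pipeline-instance", pipeline_instance),
        ("--log-file", log_file),
    ]
    chosen = [(flag, value) for flag, value in sources if value is not None]
    # The three input flags are mutually exclusive in the usage line.
    if len(chosen) != 1:
        raise ValueError("give exactly one of job, pipeline_instance, log_file")
    if report_format not in ("html", "text"):
        raise ValueError("report_format must be 'html' or 'text'")

    flag, value = chosen[0]
    cmd = ["crunchstat-summary", flag, value, "--format", report_format]
    if skip_child_jobs:
        cmd.append("--skip-child-jobs")
    return cmd


def write_report(cmd, out_path):
    """Run the command and save its output.

    Assumption: the report goes to standard output.
    """
    with open(out_path, "wb") as out:
        subprocess.check_call(cmd, stdout=out)
```

For example, @write_report(build_command(job=job_uuid, report_format="html"), "report.html")@ would save an HTML report for the job whose UUID is held in @job_uuid@.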

In text mode (@--format text@), use the node recommendations and the Keep cache size.

In HTML mode (@--format html@), check whether you are CPU bound or I/O bound, and check whether any tasks are behaving oddly (for example, the GATK Queue case).

When to pipe and when to write to Keep: in general, writing straight to Keep pays off. If you run @crunchstat-summary --format html@ and see Keep I/O stop once in a while, you are CPU bound. If you see CPU usage level off and keep-read or keep-write taking too long, you are I/O bound.

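The CPU-bound versus I/O-bound rule of thumb above can be written down as a small heuristic. This is only an illustration of the reasoning, applied to numbers you might read off the HTML graphs. @classify_bottleneck@ is a hypothetical helper, and the thresholds (@busy_cpu@, @idle_io@, and the 50% share) are assumptions, not values used by crunchstat-summary.

```python
def classify_bottleneck(cpu_fraction, keep_mb_per_s, busy_cpu=0.9, idle_io=1.0):
    """Guess whether a job looks CPU bound or I/O bound.

    cpu_fraction:  per-sample CPU use as a fraction of allocated cores (0.0-1.0)
    keep_mb_per_s: per-sample Keep read+write throughput in MB/s
    busy_cpu, idle_io: hypothetical thresholds for "CPU is busy" and
                       "Keep I/O has stopped"
    """
    if not cpu_fraction or len(cpu_fraction) != len(keep_mb_per_s):
        raise ValueError("need two non-empty series of equal length")

    n = len(cpu_fraction)
    cpu_busy_share = sum(1 for c in cpu_fraction if c >= busy_cpu) / n
    io_idle_share = sum(1 for r in keep_mb_per_s if r <= idle_io) / n

    # CPU is busy most of the time and Keep I/O pauses now and then:
    # the job is waiting on computation, not on data.
    if cpu_busy_share >= 0.5 and io_idle_share > 0:
        return "cpu-bound"
    # CPU has levelled off below capacity while Keep I/O never stops:
    # the job is waiting on reads or writes.
    if cpu_busy_share < 0.5 and io_idle_share == 0:
        return "io-bound"
    return "unclear"


print(classify_bottleneck([0.95, 0.97, 0.96, 0.98], [40, 0, 35, 0]))
print(classify_bottleneck([0.30, 0.35, 0.30, 0.32], [80, 85, 82, 79]))
```

The first call models a job whose Keep traffic stops while the CPU stays busy; the second models a job whose CPU usage has levelled off while Keep traffic is continuous.
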
h3. Choosing the right number of jobs

Each job must output a collection, so if you don't want to output a file, then

h2. Job Optimization

h3. How to optimize the number when you don't have native multithreading

Tools like GATK have native multithreading, which you enable by passing a thread-count option such as @-t@. Tools like VarScan and FreeBayes do not, so you need a workaround. Some tools accept a @-L@ (@--intervals@) option that restricts the run to certain loci. If you have a BED file you can split on, you can create a new task per interval.

example here
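
A minimal sketch of the per-interval idea: read a BED file, turn each record into an interval string, and build one command line per interval. The names @bed_to_intervals@ and @task_commands@, the tool name @example-caller@, and its arguments are hypothetical. The sketch assumes the tool accepts 1-based inclusive @chrom:start-end@ intervals after @-L@, and it leaves out how each command is scheduled as a task.

```python
def bed_to_intervals(bed_lines):
    """Convert BED records to 'chrom:start-end' interval strings."""
    intervals = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines, comments, and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # BED is 0-based, half-open; the interval string is 1-based, inclusive.
        intervals.append("%s:%d-%d" % (chrom, start + 1, end))
    return intervals


def task_commands(base_cmd, intervals, flag="-L"):
    """Return one argv list per interval, each restricted with `flag`."""
    return [list(base_cmd) + [flag, interval] for interval in intervals]


bed = [
    "track name=example",
    "chr1\t0\t1000",
    "chr1\t1000\t2500\tregionB",
    "",
    "chr2\t499\t800",
]
intervals = bed_to_intervals(bed)
# "example-caller" and its arguments are placeholders for a real tool.
for cmd in task_commands(["example-caller", "--input", "sample.bam"], intervals):
    print(" ".join(cmd))
```

Each printed line is the command one task would run, so the number of tasks equals the number of BED records.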

h3. Piping between tools or writing to a tmpdir