Bryan Cosca, 04/14/2016 06:16 PM
h1. Pipeline Optimization

h2. Crunchstat Summary

crunchstat-summary is an Arvados tool that helps you choose optimal configurations for Arvados jobs and pipeline instances. It helps you choose the "runtime_constraints":http://doc.arvados.org/api/schema/Job.html specified under each job in the pipeline template, and it graphs general statistics for a job, such as CPU usage, RAM usage, and Keep network traffic over the duration of the job.
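
As a rough illustration, the runtime_constraints for one component of a pipeline template might be assembled like this. The key names are the ones I understand the Jobs schema linked above to define, so check them against that page; the script name and every value are placeholder assumptions, not recommendations.

```python
import json

# Hypothetical values for illustration only; take real numbers from the
# crunchstat-summary report for your own job.
runtime_constraints = {
    "min_cores_per_node": 4,        # assumed: the job keeps about 4 cores busy
    "min_ram_mb_per_node": 8192,    # assumed: peak RAM stays under 8 GiB
    "keep_cache_mb_per_task": 512,  # assumed: Keep cache per task, in MiB
}


def component_with_constraints(script, constraints):
    """Return a pipeline template component dict carrying the constraints."""
    # Copy the dict so later edits to the component do not change the input.
    return {"script": script, "runtime_constraints": dict(constraints)}


# "my_script.py" is a made-up script name.
component = component_with_constraints("my_script.py", runtime_constraints)
print(json.dumps(component, indent=2, sort_keys=True))
```

The printed JSON is only the fragment for a single component; a real template contains other fields that are left out here.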

h3. How to install crunchstat-summary

<pre>
$ git clone https://github.com/curoverse/arvados.git
$ cd arvados/tools/crunchstat-summary/
$ python setup.py build
$ python setup.py install --user
</pre>

h3. How to use crunchstat-summary

<pre>
$ ./bin/crunchstat-summary --help
usage: crunchstat-summary [-h]
                          [--job UUID | --pipeline-instance UUID | --log-file LOG_FILE]
                          [--skip-child-jobs] [--format {html,text}]
                          [--verbose]

Summarize resource usage of an Arvados Crunch job

optional arguments:
  -h, --help            show this help message and exit
  --job UUID            Look up the specified job and read its log data from
                        Keep (or from the Arvados event log, if the job is
                        still running)
  --pipeline-instance UUID
                        Summarize each component of the given pipeline
                        instance
  --log-file LOG_FILE   Read log data from a regular file
  --skip-child-jobs     Do not include stats from child jobs
  --format {html,text}  Report format
  --verbose, -v         Log more information (once for progress, twice for
                        debug)
</pre>

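The options in the help text can be combined into a command line with a small helper. This is only a sketch: the function name @build_command@ and the UUID are made up, and the helper insists on exactly one log source for simplicity, even though the help text only shows the three sources as mutually exclusive.

```python
import shlex


def build_command(job=None, pipeline_instance=None, log_file=None,
                  skip_child_jobs=False, report_format="text", verbose=0):
    """Build a crunchstat-summary argv list from the options in the help text."""
    sources = [("--job", job),
               ("--pipeline-instance", pipeline_instance),
               ("--log-file", log_file)]
    chosen = [(flag, value) for flag, value in sources if value is not None]
    # The help text lists these three options as mutually exclusive; this
    # sketch additionally requires that one of them is given.
    if len(chosen) != 1:
        raise ValueError("give exactly one of job, pipeline_instance, log_file")
    if report_format not in ("html", "text"):
        raise ValueError("report_format must be 'html' or 'text'")
    argv = ["crunchstat-summary", chosen[0][0], chosen[0][1]]
    if skip_child_jobs:
        argv.append("--skip-child-jobs")
    argv += ["--format", report_format]
    # --verbose may be repeated: once for progress, twice for debug.
    argv += ["--verbose"] * verbose
    return argv


# The UUID below is a made-up placeholder, not a real job.
argv = build_command(job="zzzzz-8i9sb-0123456789abcde", report_format="html")
print(" ".join(shlex.quote(a) for a in argv))
```

The returned list can be passed to @subprocess.run@ as-is, which avoids shell quoting problems with file names.
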
In text mode (--format text), use the node recommendations and the Keep cache size from the report.

In HTML mode (--format html), check whether the job is CPU bound or I/O bound, and check whether any tasks behave unexpectedly (for example, the GATK Queue case).

When to pipe and when to write to Keep:

In general, writing straight to Keep pays off. If you run crunchstat-summary --format html and see Keep I/O stop every once in a while, the job is CPU bound. If you see CPU usage level off while keep-read or keep-write takes too long, the job is I/O bound.

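The rule of thumb above can be written down as a small heuristic. This is a sketch under assumptions, not part of crunchstat-summary: it expects you to have already extracted two sampled series from the graphs or logs, it reads "CPU levels off" as CPU usage staying low, and every threshold is a hypothetical default.

```python
def classify_bottleneck(cpu_busy, keep_mb_per_s, idle_keep_mb_per_s=1.0,
                        busy_cpu=0.9, low_cpu=0.5, min_fraction=0.25):
    """Guess whether a job is CPU bound or I/O bound from sampled series.

    cpu_busy: fraction of the available CPU in use per interval (0.0 to 1.0).
    keep_mb_per_s: Keep read plus write throughput per interval, in MB/s.
    All thresholds are hypothetical defaults chosen for this sketch.
    """
    if len(cpu_busy) != len(keep_mb_per_s) or not cpu_busy:
        raise ValueError("need two non-empty series of equal length")
    samples = list(zip(cpu_busy, keep_mb_per_s))
    # Intervals where Keep traffic pauses while the CPU stays busy.
    cpu_bound = sum(1 for cpu, keep in samples
                    if keep < idle_keep_mb_per_s and cpu >= busy_cpu)
    # Intervals where the CPU sits low while Keep traffic continues.
    io_bound = sum(1 for cpu, keep in samples
                   if keep >= idle_keep_mb_per_s and cpu < low_cpu)
    threshold = min_fraction * len(samples)
    if cpu_bound > io_bound and cpu_bound >= threshold:
        return "cpu-bound"
    if io_bound > cpu_bound and io_bound >= threshold:
        return "io-bound"
    return "unclear"


# Made-up samples: Keep traffic keeps pausing while the CPU is saturated.
print(classify_bottleneck([0.95, 0.97, 0.96, 0.95], [20.0, 0.0, 0.2, 0.0]))
```

Treat the result as a hint for where to look in the HTML report, not as a measurement.
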
h3. Choosing the right number of jobs

Each job must output a collection, so if you don't want to output a file, then

h2. Job Optimization

h3. How to optimize the number when you don't have native multithreading

Tools like GATK have native multithreading, which you enable by passing a thread-count option such as -t.

Tools like VarScan and FreeBayes don't have native multithreading, so you need to find a workaround. Some tools accept a -L or --intervals option that restricts them to certain loci. If you have a BED file you can split on, you can create a new task per interval.

Example here.

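A minimal sketch of splitting on a BED file follows. It only builds one command line per interval; creating the actual Crunch tasks is left out. The tool name @my_caller@, its @--input@ option, and the sample intervals are placeholders, and the conversion assumes the tool wants 1-based, inclusive @chrom:start-end@ strings, so check what your tool expects.

```python
def bed_to_intervals(bed_lines):
    """Convert BED records to 'chrom:start-end' strings, one per record.

    BED uses 0-based, half-open coordinates; the interval strings here are
    1-based and inclusive, so only the start is shifted by one.
    """
    intervals = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append("%s:%d-%d" % (chrom, int(start) + 1, int(end)))
    return intervals


def task_commands(base_command, bed_lines, flag="-L"):
    """Return one argv list per interval; each would become its own task."""
    return [list(base_command) + [flag, interval]
            for interval in bed_to_intervals(bed_lines)]


# Made-up intervals; "my_caller" is a placeholder, not a real tool name.
bed = ["chr1\t0\t1000", "chr1\t1000\t2500", "chr2\t499\t800"]
for cmd in task_commands(["my_caller", "--input", "sample.bam"], bed):
    print(" ".join(cmd))
```

With many small intervals, one task per interval can cost more in scheduling overhead than it saves, so grouping several intervals per task may be worth trying.
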
h3. Piping between tools or writing to a tmpdir