h1. Pipeline Optimization

h2. Crunchstat Summary

Crunchstat-summary is an Arvados tool that helps you choose optimal configurations for Arvados jobs and pipeline instances. It helps you choose the "runtime_constraints":http://doc.arvados.org/api/schema/Job.html specified under each job in the pipeline template, and it graphs general statistics for the job, for example CPU usage, RAM, and Keep network traffic across the duration of the job.

h3. How to install crunchstat-summary

<pre>
$ git clone https://github.com/curoverse/arvados.git
$ cd arvados/tools/crunchstat-summary/
$ python setup.py build
$ python setup.py install --user
</pre>

h3. How to use crunchstat-summary

<pre>
$ ./bin/crunchstat-summary --help
usage: crunchstat-summary [-h]
                          [--job UUID | --pipeline-instance UUID | --log-file LOG_FILE]
                          [--skip-child-jobs] [--format {html,text}]
                          [--verbose]

Summarize resource usage of an Arvados Crunch job

optional arguments:
  -h, --help            show this help message and exit
  --job UUID            Look up the specified job and read its log data from
                        Keep (or from the Arvados event log, if the job is
                        still running)
  --pipeline-instance UUID
                        Summarize each component of the given pipeline
                        instance
  --log-file LOG_FILE   Read log data from a regular file
  --skip-child-jobs     Do not include stats from child jobs
  --format {html,text}  Report format
  --verbose, -v         Log more information (once for progress, twice for
                        debug)
</pre>

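For example, you can summarize a finished job by its UUID; the UUID below is the same one that appears in the example report that follows:

<pre>
$ ./bin/crunchstat-summary --job qr1hi-8i9sb-bzn6hzttfu9cetv --format text
</pre>
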
Example job: bwa-aln + samtools -Sb

<pre>
category        metric          task_max        task_max_rate   job_total
blkio:202:0     read            310334464       -               913853440
blkio:202:0     write           2567127040      -               7693406208
blkio:202:16    read            8036201472      155118884.01    4538585088
blkio:202:16    write           55502038016     0               0
blkio:202:32    read            2756608         100760.59       6717440
blkio:202:32    write           53570560        0               99514368
cpu             cpus            8               -               -
cpu             sys             1592.34         1.17            805.32
cpu             user            11061.28        7.98            4620.17
cpu             user+sys        12653.62        8.00            5425.49
mem             cache           7454289920      -               -
mem             pgmajfault      1859            -               830
mem             rss             7965265920      -               -
mem             swap            5537792         -               -
net:docker0     rx              2023609029      -               2093089079
net:docker0     tx              21404100070     -               49909181906
net:docker0     tx+rx           23427709099     -               52002270985
net:eth0        rx              44750669842     67466325.07     14233805360
net:eth0        tx              2126085781      20171074.09     3670464917
net:eth0        tx+rx           46876755623     67673532.73     17904270277
time            elapsed         949             -               1899
# Number of tasks: 3
# Max CPU time spent by a single task: 12653.62s
# Max CPU usage in a single interval: 799.88%
# Overall CPU usage: 285.70%
# Max memory used by a single task: 7.97GB
# Max network traffic in a single task: 46.88GB
# Max network speed in a single interval: 67.67MB/s
# Keep cache miss rate 0.00%
# Keep cache utilization 0.00%
#!! qr1hi-8i9sb-bzn6hzttfu9cetv max CPU usage was 800% -- try runtime_constraints "min_cores_per_node":8
#!! qr1hi-8i9sb-bzn6hzttfu9cetv max RSS was 7597 MiB -- try runtime_constraints "min_ram_mb_per_node":7782
</pre>

The #!! lines at the bottom are the recommendations: suggested runtime_constraints values (here "min_cores_per_node":8 and "min_ram_mb_per_node":7782) to use the next time you run this job.
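
As a rough sketch of where those values go (the component name, script, and repository fields below are placeholders; only the runtime_constraints keys come from the report above), the relevant part of the pipeline template component would look something like:

<pre>
"components": {
  "bwa-aln": {
    "script": "your-crunch-script.py",
    "script_version": "master",
    "repository": "you/your-repo",
    "runtime_constraints": {
      "min_cores_per_node": 8,
      "min_ram_mb_per_node": 7782
    }
  }
}
</pre>
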
In --format text mode, use the recommendations at the bottom to choose node sizes and the Keep cache size.

In --format html mode, the report plots resource usage over time, so you can check whether the job is CPU-bound or I/O-bound and spot tasks that behave strangely (for example, the GATK Queue case).
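
For example, to generate the HTML report for the same job (assuming the HTML is written to standard output, so it needs to be redirected to a file):

<pre>
$ ./bin/crunchstat-summary --job qr1hi-8i9sb-bzn6hzttfu9cetv --format html > report.html
</pre>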

!86538baca4ecef099d9fad76ad9c7180.png!

h3. When to pipe and when to write to Keep

In general, writing straight to Keep reaps benefits. If you run crunchstat-summary --format html and you see Keep I/O stopping once in a while, you're CPU-bound. If you see CPU level off while keep-read or keep-write takes too long, you're I/O-bound.

That said, it's always safe for a job to write to a temporary directory and then spend time writing the file to Keep. Writing straight to Keep, on the other hand, saves that extra upload step at the end. If you have time, it's worth trying both and seeing how much time you save. Most of the time, writing straight to Keep using TaskOutputDir will be the right option, but using a tmpdir is always the safe alternative.

Which to choose usually depends on how your tool uses its output directory. If it reads from and writes to it a lot, it might be worth using a tmpdir (on local SSD) rather than going through the network. If it just treats the output directory as a place to put its final output, TaskOutputDir should work.
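
As a minimal sketch of the two approaches in a Python crunch script (my_tool and the file names are placeholders; TaskOutputDir and CollectionWriter come from the Arvados Python SDK, and you would use one option or the other, not both):

<pre>
import subprocess
import arvados
import arvados.crunch

def write_straight_to_keep():
    # Option 1: write straight to Keep through a writable Keep mount.
    out = arvados.crunch.TaskOutputDir()
    subprocess.check_call(['my_tool', '--output', out.path + '/result.bam'])
    arvados.current_task().set_output(out.manifest_text())

def write_to_tmpdir_then_keep():
    # Option 2: write to the task's local scratch directory, then upload.
    tmpdir = arvados.current_task().tmpdir
    subprocess.check_call(['my_tool', '--output', tmpdir + '/result.bam'])
    writer = arvados.CollectionWriter()
    writer.write_directory_tree(tmpdir)
    arvados.current_task().set_output(writer.finish())
</pre>
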
h3. Choosing the right number of jobs

Each job must output a collection, so if you don't want to keep a command's output as its own collection, combine that command with other commands in a single job.

h2. Job Optimization

h3. How to optimize the number of tasks when you don't have native multithreading

Tools like GATK have native multithreading, where you pass a -t option. Here you usually want to use that threading and choose min_cores_per_node accordingly. You can use any value of min_tasks_per_node as long as tool_threads * min_tasks_per_node <= min_cores_per_node.
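
For example, with min_cores_per_node set to 8 and the tool run with -t 4, two such tasks fit on a node exactly: 4 * 2 = 8 <= 8.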

Tools like VarScan or FreeBayes don't have native multithreading, so you need to find a workaround. Generally, some tools have an option such as -L/--intervals to pass in certain loci to work on. If you have a BED file you can split reads on, you can create a new task per interval, as in the sketch below.

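A minimal sketch of that pattern in a Python crunch script (the BED parsing helper and the "interval" parameter name are illustrative; the job_tasks calls follow the Arvados Python SDK pattern for queueing new tasks):

<pre>
import arvados

def bed_intervals(bed_path):
    # Illustrative helper: yield "chrom:start-end" strings from a BED file.
    with open(bed_path) as bed:
        for line in bed:
            chrom, start, end = line.split()[:3]
            yield '%s:%s-%s' % (chrom, start, end)

this_task = arvados.current_task()

if this_task['sequence'] == 0:
    # The first task does no work itself; it queues one child task per interval.
    for interval in bed_intervals('regions.bed'):
        arvados.api('v1').job_tasks().create(body={
            'job_uuid': arvados.current_job()['uuid'],
            'created_by_job_task_uuid': this_task['uuid'],
            'sequence': 1,
            'parameters': {'interval': interval},
        }).execute()
    this_task.set_output(None)
else:
    # Each child task runs the tool on just its own interval,
    # e.g. passing it to the tool's -L/--intervals style option.
    interval = this_task['parameters']['interval']
</pre>
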
h3. Piping between tools or writing to a tmpdir

Creating pipes between tools has sometimes been shown to be faster than writing to and reading from disk. Feel free to pipe your tools together, for example using subprocess.PIPE in the "python subprocess module":https://docs.python.org/2/library/subprocess.html.
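
A small sketch of that approach for the bwa + samtools example above (file names are placeholders, and bwa samse is used here to produce the SAM stream that samtools converts to BAM):

<pre>
import subprocess

# Stream SAM from bwa directly into samtools instead of writing an
# intermediate SAM file to disk. File names are placeholders.
with open('output.bam', 'wb') as bam:
    bwa = subprocess.Popen(
        ['bwa', 'samse', 'ref.fa', 'reads.sai', 'reads.fastq'],
        stdout=subprocess.PIPE)
    samtools = subprocess.Popen(
        ['samtools', 'view', '-Sb', '-'],
        stdin=bwa.stdout, stdout=bam)
    bwa.stdout.close()  # so bwa gets SIGPIPE if samtools exits early
    samtools.communicate()
    bwa.wait()
</pre>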