
Pipeline Optimization » History » Version 14

Bryan Cosca, 04/15/2016 08:42 PM

h1. Pipeline Optimization
This wiki page is designed to help users make their Arvados pipelines cost- and compute-efficient for production-scale data.
h2. Crunchstat Summary
Crunchstat-summary is an Arvados tool that helps you choose optimal configurations for Arvados jobs and pipeline instances. It helps you choose the "runtime_constraints":http://doc.arvados.org/api/schema/Job.html specified under each job in the pipeline template, and it graphs general statistics for the job, for example CPU usage, RAM, and Keep network traffic over the duration of the job.
h3. How to install crunchstat-summary
<pre>
$ git clone https://github.com/curoverse/arvados.git
$ cd arvados/tools/crunchstat-summary/
$ python setup.py build
$ python setup.py install --user
</pre>
h3. How to use crunchstat-summary
<pre>
$ ./bin/crunchstat-summary --help
usage: crunchstat-summary [-h]
                          [--job UUID | --pipeline-instance UUID | --log-file LOG_FILE]
                          [--skip-child-jobs] [--format {html,text}]
                          [--verbose]

Summarize resource usage of an Arvados Crunch job

optional arguments:
  -h, --help            show this help message and exit
  --job UUID            Look up the specified job and read its log data from
                        Keep (or from the Arvados event log, if the job is
                        still running)
  --pipeline-instance UUID
                        Summarize each component of the given pipeline
                        instance
  --log-file LOG_FILE   Read log data from a regular file
  --skip-child-jobs     Do not include stats from child jobs
  --format {html,text}  Report format
  --verbose, -v         Log more information (once for progress, twice for
                        debug)
</pre>
There are two ways of using crunchstat-summary: a plain-text view, and an HTML page that graphs usage over time.
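For example, both modes can be invoked as follows (the job UUID below is a placeholder; substitute your own):

```shell
# Plain-text summary printed to stdout:
$ ./bin/crunchstat-summary --job qr1hi-8i9sb-xxxxxxxxxxxxxxx

# HTML report with usage graphed over time, saved for viewing in a browser:
$ ./bin/crunchstat-summary --job qr1hi-8i9sb-xxxxxxxxxxxxxxx --format html > report.html
```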
Case study 1: a job that does bwa-aln mapping and converts the output to BAM using samtools.
<pre>
category        metric  task_max        task_max_rate   job_total
blkio:202:0     read    310334464       -       913853440
blkio:202:0     write   2567127040      -       7693406208
blkio:202:16    read    8036201472      155118884.01    4538585088
blkio:202:16    write   55502038016     0       0
blkio:202:32    read    2756608 100760.59       6717440
blkio:202:32    write   53570560        0       99514368
cpu     cpus    8       -       -
cpu     sys     1592.34 1.17    805.32
cpu     user    11061.28        7.98    4620.17
cpu     user+sys        12653.62        8.00    5425.49
mem     cache   7454289920      -       -
mem     pgmajfault      1859    -       830
mem     rss     7965265920      -       -
mem     swap    5537792 -       -
net:docker0     rx      2023609029      -       2093089079
net:docker0     tx      21404100070     -       49909181906
net:docker0     tx+rx   23427709099     -       52002270985
net:eth0        rx      44750669842     67466325.07     14233805360
net:eth0        tx      2126085781      20171074.09     3670464917
net:eth0        tx+rx   46876755623     67673532.73     17904270277
time    elapsed 949     -       1899
# Number of tasks: 3
# Max CPU time spent by a single task: 12653.62s
# Max CPU usage in a single interval: 799.88%
# Overall CPU usage: 285.70%
# Max memory used by a single task: 7.97GB
# Max network traffic in a single task: 46.88GB
# Max network speed in a single interval: 67.67MB/s
# Keep cache miss rate 0.00%
# Keep cache utilization 0.00%
#!! qr1hi-8i9sb-bzn6hzttfu9cetv max CPU usage was 800% -- try runtime_constraints "min_cores_per_node":8
#!! qr1hi-8i9sb-bzn6hzttfu9cetv max RSS was 7597 MiB -- try runtime_constraints "min_ram_mb_per_node":7782
</pre>
!86538baca4ecef099d9fad76ad9c7180.png!
Here you can see the distinct computation of the bwa-aln step versus the samtools step. CPU usage plateaus, so it could be worth trying a bigger node, for example a 16-core node, to see whether the plateau is really at 8 CPUs or the job can scale higher. You can also see the runtime_constraints recommendations at the bottom of the text report.
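Applying those recommendations means editing the job's runtime_constraints in the pipeline template. A minimal sketch, using the values suggested above (the component name is a placeholder, and other fields of the template are omitted):

```json
{
  "components": {
    "bwa-aln-samtools": {
      "runtime_constraints": {
        "min_cores_per_node": 8,
        "min_ram_mb_per_node": 7782
      }
    }
  }
}
```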
Case study 2: FastQC
<pre>
category	metric	task_max	task_max_rate	job_total
blkio:0:0	read	174349211138	65352499.20	174349211138
blkio:0:0	write	0	0	0
cpu	cpus	8	-	-
cpu	sys	364.95	0.17	364.95
cpu	user	17589.59	6.59	17589.59
cpu	user+sys	17954.54	6.72	17954.54
fuseops	read	1330241	498.40	1330241
fuseops	write	0	0	0
keepcache	hit	2655806	1038.00	2655806
keepcache	miss	2633	1.60	2633
keepcalls	get	2658439	1039.00	2658439
keepcalls	put	0	0	0
mem	cache	19836608512	-	-
mem	pgmajfault	19	-	19
mem	rss	1481367552	-	-
net:eth0	rx	178321	17798.40	178321
net:eth0	tx	7156	685.00	7156
net:eth0	tx+rx	185477	18483.40	185477
net:keep0	rx	175959092914	107337311.20	175959092914
net:keep0	tx	0	0	0
net:keep0	tx+rx	175959092914	107337311.20	175959092914
time	elapsed	3301	-	3301
# Number of tasks: 1
# Max CPU time spent by a single task: 17954.54s
# Max CPU usage in a single interval: 672.01%
# Overall CPU usage: 543.91%
# Max memory used by a single task: 1.48GB
# Max network traffic in a single task: 175.96GB
# Max network speed in a single interval: 107.36MB/s
# Keep cache miss rate 0.10%
# Keep cache utilization 99.09%
#!! qr1hi-8i9sb-nxqqxravvapt10h max CPU usage was 673% -- try runtime_constraints "min_cores_per_node":7
#!! qr1hi-8i9sb-nxqqxravvapt10h max RSS was 1413 MiB -- try runtime_constraints "min_ram_mb_per_node":1945
</pre>
!62222dc72a51c18c15836796e91f3bc7.png!
One thing to point out here is Keep cache utilization, which can be tuned with the "keep_cache_mb_per_task":http://doc.arvados.org/api/schema/Job.html runtime constraint. Keep cache utilization is at 99.09%, which is a good point to be at. You can try increasing the cache size, but since utilization is already almost at 100%, it may not yield significant gains.
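As a sketch, the cache size is set per task in runtime_constraints (the component name is a placeholder, and 512 MiB is just an illustrative value, not a recommendation):

```json
{
  "components": {
    "fastqc": {
      "runtime_constraints": {
        "keep_cache_mb_per_task": 512
      }
    }
  }
}
```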
You can also see Keep I/O mimic the CPU usage, which should mean the job is healthy and neither CPU- nor I/O-bound.
h2. Job Optimization
h3. When to write straight to Keep vs. staging a file in a temporary directory and uploading afterward
In general, writing straight to Keep will reap benefits. If you run crunchstat-summary --html and see Keep I/O stopping once in a while, you're probably CPU-bound. If you see sporadic CPU usage and Keep transfers taking too long, you're probably I/O-bound.
That being said, it is very safe for a job to write to a temporary directory and then spend time writing the file to Keep, whereas writing straight to Keep saves that separate upload step. If you have time, it's worth trying both and seeing how much time you save. Most of the time, writing straight to Keep using TaskOutputDir will be the right option, but using a tmpdir is always the safe alternative.
The choice usually depends on how your tool works with its output directory. If it reads from and writes to the directory a lot, it might be worth using a temporary directory on local SSD rather than going through the network. If it just treats the output directory as a space for stdout, TaskOutputDir should work just fine.
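As an illustration, a minimal crunch-script sketch of the two approaches, assuming the Python crunch SDK's arvados.crunch.TaskOutputDir and arvados.CollectionWriter (my_tool is a placeholder command; this only runs inside an Arvados crunch job):

```python
import subprocess
import tempfile

import arvados
import arvados.crunch

# Option 1: write straight to Keep. TaskOutputDir exposes a writable
# directory backed by Keep; whatever the tool writes there becomes the
# task output without a separate upload step afterward.
out = arvados.crunch.TaskOutputDir()
subprocess.check_call(["my_tool", "--output-dir", out.path])  # placeholder tool
arvados.current_task().set_output(out.manifest_text())

# Option 2: stage in a local tmpdir (fast for read/write-heavy tools),
# then spend time uploading the result to Keep afterward.
tmp = tempfile.mkdtemp()
subprocess.check_call(["my_tool", "--output-dir", tmp])  # placeholder tool
writer = arvados.CollectionWriter()
writer.write_directory_tree(tmp)
arvados.current_task().set_output(writer.finish())
```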
h3. Choosing the right number of jobs
Each job must output a collection, so if you don't want to save an intermediate file, you should combine commands into one job. If you want a lot of 'checkpoints', you should have a job for each command; the downside is more intermediate outputs. One upside to having more jobs is that you can choose a node type for each command. For example, bwa-mem scales a lot better than FastQC or VarScan, so reserving a 16-core node for a tool without native multithreading would be wasteful.
h3. Choosing the right number of tasks
max_tasks_per_node lets you choose how many tasks to run on one machine. For example, if you have a lot of small tasks that each use 1 core and 1 GB of RAM, you can put several of them on a bigger machine, for example 8 tasks on an 8-core machine. If you want to utilize machines better for cost savings, use crunchstat-summary to find the maximum memory/CPU usage of one task, then see whether more than one fits on a machine. One warning, however: if you do run out of RAM (some compute nodes cannot swap), your process will die with an extraneous error. Sometimes the error is obvious, sometimes it's a red herring.
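The 1-core/1 GB example above might look like this in a pipeline template (the component name is a placeholder; eight such tasks share one 8-core node):

```json
{
  "components": {
    "small-task-step": {
      "runtime_constraints": {
        "min_cores_per_node": 8,
        "min_ram_mb_per_node": 8192,
        "max_tasks_per_node": 8
      }
    }
  }
}
```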
h3. How to optimize the number of tasks when you don't have native multithreading
Tools like GATK have native multithreading, where you pass a thread count (e.g. -t). Here you usually want to use that threading, and choose min_cores_per_node to match. You can use any max_tasks_per_node as long as tool_threads * max_tasks_per_node <= min_cores_per_node.
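For example, a sketch for a tool run with 4 threads and two tasks per node: 4 threads x 2 tasks = 8 cores, satisfying tool_threads * max_tasks_per_node <= min_cores_per_node (the component name is a placeholder):

```json
{
  "components": {
    "gatk-step": {
      "runtime_constraints": {
        "min_cores_per_node": 8,
        "max_tasks_per_node": 2
      }
    }
  }
}
```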
Tools like VarScan and FreeBayes don't have native multithreading, so you need to find a workaround. Generally, these tools have a -L/--intervals option to pass in specific loci to work on. If you have a BED file you can split reads on, you can create a new task per interval, then have a job merge the outputs together.
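A minimal, self-contained sketch of the splitting step (the round-robin chunking scheme and the interval values are illustrative assumptions; a real pipeline might split by base count instead):

```python
def split_bed(bed_lines, n_tasks):
    """Partition BED interval lines into n_tasks roughly equal groups,
    one group per task (round-robin keeps group sizes within 1 of each other)."""
    groups = [[] for _ in range(n_tasks)]
    for i, line in enumerate(bed_lines):
        groups[i % n_tasks].append(line)
    return groups

intervals = [
    "chr1\t0\t1000000",
    "chr1\t1000000\t2000000",
    "chr2\t0\t1500000",
    "chr2\t1500000\t3000000",
    "chr3\t0\t500000",
]
# Each group would become one task's -L/--intervals input;
# a downstream job merges the per-interval outputs.
groups = split_bed(intervals, 2)
print([len(g) for g in groups])  # [3, 2]
```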
h3. Piping between tools vs. writing to a tmpdir
Creating pipes between tools has sometimes proven faster than writing to and reading from disk. Feel free to pipe your tools together, for example using subprocess.PIPE in the "python subprocess module":https://docs.python.org/2/library/subprocess.html. Sometimes piping is faster, sometimes it's not; you'll have to try it for yourself.
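A self-contained sketch of the pattern, using stand-in Unix commands (sort piped into uniq; in a real pipeline the producer and consumer would be, e.g., bwa and samtools):

```python
import subprocess

# Producer | consumer without touching disk: the consumer reads the
# producer's stdout directly, like `sort | uniq` in a shell.
producer = subprocess.Popen(["sort"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
consumer = subprocess.Popen(["uniq"], stdin=producer.stdout, stdout=subprocess.PIPE)
producer.stdout.close()  # so the consumer sees EOF when the producer exits
producer.stdin.write(b"b\na\nb\n")
producer.stdin.close()
result = consumer.communicate()[0].decode()
print(result)  # "a\nb\n": sorted and de-duplicated
```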