Pipeline Optimization » History » Version 18

Bryan Cosca, 04/20/2016 08:02 PM

h1. Pipeline Optimization

h2. Overview

This wiki page is designed to help users make their Arvados pipelines cost- and compute-efficient for production-level data. It goes over Arvados best practices for pipeline design and for tuning resource requests.

h2. Pipeline design

h3. Choosing the right number of jobs

The right number of jobs depends on how versatile you want your pipeline to be. Specifically, how many steps do you want your pipeline to have?

Questions you may ask yourself are:

Do I want to output all my intermediate files to Keep?

How many checkpoints do I want my pipeline to have?

Do I want to do alignment and variant calling in one step? Should I separate them?

Each job must output a collection of files. If you don't want to output files from a command, you should combine multiple commands in a job. You always choose what to upload to Keep, so if you don't need files later on, it's best to leave them on the compute node.

If you want a lot of checkpoints, you should have a job for each command. You'll be able to resume or restart work easily if any unexpected interruption happens. You can also choose a different node type for each command. For example, BWA-MEM scales across cores much better than FastQC or VarScan, so a 16-core node can pay off for BWA-MEM but would be wasteful for a tool that doesn't have native multi-threading.

If you choose to do multiple computations in a job, try piping them together. Piping between tools is sometimes faster than writing intermediate files to disk and reading them back. You can do this, for example, with subprocess.PIPE in the "python subprocess module":https://docs.python.org/2/library/subprocess.html. Piping is not always faster, so you'll have to measure it for your own tools.

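As an illustration of the piping approach, here is a minimal sketch that connects two commands with subprocess.PIPE. The helper name run_piped and the two stand-in commands are hypothetical; in a real job they would be your own tools, such as an aligner writing to stdout and a converter reading from stdin.

```python
import subprocess
import sys


def run_piped(first_cmd, second_cmd):
    """Run `first_cmd | second_cmd` and return the second command's stdout (bytes)."""
    first = subprocess.Popen(first_cmd, stdout=subprocess.PIPE)
    second = subprocess.Popen(second_cmd, stdin=first.stdout, stdout=subprocess.PIPE)
    # Close our copy of the pipe so first_cmd is notified if second_cmd exits early.
    first.stdout.close()
    output, _ = second.communicate()
    first.wait()
    if first.returncode != 0 or second.returncode != 0:
        raise RuntimeError("pipeline failed: %s, %s" % (first.returncode, second.returncode))
    return output


# Stand-in commands so the sketch runs anywhere Python is installed.
producer = [sys.executable, "-c", "print('read1'); print('read2')"]
consumer = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
result = run_piped(producer, consumer)
```

Checking both return codes matters here: a failure in the first command would otherwise be hidden behind a successful exit of the second.
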
An alternative option is to store all your intermediate files in the task's temporary directory (arvados.current_task().tmpdir) and then upload only what you need to Keep.

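The staging idea can be sketched without the Arvados SDK. In this hypothetical example, work_dir stands in for the task's temporary directory and output_dir for the directory whose contents would be uploaded to Keep; the helper name stage_and_keep and the file names are assumptions made for illustration.

```python
import os
import shutil
import tempfile


def stage_and_keep(work_dir, output_dir, wanted_suffixes):
    """Copy only the files with a wanted suffix from work_dir to output_dir."""
    kept = []
    for name in sorted(os.listdir(work_dir)):
        if name.endswith(tuple(wanted_suffixes)):
            shutil.copy(os.path.join(work_dir, name), os.path.join(output_dir, name))
            kept.append(name)
    return kept


# Stand-ins for the task tmpdir and the output directory.
work_dir = tempfile.mkdtemp()
output_dir = tempfile.mkdtemp()
for name in ("sample.sam", "sample.bam", "sample.bam.bai"):
    with open(os.path.join(work_dir, name), "w") as handle:
        handle.write("placeholder\n")

# Keep the BAM and its index; the intermediate SAM stays on the compute node.
kept = stage_and_keep(work_dir, output_dir, [".bam", ".bai"])
```
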
h3. Choosing the right number of tasks

max_tasks_per_node lets you choose how many tasks run at the same time on a machine. For example, if you have a lot of small tasks that each use 1 core and 1 GB of RAM, you can put several of them on a bigger machine, such as 8 tasks on an 8-core machine. To use machines more efficiently and save cost, use crunchstat-summary to find the maximum memory and CPU usage of one task, and see whether more than one task fits on a machine. One warning: if you run out of RAM (some compute nodes can't swap), your process will die with an error that may look unrelated to memory. Sometimes the error is obvious; sometimes it's a red herring.

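The "how many tasks fit" question is simple arithmetic. A minimal sketch, assuming you already know the per-task peak usage from crunchstat-summary; the function name and the node sizes below are hypothetical.

```python
def tasks_that_fit(node_cores, node_ram_mb, task_cores, task_ram_mb):
    """Return how many tasks fit on one node, limited by both cores and RAM."""
    if task_cores <= 0 or task_ram_mb <= 0:
        raise ValueError("per-task requirements must be positive")
    by_cores = node_cores // task_cores
    by_ram = node_ram_mb // task_ram_mb
    return min(by_cores, by_ram)


# Small tasks (1 core, 1 GB each) on a hypothetical 8-core, 16000 MB node.
small = tasks_that_fit(node_cores=8, node_ram_mb=16000, task_cores=1, task_ram_mb=1024)

# The same node with tasks that need 7782 MB each: RAM is now the limit.
ram_limited = tasks_that_fit(node_cores=8, node_ram_mb=16000, task_cores=1, task_ram_mb=7782)
```

Because running out of RAM can kill the process, it is safer to leave some headroom than to use the exact peak value.
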
h3. How to optimize the number of tasks when you don't have native multithreading

Tools like BWA and GATK have native multi-threading: when you pass a thread-count option (for example -t for BWA, or -nt/-nct for GATK 3), the tool uses that many cores on the node. You usually want to take advantage of this and choose a min_cores_per_node that equals your threading parameter. You can also run more than one task per node with max_tasks_per_node, as long as threads per task * max_tasks_per_node <= min_cores_per_node and the node has enough RAM for all the tasks.

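The constraint on threads, tasks, and cores can be checked before submitting a job. This is a sketch only: the helper names are hypothetical, and it assumes the per-node task limit is set with max_tasks_per_node (the setting described in the previous section) alongside min_cores_per_node and min_ram_mb_per_node.

```python
def build_runtime_constraints(threads_per_task, tasks_per_node, ram_mb_per_task):
    """Build a runtime_constraints dict sized for tasks_per_node concurrent tasks."""
    return {
        "min_cores_per_node": threads_per_task * tasks_per_node,
        "max_tasks_per_node": tasks_per_node,
        "min_ram_mb_per_node": ram_mb_per_task * tasks_per_node,
    }


def threads_fit(constraints, threads_per_task):
    """Check that threads per task * tasks per node does not exceed the cores requested."""
    needed = threads_per_task * constraints["max_tasks_per_node"]
    return needed <= constraints["min_cores_per_node"]


# Two 4-thread tasks per node, each needing about 3000 MB of RAM.
constraints = build_runtime_constraints(threads_per_task=4, tasks_per_node=2, ram_mb_per_task=3000)
ok = threads_fit(constraints, threads_per_task=4)
oversubscribed = threads_fit(constraints, threads_per_task=8)
```
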
Tools like VarScan and freebayes don't have native multi-threading, so you need to find a workaround. Generally, these tools accept an option for restricting work to certain loci (in the style of -L/--intervals). If you have a BED file whose intervals you can split the work on, you can create a new task per interval and then have a job merge the outputs together.

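Splitting the work by interval could look like the following sketch. It only shows the bookkeeping: parsing BED lines and distributing the intervals across tasks. The function names are hypothetical, and how each group is passed to the tool depends on that tool's own interval option.

```python
def read_bed_intervals(lines):
    """Parse BED lines into (chrom, start, end) tuples, skipping headers and blanks."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def split_into_tasks(intervals, num_tasks):
    """Distribute intervals round-robin into at most num_tasks non-empty groups."""
    groups = [intervals[i::num_tasks] for i in range(num_tasks)]
    return [group for group in groups if group]


bed_lines = ["# example targets", "chr1\t0\t1000", "chr1\t1000\t2000", "chr2\t0\t500"]
intervals = read_bed_intervals(bed_lines)
task_groups = split_into_tasks(intervals, 2)
```

Each group would become one task, and a final job would merge the per-task outputs.
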
h3. Writing to Keep

There are two ways to write your output collection to Keep: writing straight to Keep (arvados.crunch.TaskOutputDir()), or staging files in a temporary directory and then uploading them to Keep.

In general, writing straight to Keep will reap more benefits. TaskOutputDir acts like a pipe, so you never have to spend node time on uploading data.

One problem arises if your job depends on using its output directory as temporary space for files. If your job uses its output directory for computation, it will be computing over the network and could become very slow. By contrast, it is always safe for a job to write to a temporary directory and then spend compute time uploading to Keep. If you have time, it's worth trying both approaches and comparing how long each takes. Most of the time, writing straight to Keep using TaskOutputDir will be the right option, but using a tmpdir is always the safe alternative.

h2. Crunchstat Summary

crunchstat-summary is an Arvados tool that helps you choose optimal configurations for Arvados jobs and pipeline instances. It helps you choose the "runtime_constraints":http://doc.arvados.org/api/schema/Job.html specified in the pipeline template under each job by summarizing and graphing job statistics, for example CPU usage, RAM, and Keep network traffic over time.

h3. How to install crunchstat-summary

<pre>
$ git clone https://github.com/curoverse/arvados.git
$ cd arvados/tools/crunchstat-summary/
$ python setup.py build
$ python setup.py install --user
</pre>

h3. How to use crunchstat-summary

<pre>
$ ./bin/crunchstat-summary --help
usage: crunchstat-summary [-h]
                          [--job UUID | --pipeline-instance UUID | --log-file LOG_FILE]
                          [--skip-child-jobs] [--format {html,text}]
                          [--verbose]

Summarize resource usage of an Arvados Crunch job

optional arguments:
  -h, --help            show this help message and exit
  --job UUID            Look up the specified job and read its log data from
                        Keep (or from the Arvados event log, if the job is
                        still running)
  --pipeline-instance UUID
                        Summarize each component of the given pipeline
                        instance
  --log-file LOG_FILE   Read log data from a regular file
  --skip-child-jobs     Do not include stats from child jobs
  --format {html,text}  Report format
  --verbose, -v         Log more information (once for progress, twice for
                        debug)
</pre>

crunchstat-summary offers two report formats: a text report that gives an overall view of a job, and an HTML page that graphs usage over time.

Case study 1: A job that does bwa-aln mapping and converts to BAM using samtools.

<pre>
category      metric      task_max     task_max_rate  job_total
blkio:202:0   read        310334464    -              913853440
blkio:202:0   write       2567127040   -              7693406208
blkio:202:16  read        8036201472   155118884.01   4538585088
blkio:202:16  write       55502038016  0              0
blkio:202:32  read        2756608      100760.59      6717440
blkio:202:32  write       53570560     0              99514368
cpu           cpus        8            -              -
cpu           sys         1592.34      1.17           805.32
cpu           user        11061.28     7.98           4620.17
cpu           user+sys    12653.62     8.00           5425.49
mem           cache       7454289920   -              -
mem           pgmajfault  1859         -              830
mem           rss         7965265920   -              -
mem           swap        5537792      -              -
net:docker0   rx          2023609029   -              2093089079
net:docker0   tx          21404100070  -              49909181906
net:docker0   tx+rx       23427709099  -              52002270985
net:eth0      rx          44750669842  67466325.07    14233805360
net:eth0      tx          2126085781   20171074.09    3670464917
net:eth0      tx+rx       46876755623  67673532.73    17904270277
time          elapsed     949          -              1899
# Number of tasks: 3
# Max CPU time spent by a single task: 12653.62s
# Max CPU usage in a single interval: 799.88%
# Overall CPU usage: 285.70%
# Max memory used by a single task: 7.97GB
# Max network traffic in a single task: 46.88GB
# Max network speed in a single interval: 67.67MB/s
# Keep cache miss rate 0.00%
# Keep cache utilization 0.00%
#!! qr1hi-8i9sb-bzn6hzttfu9cetv max CPU usage was 800% -- try runtime_constraints "min_cores_per_node":8
#!! qr1hi-8i9sb-bzn6hzttfu9cetv max RSS was 7597 MiB -- try runtime_constraints "min_ram_mb_per_node":7782
</pre>

!86538baca4ecef099d9fad76ad9c7180.png!

In the graph, you can see two distinct computation phases: the bwa-aln step and the samtools step. Since CPU usage plateaus noticeably during both computations (peaking at roughly 800% on this 8-core node), it is worth trying to run the job on a bigger node, for example a 16-core node, to see whether the computation can scale beyond 8 cores.

The text report also ends with runtime_constraints recommendations (the lines starting with #!!). Setting these values in your pipeline template helps ensure that the job requests the right node type and runs reliably when reproduced.

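If you want to collect those recommendations programmatically, a small parser is enough. This sketch assumes the recommendation lines keep the format shown in the text report above; the function name and the regular expression are my own and not part of crunchstat-summary.

```python
import re

# Matches e.g.: try runtime_constraints "min_cores_per_node":8
RECOMMENDATION = re.compile(r'try runtime_constraints "(\w+)":(\d+)')


def recommended_constraints(report_text):
    """Collect the runtime_constraints suggested on '#!!' lines of a text report."""
    constraints = {}
    for line in report_text.splitlines():
        if not line.startswith("#!!"):
            continue
        match = RECOMMENDATION.search(line)
        if match:
            constraints[match.group(1)] = int(match.group(2))
    return constraints


report = "\n".join([
    '# Max memory used by a single task: 7.97GB',
    '#!! qr1hi-8i9sb-bzn6hzttfu9cetv max CPU usage was 800% -- try runtime_constraints "min_cores_per_node":8',
    '#!! qr1hi-8i9sb-bzn6hzttfu9cetv max RSS was 7597 MiB -- try runtime_constraints "min_ram_mb_per_node":7782',
])
constraints = recommended_constraints(report)
```
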
Case study 2: FastQC

<pre>
category   metric      task_max      task_max_rate  job_total
blkio:0:0  read        174349211138  65352499.20    174349211138
blkio:0:0  write       0             0              0
cpu        cpus        8             -              -
cpu        sys         364.95        0.17           364.95
cpu        user        17589.59      6.59           17589.59
cpu        user+sys    17954.54      6.72           17954.54
fuseops    read        1330241       498.40         1330241
fuseops    write       0             0              0
keepcache  hit         2655806       1038.00        2655806
keepcache  miss        2633          1.60           2633
keepcalls  get         2658439       1039.00        2658439
keepcalls  put         0             0              0
mem        cache       19836608512   -              -
mem        pgmajfault  19            -              19
mem        rss         1481367552    -              -
net:eth0   rx          178321        17798.40       178321
net:eth0   tx          7156          685.00         7156
net:eth0   tx+rx       185477        18483.40       185477
net:keep0  rx          175959092914  107337311.20   175959092914
net:keep0  tx          0             0              0
net:keep0  tx+rx       175959092914  107337311.20   175959092914
time       elapsed     3301          -              3301
# Number of tasks: 1
# Max CPU time spent by a single task: 17954.54s
# Max CPU usage in a single interval: 672.01%
# Overall CPU usage: 543.91%
# Max memory used by a single task: 1.48GB
# Max network traffic in a single task: 175.96GB
# Max network speed in a single interval: 107.36MB/s
# Keep cache miss rate 0.10%
# Keep cache utilization 99.09%
#!! qr1hi-8i9sb-nxqqxravvapt10h max CPU usage was 673% -- try runtime_constraints "min_cores_per_node":7
#!! qr1hi-8i9sb-nxqqxravvapt10h max RSS was 1413 MiB -- try runtime_constraints "min_ram_mb_per_node":1945
</pre>

!62222dc72a51c18c15836796e91f3bc7.png!

One thing to point out here is "Keep cache utilization":http://doc.arvados.org/api/schema/Job.html, which you can influence with the keep_cache_mb_per_task runtime constraint. Here Keep cache utilization is 99.09% and the miss rate is only 0.10%, which means the cache is at a good point. Because utilization is almost at 100%, you can try increasing keep_cache_mb_per_task, but it may not yield significant gains.

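The miss rate can be recomputed from the keepcache rows of the FastQC report. This sketch assumes the reported rate is misses divided by total cache lookups (hits plus misses), which reproduces the 0.10% shown in the report; the function name is hypothetical.

```python
def keep_cache_miss_rate(hits, misses):
    """Return the Keep cache miss rate as a percentage of all lookups."""
    total = hits + misses
    if total == 0:
        return 0.0
    return 100.0 * misses / total


# keepcache hit and miss job totals from the FastQC report.
miss_rate = keep_cache_miss_rate(hits=2655806, misses=2633)
```
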
Also compare the CPU usage and Keep transfer rate graphs. Look for places where they mirror each other (one is high while the other is low), which is a sign of a CPU-bound or an I/O-bound job. For example, if Keep transfer is low but CPU usage is high, your job is highly dependent on CPU, which means you should consider upgrading to a node with more cores. If CPU usage is low and Keep transfer is high, you may want to increase keep_cache_mb_per_task so that the job can compute on more data.

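To put a number on how CPU-bound a job is, you can relate CPU time to wall-clock time. This sketch assumes "Overall CPU usage" is total CPU seconds (user+sys) divided by elapsed seconds, which matches the FastQC figures above; the function names are hypothetical.

```python
def overall_cpu_percent(cpu_seconds, elapsed_seconds):
    """CPU seconds used per wall-clock second, as a percentage (100% = one busy core)."""
    return 100.0 * cpu_seconds / elapsed_seconds


def core_utilization(cpu_seconds, elapsed_seconds, cores):
    """Average fraction of the node's cores that were busy during the job."""
    return cpu_seconds / (elapsed_seconds * cores)


# FastQC case study: 17954.54 CPU seconds over 3301 elapsed seconds on 8 cores.
fastqc_cpu = overall_cpu_percent(17954.54, 3301)
fastqc_util = core_utilization(17954.54, 3301, 8)
```

Under this reading, the FastQC job kept about two thirds of the node's 8 cores busy on average, which is one input to weigh alongside the graphs when deciding whether a bigger node is worthwhile.
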