Pipeline Optimization » History » Version 2

Bryan Cosca, 04/14/2016 03:08 PM

h1. Pipeline Optimization

h2. Crunchstat Summary

h3. How to install crunchstat-summary

h3. How to use crunchstat-summary

--text mode
Use the node recommendations, including the Keep cache size.

--html mode
Check whether you are CPU-bound or I/O-bound.
Check whether tasks are behaving unexpectedly, e.g. the GATK Queue case.

When to pipe and when to write to Keep
In general, writing straight to Keep will reap benefits. If you run crunchstat-summary --html and see Keep I/O stopping once in a while, then you are CPU-bound.

h3. How to optimize the number when you don't have native multithreading

Tools like GATK have native multithreading, where you pass a thread-count flag such as -t (GATK itself uses -nt and -nct).
Tools like VarScan and FreeBayes don't have native multithreading, so you need to find a workaround. Some tools have a -L or --intervals option for passing in the loci to work on. If you have a BED file you can split on, you can create a new task per interval.

example here
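A minimal sketch of the per-interval split described above, assuming the tool accepts a 1-based, inclusive region string of the form chrom:start-end. The function names (parse_bed, to_region, plan_tasks) are hypothetical and nothing here calls the Arvados SDK; it only shows how a BED file could be turned into one task description per interval. How each entry becomes a real task depends on your crunch script.

```python
def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines, skipping comments, headers and blanks."""
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#") or fields[0] in ("track", "browser"):
            continue
        yield fields[0], int(fields[1]), int(fields[2])


def to_region(chrom, start, end):
    """Convert a 0-based, half-open BED interval to a 1-based, inclusive region string."""
    return "{}:{}-{}".format(chrom, start + 1, end)


def plan_tasks(bed_lines):
    """Return one task description (a plain dict) per BED interval."""
    return [
        {"sequence": i, "region": to_region(chrom, start, end)}
        for i, (chrom, start, end) in enumerate(parse_bed(bed_lines))
    ]


if __name__ == "__main__":
    # Assumed example intervals; in practice these lines would come from your BED file.
    example = ["chr1\t0\t1000", "chr1\t1000\t2500", "chr2\t500\t900"]
    for task in plan_tasks(example):
        print(task["sequence"], task["region"])
```

Each region string can then be passed to the tool's interval option in its own task. If the BED file has many small intervals, grouping several of them into one task may be preferable to creating a task for each.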

h3. Piping between tools or writing to a tmpdir

h3. Choosing the right number of jobs

Each job must output a collection, so if you don't want to output a file, then