Project

General

Profile

Pipeline Optimization » History » Revision 9

Revision 8 (Bryan Cosca, 04/15/2016 03:37 PM) → Revision 9/31 (Bryan Cosca, 04/15/2016 03:42 PM)

h1. Pipeline Optimization 

 h2. Crunchstat Summary 

 Crunchstat-summary is an arvados tool to help choose optimal configurations for arvados jobs and pipeline instances. It helps you choose "runtime_constraints":http://doc.arvados.org/api/schema/Job.html specified in the pipeline template under each job, as well as graph general statistics for the job, for example, CPU usage, RAM, and Keep network traffic across the duration of a job. 

 h3. How to install crunchstat-summary 

 <pre> 
 $ git clone https://github.com/curoverse/arvados.git 
 $ cd arvados/tools/crunchstat-summary/ 
 $ python setup.py build 
 $ python setup.py install --user 
 </pre> 

 h3. How to use crunchstat-summary 

 <pre> 
 $ ./bin/crunchstat-summary --help 
 usage: crunchstat-summary [-h] 
                           [--job UUID | --pipeline-instance UUID | --log-file LOG_FILE] 
                           [--skip-child-jobs] [--format {html,text}] 
                           [--verbose] 

 Summarize resource usage of an Arvados Crunch job 

 optional arguments: 
   -h, --help              show this help message and exit 
   --job UUID              Look up the specified job and read its log data from 
                         Keep (or from the Arvados event log, if the job is 
                         still running) 
   --pipeline-instance UUID 
                         Summarize each component of the given pipeline 
                         instance 
   --log-file LOG_FILE     Read log data from a regular file 
   --skip-child-jobs       Do not include stats from child jobs 
   --format {html,text}    Report format 
   --verbose, -v           Log more information (once for progress, twice for 
                         debug) 
 </pre> 

 Example job: bwa-aln + samtools -Sb 

 <pre> 
 category          metric    task_max          task_max_rate     job_total 
 blkio:202:0       read      310334464         -         913853440 
 blkio:202:0       write     2567127040        -         7693406208 
 blkio:202:16      read      8036201472        155118884.01      4538585088 
 blkio:202:16      write     55502038016       0         0 
 blkio:202:32      read      2756608 100760.59         6717440 
 blkio:202:32      write     53570560          0         99514368 
 cpu       cpus      8         -         - 
 cpu       sys       1592.34 1.17      805.32 
 cpu       user      11061.28          7.98      4620.17 
 cpu       user+sys          12653.62          8.00      5425.49 
 mem       cache     7454289920        -         - 
 mem       pgmajfault        1859      -         830 
 mem       rss       7965265920        -         - 
 mem       swap      5537792 -         - 
 net:docker0       rx        2023609029        -         2093089079 
 net:docker0       tx        21404100070       -         49909181906 
 net:docker0       tx+rx     23427709099       -         52002270985 
 net:eth0          rx        44750669842       67466325.07       14233805360 
 net:eth0          tx        2126085781        20171074.09       3670464917 
 net:eth0          tx+rx     46876755623       67673532.73       17904270277 
 time      elapsed 949       -         1899 
 # Number of tasks: 3 
 # Max CPU time spent by a single task: 12653.62s 
 # Max CPU usage in a single interval: 799.88% 
 # Overall CPU usage: 285.70% 
 # Max memory used by a single task: 7.97GB 
 # Max network traffic in a single task: 46.88GB 
 # Max network speed in a single interval: 67.67MB/s 
 # Keep cache miss rate 0.00% 
 # Keep cache utilization 0.00% 
 #!! qr1hi-8i9sb-bzn6hzttfu9cetv max CPU usage was 800% -- try runtime_constraints "min_cores_per_node":8 
 #!! qr1hi-8i9sb-bzn6hzttfu9cetv max RSS was 7597 MiB -- try runtime_constraints "min_ram_mb_per_node":7782 
 </pre> 

 !86538baca4ecef099d9fad76ad9c7180.png! 

 Here, you can You'll see on the distinct computation between bottom what the bwa-aln and the samtools step.  

 runtime_recommendations are. 

 --text mode 
 using node recommendations, keep cache size 

 --html mode 
 check if you're cpu/io bound 
 check if tasks are being weird, i.e. gatk queue case 

 !86538baca4ecef099d9fad76ad9c7180.png! 

 h3. When to pipe and when to write to keep 

 in general writing straight to keep will reap benefits. If you run crunchstat-summary --html and you see keep io stopping once in a while, then youre cpu bound. If you're seeing cpu level off and keep-read or keep-write taking too long, then you're io bound. 

 That being said, it's very safe for a job to write to a temporary directory then spending time to write the file to keep. On the other hand, writing straight to keep would save all the compute time of writing to keep. If you have time, it's worth trying both and seeing how much time you save by doing both. Most of the time, writing straight to keep using TaskOutputDir will be the right option, but using a tmpdir is always the safe alternative. 

 Choosing usually depends on how your tool works with an output directory. If its reading/writing from it a lot, then it might be worth using a tmpdir (SSD) rather than going through the network. If it's just treating the output directory as a space for stdout then using TaskOutputDir should work. 

 h3. choosing the right number of jobs 

 each job must output a collection, so if you don't want to output a file, then you should combine commands with each other. 

 h2. Job Optimization 

 h3. How to optimize the number of tasks when you don't have native multithreading 

 tools like gatk have native multithreading where you pass a -t. Here, you usually want to use that threading, and choose the min_cores_per_node. You can use any number of min_tasks_per_node making sure that your tool_threading*min_tasks_per_node is <= min_cores_per_node. 

 tools like varscan/freebayes blah blah don't have native multithreading so you need to find a workaround. generally, some tools have a -L --intervals to pass in certain loci to work on. If you have a bed file you can split reads on, then you can create a new task per interval. If  

 h3. piping between tools or writing to a tmpdir. 

 Creating pipes between tools has shown to sometimes be faster than writing/reading from disk. Feel free to pipe your tools together, for example using subprocess.PIPE in the "python subprocess module":https://docs.python.org/2/library/subprocess.html