h1. Pipeline Optimization 

This wiki page is designed to help users make their Arvados pipelines cost and compute efficient for production-level data.

 h2. Crunchstat Summary 

Crunchstat-summary is an Arvados tool to help choose optimal configurations for Arvados jobs and pipeline instances. It helps you choose the "runtime_constraints":http://doc.arvados.org/api/schema/Job.html specified in the pipeline template under each job, and it graphs general statistics for the job, for example CPU usage, RAM, and Keep network traffic across the duration of a job.

 h3. How to install crunchstat-summary 

 <pre> 
 $ git clone https://github.com/curoverse/arvados.git 
 $ cd arvados/tools/crunchstat-summary/ 
 $ python setup.py build 
 $ python setup.py install --user 
 </pre> 

 h3. How to use crunchstat-summary 

 <pre> 
 $ ./bin/crunchstat-summary --help 
 usage: crunchstat-summary [-h] 
                           [--job UUID | --pipeline-instance UUID | --log-file LOG_FILE] 
                           [--skip-child-jobs] [--format {html,text}] 
                           [--verbose] 

 Summarize resource usage of an Arvados Crunch job 

 optional arguments: 
   -h, --help              show this help message and exit 
   --job UUID              Look up the specified job and read its log data from 
                         Keep (or from the Arvados event log, if the job is 
                         still running) 
   --pipeline-instance UUID 
                         Summarize each component of the given pipeline 
                         instance 
   --log-file LOG_FILE     Read log data from a regular file 
   --skip-child-jobs       Do not include stats from child jobs 
   --format {html,text}    Report format 
   --verbose, -v           Log more information (once for progress, twice for 
                         debug) 
 </pre> 

There are two ways of using crunchstat-summary: a plain-text summary that gives an overall view of a job, or an html page that graphs resource usage over time.
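
For example, assuming crunchstat-summary is on your PATH after the install above (the job UUID below is just a placeholder):

<pre>
$ crunchstat-summary --job qr1hi-8i9sb-xxxxxxxxxxxxxxx --format text
$ crunchstat-summary --job qr1hi-8i9sb-xxxxxxxxxxxxxxx --format html > summary.html
</pre>

The html report can then be opened in a browser.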

 Case study 1: A job that does bwa-aln mapping and converts to bam using samtools. 

 <pre> 
category       metric        task_max       task_max_rate    job_total
blkio:202:0    read          310334464      -                913853440
blkio:202:0    write         2567127040     -                7693406208
blkio:202:16   read          8036201472     155118884.01     4538585088
blkio:202:16   write         55502038016    0                0
blkio:202:32   read          2756608        100760.59        6717440
blkio:202:32   write         53570560       0                99514368
cpu            cpus          8              -                -
cpu            sys           1592.34        1.17             805.32
cpu            user          11061.28       7.98             4620.17
cpu            user+sys      12653.62       8.00             5425.49
mem            cache         7454289920     -                -
mem            pgmajfault    1859           -                830
mem            rss           7965265920     -                -
mem            swap          5537792        -                -
net:docker0    rx            2023609029     -                2093089079
net:docker0    tx            21404100070    -                49909181906
net:docker0    tx+rx         23427709099    -                52002270985
net:eth0       rx            44750669842    67466325.07      14233805360
net:eth0       tx            2126085781     20171074.09      3670464917
net:eth0       tx+rx         46876755623    67673532.73      17904270277
time           elapsed       949            -                1899
 # Number of tasks: 3 
 # Max CPU time spent by a single task: 12653.62s 
 # Max CPU usage in a single interval: 799.88% 
 # Overall CPU usage: 285.70% 
 # Max memory used by a single task: 7.97GB 
 # Max network traffic in a single task: 46.88GB 
 # Max network speed in a single interval: 67.67MB/s 
 # Keep cache miss rate 0.00% 
 # Keep cache utilization 0.00% 
 #!! qr1hi-8i9sb-bzn6hzttfu9cetv max CPU usage was 800% -- try runtime_constraints "min_cores_per_node":8 
 #!! qr1hi-8i9sb-bzn6hzttfu9cetv max RSS was 7597 MiB -- try runtime_constraints "min_ram_mb_per_node":7782 
 </pre> 

 !86538baca4ecef099d9fad76ad9c7180.png! 

Here you can see the distinct computation steps of the bwa-aln and samtools stages. There is a noticeable plateau in CPU usage for both computations, so it could be worth running the job on a bigger node, for example a 16 core node, to see whether the computation really plateaus at 8 CPUs or can scale higher.

Note the runtime_constraints recommendations at the bottom of the report. Setting these in the pipeline template ensures the job requests the right node type and runs reliably when it is reproduced.
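
For example, those recommendations map onto the job's entry in the pipeline template like this (only the runtime_constraints block is shown; the values come straight from the report above):

<pre>
"runtime_constraints": {
    "min_cores_per_node": 8,
    "min_ram_mb_per_node": 7782
}
</pre>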

 Case study 2: FastQC 

 <pre> 
category     metric        task_max         task_max_rate    job_total
blkio:0:0    read          174349211138     65352499.20      174349211138
blkio:0:0    write         0                0                0
cpu          cpus          8                -                -
cpu          sys           364.95           0.17             364.95
cpu          user          17589.59         6.59             17589.59
cpu          user+sys      17954.54         6.72             17954.54
fuseops      read          1330241          498.40           1330241
fuseops      write         0                0                0
keepcache    hit           2655806          1038.00          2655806
keepcache    miss          2633             1.60             2633
keepcalls    get           2658439          1039.00          2658439
keepcalls    put           0                0                0
mem          cache         19836608512      -                -
mem          pgmajfault    19               -                19
mem          rss           1481367552       -                -
net:eth0     rx            178321           17798.40         178321
net:eth0     tx             7156            685.00           7156
net:eth0     tx+rx         185477           18483.40         185477
net:keep0    rx            175959092914     107337311.20     175959092914
net:keep0    tx            0                0                0
net:keep0    tx+rx         175959092914     107337311.20     175959092914
time         elapsed       3301             -                3301
 # Number of tasks: 1 
 # Max CPU time spent by a single task: 17954.54s 
 # Max CPU usage in a single interval: 672.01% 
 # Overall CPU usage: 543.91% 
 # Max memory used by a single task: 1.48GB 
 # Max network traffic in a single task: 175.96GB 
 # Max network speed in a single interval: 107.36MB/s 
 # Keep cache miss rate 0.10% 
 # Keep cache utilization 99.09% 
 #!! qr1hi-8i9sb-nxqqxravvapt10h max CPU usage was 673% -- try runtime_constraints "min_cores_per_node":7 
 #!! qr1hi-8i9sb-nxqqxravvapt10h max RSS was 1413 MiB -- try runtime_constraints "min_ram_mb_per_node":1945 
 </pre> 

 !62222dc72a51c18c15836796e91f3bc7.png! 

One thing to point out here is the "Keep cache utilization":http://doc.arvados.org/api/schema/Job.html, which can be tuned with the 'keep_cache_mb_per_task' runtime constraint. Here Keep cache utilization is 99.09%, which is already a good point; you could try increasing the cache since utilization is close to 100%, but it may not yield significant gains.
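
Like the other constraints, keep_cache_mb_per_task goes under runtime_constraints in the pipeline template; for example (the 512 here is only an illustrative value, not a recommendation):

<pre>
"runtime_constraints": {
    "keep_cache_mb_per_task": 512
}
</pre>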

Another thing to look at is the CPU usage and Keep transfer rate graphs, and whether they mirror each other. If the Keep transfer rate closely tracks CPU usage, the job is healthy and not obviously CPU or I/O bound. If Keep transfer is low while CPU usage is high (in the html report, Keep I/O pauses now and then while the CPU stays busy), the job is CPU bound, and upgrading to a node with more cores may help. If CPU usage is sporadic while Keep transfers take a long time, the job is I/O bound, and increasing keep_cache_mb_per_task may let it compute over more data at once.

 h2. Job Optimization 

h3. Writing to Keep

There are two ways to write your output collection to Keep: writing straight to Keep with arvados.crunch.TaskOutputDir(), or staging files in a temporary directory and then uploading them to Keep afterwards.

In general, writing straight to Keep will reap more benefits: TaskOutputDir acts like a pipe, so you never have to spend node time uploading data. One problem, though, is if your job depends on using its output directory as temporary space for files. If your job computes against files in its output directory, it will be computing over the network and could become very slow.

That being said, it is very safe for a job to write to a temporary directory and then spend time writing the files to Keep, while writing straight to Keep saves all of that upload time. If you have time, it's worth trying both and seeing how much time you save. Most of the time, writing straight to Keep with TaskOutputDir is the right option, but using a temporary directory is always the safe alternative.

Which approach to use usually depends on how your tool works with its output directory. If it reads from and writes to it a lot, then it might be worth using a temporary directory on the node's SSD rather than going through the network. If it just treats the output directory as a space for its final output, then TaskOutputDir should work just fine.
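
As a minimal sketch of the two approaches inside a crunch script (assuming the Python SDK's arvados.crunch.TaskOutputDir, CollectionWriter, and current_task interfaces; the tool and file names are illustrative, and a real script would use only one of the two options):

<pre>
import subprocess
import tempfile

import arvados
import arvados.crunch

this_task = arvados.current_task()

# Option 1: write straight to Keep. TaskOutputDir is a Keep-backed directory,
# so data is uploaded as the tool writes it and there is no separate upload step.
out = arvados.crunch.TaskOutputDir()
with open(out.path + '/output.bam', 'w') as bam:
    subprocess.check_call(['samtools', 'view', '-b', 'input.sam'], stdout=bam)
this_task.set_output(out.manifest_text())

# Option 2: stage output in a scratch directory, then upload to Keep afterwards.
tmpdir = tempfile.mkdtemp()
with open(tmpdir + '/output.bam', 'w') as bam:
    subprocess.check_call(['samtools', 'view', '-b', 'input.sam'], stdout=bam)
writer = arvados.CollectionWriter()
writer.write_directory_tree(tmpdir)
this_task.set_output(writer.finish())
</pre>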

h3. Choosing the right number of jobs

How many jobs to use depends on how versatile you want your pipeline to be. Do you want to output all of your intermediate files to Keep? How many checkpoints do you want your pipeline to have?

Each job must output a collection of files, so if you don't want to output the files a command produces, combine that command with others in the same job. You always choose what gets uploaded to Keep, so if you won't need a file later on, it's best to leave it on the compute node.

If you want a lot of checkpoints, you should have a job for each command: you'll be able to resume or restart work easily after any unexpected interruption, at the cost of writing more intermediate outputs to Keep. Having more jobs also lets you choose a different node type for each command. For example, BWA-mem scales a lot better than fastqc or varscan, so using a 16 core node for a tool that doesn't have native multi-threading would be wasteful.

h3. Choosing the right number of tasks

max_tasks_per_node lets you choose how many tasks to run on one machine. If you have a lot of small tasks that each use 1 core and 1 GB of RAM, you can put several of them on a bigger machine, for example 8 such tasks on an 8 core machine. To utilize machines better for cost savings, use crunchstat-summary to find the maximum memory and CPU usage of one task, and see whether more than one of those fits on a machine. One warning, however: if you do run out of RAM (some compute nodes can't swap), your process will die with an extraneous error; sometimes the error is obvious, sometimes it's a red herring.
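
As a sketch, the eight-small-tasks example above might be expressed like this (the RAM value is illustrative; size it from what crunchstat-summary reports for one task, times the number of tasks per node):

<pre>
"runtime_constraints": {
    "min_cores_per_node": 8,
    "min_ram_mb_per_node": 8192,
    "max_tasks_per_node": 8
}
</pre>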

 h3. How to optimize the number of tasks when you don't have native multithreading 

Tools like GATK have native multi-threading: if you pass a -t, they will use that many cores on the node. You usually want to take advantage of this and choose a min_cores_per_node that equals your threading parameter. You can run any number of tasks per node, as long as tool_threading * max_tasks_per_node is <= min_cores_per_node and the node has enough RAM to allocate to all of the tasks.
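
For example (the 4-thread setting is illustrative): if each task runs its tool with 4 threads, allowing at most two tasks per node keeps 4 * 2 <= 8 cores:

<pre>
"runtime_constraints": {
    "min_cores_per_node": 8,
    "max_tasks_per_node": 2
}
</pre>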

Tools like varscan or freebayes don't have native multi-threading, so you need to find a workaround. Generally these tools have a -L/--intervals option to pass in specific loci to work on. If you have a bed file you can split on, you can create a new task per interval, and then have a later job merge the outputs together.
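
A rough sketch of that pattern with the Python SDK's job_tasks API (the interval list and parameter name are illustrative, and the merging step is omitted):

<pre>
import arvados

this_task = arvados.current_task()

if this_task['sequence'] == 0:
    # Sequence-0 task: queue one child task per interval, e.g. parsed from a bed file.
    intervals = ['chr1:1-248956422', 'chr2:1-242193529']  # illustrative
    for interval in intervals:
        arvados.api('v1').job_tasks().create(body={
            'job_uuid': arvados.current_job()['uuid'],
            'created_by_job_task_uuid': this_task['uuid'],
            'sequence': 1,
            'parameters': {'interval': interval},
        }).execute()
    this_task.set_output(None)
else:
    # Sequence-1 task: run the tool on just this interval.
    interval = this_task['parameters']['interval']
    # ... run varscan/freebayes with -L/--intervals restricted to `interval` ...
</pre>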

h3. Piping between tools or writing to a temporary directory

Creating pipes between tools can sometimes be faster than writing to and reading from disk. Feel free to pipe your tools together, for example using subprocess.PIPE in the "python subprocess module":https://docs.python.org/2/library/subprocess.html. Sometimes piping is faster, sometimes it's not; you'll have to try it for yourself.
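
For example, a minimal sketch of piping bwa output straight into samtools (the command lines are illustrative):

<pre>
import subprocess

# bwa writes SAM to stdout; samtools reads it on stdin and writes BAM,
# so nothing is staged on disk between the two tools.
bam = open('output.bam', 'w')
bwa = subprocess.Popen(['bwa', 'mem', 'ref.fa', 'reads.fq'],
                       stdout=subprocess.PIPE)
samtools = subprocess.Popen(['samtools', 'view', '-b', '-'],
                            stdin=bwa.stdout, stdout=bam)
bwa.stdout.close()   # let bwa receive SIGPIPE if samtools exits early
samtools.wait()
bwa.wait()
bam.close()
</pre>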