Pipeline Optimization » History » Revision 12
« Previous |
Revision 12/31
(diff)
| Next »
Bryan Cosca, 04/15/2016 08:01 PM
Pipeline Optimization¶
This wiki page is designed to help users make their pipelines cost and compute efficient for production level data.
Crunchstat Summary¶
Crunchstat-summary is an arvados tool to help choose optimal configurations for arvados jobs and pipeline instances. It helps you choose runtime_constraints specified in the pipeline template under each job, as well as graph general statistics for the job, for example, CPU usage, RAM, and Keep network traffic across the duration of a job.
How to install crunchstat-summary¶
$ git clone https://github.com/curoverse/arvados.git $ cd arvados/tools/crunchstat-summary/ $ python setup.py build $ python setup.py install --user
How to use crunchstat-summary¶
$ ./bin/crunchstat-summary --help usage: crunchstat-summary [-h] [--job UUID | --pipeline-instance UUID | --log-file LOG_FILE] [--skip-child-jobs] [--format {html,text}] [--verbose] Summarize resource usage of an Arvados Crunch job optional arguments: -h, --help show this help message and exit --job UUID Look up the specified job and read its log data from Keep (or from the Arvados event log, if the job is still running) --pipeline-instance UUID Summarize each component of the given pipeline instance --log-file LOG_FILE Read log data from a regular file --skip-child-jobs Do not include stats from child jobs --format {html,text} Report format --verbose, -v Log more information (once for progress, twice for debug)
Case study 1: A job that does bwa-aln mapping and converts to bam using samtools.
category metric task_max task_max_rate job_total blkio:202:0 read 310334464 - 913853440 blkio:202:0 write 2567127040 - 7693406208 blkio:202:16 read 8036201472 155118884.01 4538585088 blkio:202:16 write 55502038016 0 0 blkio:202:32 read 2756608 100760.59 6717440 blkio:202:32 write 53570560 0 99514368 cpu cpus 8 - - cpu sys 1592.34 1.17 805.32 cpu user 11061.28 7.98 4620.17 cpu user+sys 12653.62 8.00 5425.49 mem cache 7454289920 - - mem pgmajfault 1859 - 830 mem rss 7965265920 - - mem swap 5537792 - - net:docker0 rx 2023609029 - 2093089079 net:docker0 tx 21404100070 - 49909181906 net:docker0 tx+rx 23427709099 - 52002270985 net:eth0 rx 44750669842 67466325.07 14233805360 net:eth0 tx 2126085781 20171074.09 3670464917 net:eth0 tx+rx 46876755623 67673532.73 17904270277 time elapsed 949 - 1899 # Number of tasks: 3 # Max CPU time spent by a single task: 12653.62s # Max CPU usage in a single interval: 799.88% # Overall CPU usage: 285.70% # Max memory used by a single task: 7.97GB # Max network traffic in a single task: 46.88GB # Max network speed in a single interval: 67.67MB/s # Keep cache miss rate 0.00% # Keep cache utilization 0.00% #!! qr1hi-8i9sb-bzn6hzttfu9cetv max CPU usage was 800% -- try runtime_constraints "min_cores_per_node":8 #!! qr1hi-8i9sb-bzn6hzttfu9cetv max RSS was 7597 MiB -- try runtime_constraints "min_ram_mb_per_node":7782
Here, you can see the distinct computation between the bwa-aln and the samtools step. There is a plateau on CPU, so it could be worth it to try upgrading to a bigger node. For example, a 16 core node to see if the plateau is actually at 8 cpus or if it can scale higher.
Case study 2: FastQC
category metric task_max task_max_rate job_total blkio:0:0 read 174349211138 65352499.20 174349211138 blkio:0:0 write 0 0 0 cpu cpus 8 - - cpu sys 364.95 0.17 364.95 cpu user 17589.59 6.59 17589.59 cpu user+sys 17954.54 6.72 17954.54 fuseops read 1330241 498.40 1330241 fuseops write 0 0 0 keepcache hit 2655806 1038.00 2655806 keepcache miss 2633 1.60 2633 keepcalls get 2658439 1039.00 2658439 keepcalls put 0 0 0 mem cache 19836608512 - - mem pgmajfault 19 - 19 mem rss 1481367552 - - net:eth0 rx 178321 17798.40 178321 net:eth0 tx 7156 685.00 7156 net:eth0 tx+rx 185477 18483.40 185477 net:keep0 rx 175959092914 107337311.20 175959092914 net:keep0 tx 0 0 0 net:keep0 tx+rx 175959092914 107337311.20 175959092914 time elapsed 3301 - 3301 # Number of tasks: 1 # Max CPU time spent by a single task: 17954.54s # Max CPU usage in a single interval: 672.01% # Overall CPU usage: 543.91% # Max memory used by a single task: 1.48GB # Max network traffic in a single task: 175.96GB # Max network speed in a single interval: 107.36MB/s # Keep cache miss rate 0.10% # Keep cache utilization 99.09% #!! qr1hi-8i9sb-nxqqxravvapt10h max CPU usage was 673% -- try runtime_constraints "min_cores_per_node":7 #!! qr1hi-8i9sb-nxqqxravvapt10h max RSS was 1413 MiB -- try runtime_constraints "min_ram_mb_per_node":1945
Job Optimization¶
When to write straight to keep vs staging a file in a temporary directory and uploading after.¶
In general writing straight to keep will reap benefits. If you run crunchstat-summary --html and you see keep io stopping once in a while, then youre cpu bound. If you're seeing cpu level off and keep-read or keep-write taking too long, then you're io bound.
That being said, it's very safe for a job to write to a temporary directory then spending time to write the file to keep. On the other hand, writing straight to keep would save all the compute time of writing to keep. If you have time, it's worth trying both and seeing how much time you save by doing both. Most of the time, writing straight to keep using TaskOutputDir will be the right option, but using a tmpdir is always the safe alternative.
Choosing usually depends on how your tool works with an output directory. If its reading/writing from it a lot, then it might be worth using a temporary directory on SSD rather than going through the network. If it's just treating the output directory as a space for stdout then using TaskOutputDir should work just fine.
choosing the right number of jobs¶
Each job must output a collection, so if you don't want to output a file, then you should combine commands with each other. If you want a lot of 'checkpoints' you should have a job for each command. But the downside is more outputs. One upside to having more jobs is that you can choose nodetypes for each command. For example, BWA-mem can scale a lot better than fastqc or varscan, so having a 16 core node for something that doesn't have native multithreading would be wasteful.
choosing the right number of tasks¶
max_tasks_per_node allows you to choose how many tasks you would like to run on a machine. For example, if you have a lot of small tasks that use 1 core/1GB ram, you can put multiple of those on a bigger machine. For example, 8 tasks on an 8 core machine. If you want to utilize machines better for cost savings, you should use crunchstat-summary to find out the maximum memory/cpu usage for one task, and see if you can fit more than 1 of those on a machine. One warning, however is if you do run out of RAM (some compute nodes can't swap) your process will die with an extraneous error. Sometimes the error is obvious, sometimes its a red herring.
How to optimize the number of tasks when you don't have native multithreading¶
tools like gatk have native multithreading where you pass a -t. Here, you usually want to use that threading, and choose the min_cores_per_node. You can use any number of min_tasks_per_node making sure that your tool_threading*min_tasks_per_node is <= min_cores_per_node.
tools like varscan/freebayes don't have native multithreading so you need to find a workaround. Generally, these tools have a -L/--intervals to pass in certain loci to work on. If you have a bed file you can split reads on, then you can create a new task per interval. Then, have a job merge the outputs together.
piping between tools or writing to a tmpdir.¶
Creating pipes between tools has shown to sometimes be faster than writing/reading from disk. Feel free to pipe your tools together, for example using subprocess.PIPE in the python subprocess module. Sometimes piping is faster, sometimes it's not. You'll have to try for yourself.
Updated by Bryan Cosca over 8 years ago · 31 revisions