Bug #10359
Closed
[crunchstat-summary] Limit concurrency to keep memory use under control
Description
Currently crunchstat-summary processes all components of a pipeline in parallel. This can mean hundreds of threads all competing for memory and cycles at the same time, leading to memory exhaustion in extreme cases.
We should dial this back to a reasonable number of threads for the machine and workload being processed.
Updated by Tom Morris over 8 years ago
- Assigned To set to Tom Morris
- Target version set to 2016-11-09 sprint
Updated by Tom Morris over 8 years ago
- Status changed from New to In Progress
- Target version changed from 2016-11-09 sprint to 2016-11-23 sprint
Updated by Tom Morris about 8 years ago
- Target version changed from 2016-11-23 sprint to 2016-12-14 sprint
Updated by Tom Morris about 8 years ago
- Target version changed from 2016-12-14 sprint to 2017-01-04 sprint
Updated by Tom Morris about 8 years ago
- Target version changed from 2017-01-04 sprint to 2017-01-18 sprint
Updated by Peter Amstutz about 8 years ago
$ crunchstat-summary --format html --job 962eh-8i9sb-vrfiobkau7bilws > blah.html
Traceback (most recent call last):
  File "/home/peter/work/scripts/venv/bin/crunchstat-summary", line 4, in <module>
    __import__('pkg_resources').run_script('crunchstat-summary==0.1.20170105025304', 'crunchstat-summary')
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1494, in run_script
    exec(code, namespace, namespace)
  File "/home/peter/work/scripts/venv/lib/python2.7/site-packages/crunchstat_summary-0.1.20170105025304-py2.7.egg/EGG-INFO/scripts/crunchstat-summary", line 15, in <module>
    for r in cmd.report():
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/crunchstat_summary-0.1.20170105025304-py2.7.egg/crunchstat_summary/command.py", line 65, in report
    yield self.summer.html_header()
AttributeError: 'JobSummarizer' object has no attribute 'html_header'
$ crunchstat-summary --format text --job 962eh-8i9sb-vrfiobkau7bilws > blah.html
Traceback (most recent call last):
  File "/home/peter/work/scripts/venv/bin/crunchstat-summary", line 4, in <module>
    __import__('pkg_resources').run_script('crunchstat-summary==0.1.20170105025304', 'crunchstat-summary')
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1494, in run_script
    exec(code, namespace, namespace)
  File "/home/peter/work/scripts/venv/lib/python2.7/site-packages/crunchstat_summary-0.1.20170105025304-py2.7.egg/EGG-INFO/scripts/crunchstat-summary", line 15, in <module>
    for r in cmd.report():
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/crunchstat_summary-0.1.20170105025304-py2.7.egg/crunchstat_summary/command.py", line 60, in report
    yield self.summer.text_header()
AttributeError: 'JobSummarizer' object has no attribute 'text_header'
This fits the story description only if we define a "reasonable number of threads" as N=1. Parallel processing with a thread pool would be better: the reason for having threads in the first place is that going through hundreds of jobs serially (at ~5 seconds per job) means crunchstat-summary takes 10 minutes or more to analyze a large workflow.
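For illustration only, a minimal sketch of the thread-pool idea, not code from crunchstat-summary. The names summarize_with_pool, summarize and max_workers are hypothetical; summarize stands in for whatever per-job work the tool does. It uses Python 3's standard concurrent.futures module, whereas the traceback above shows the tool running on Python 2.7.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_with_pool(jobs, summarize, max_workers=8):
    """Yield summarize(job) for each job, in input order.

    `summarize` is a hypothetical stand-in for the per-job work.
    At most `max_workers` jobs are processed at the same time.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order and re-raises a worker's
        # exception when its result is retrieved.
        for result in pool.map(summarize, jobs):
            yield result
```

With max_workers=1 this degenerates to serial processing, so the cap could be tuned to the machine without changing the calling code.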
Updated by Peter Amstutz about 8 years ago
An easy solution might be something like:
- Take the next N jobs
- Spin them out to N threads, wait for all of them to complete (basically the existing logic)
- yield N results
- repeat until everything is processed
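The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the tool's actual code: summarize_in_batches, summarize and batch_size are hypothetical names, and summarize stands in for the existing per-job logic.

```python
import threading


def _worker(results, index, summarize, job):
    # Store the result in the slot matching the job's position in the batch.
    results[index] = summarize(job)


def summarize_in_batches(jobs, summarize, batch_size=4):
    """Yield summarize(job) for each job, N jobs at a time (N = batch_size).

    Note: an exception raised inside a thread is not propagated here;
    the corresponding result is left as None.
    """
    jobs = list(jobs)
    for start in range(0, len(jobs), batch_size):
        batch = jobs[start:start + batch_size]      # take the next N jobs
        results = [None] * len(batch)
        threads = [
            threading.Thread(target=_worker, args=(results, i, summarize, job))
            for i, job in enumerate(batch)
        ]
        for t in threads:                           # spin them out to N threads
            t.start()
        for t in threads:                           # wait for all to complete
            t.join()
        for r in results:                           # yield N results
            yield r
```

One trade-off of this shape is that each batch waits for its slowest job before the next batch starts, so a pool that refills as workers finish would keep the threads busier.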
Updated by Tom Morris about 8 years ago
Thanks for the quick review. I'll look at the job failure, but the cluster you used isn't familiar and doesn't seem to be resolvable via *.arvadosapi.com. Where is it? I was mostly focused on pipeline instances, so it wouldn't surprise me if there were issues specific to jobs (although any bugs are likely to be in the other branch that this one depends on).
As for performance, reports for a pipeline with 370 jobs that runs 3 days and uses thousands of core hours take 11.5 minutes for text and 13.8 minutes for html, which is acceptable to me.
I have a branch with a capped number of threads, but decided the complexity wasn't warranted.
Updated by Tom Morris about 8 years ago
- Target version changed from 2017-01-18 sprint to 2017-02-01 sprint
Updated by Tom Morris about 8 years ago
- Target version changed from 2017-02-01 sprint to 2017-02-15 sprint
Updated by Tom Morris almost 8 years ago
- Target version changed from 2017-02-15 sprint to 2017-03-01 sprint
Updated by Tom Morris almost 8 years ago
- Target version changed from 2017-03-01 sprint to 2017-03-15 sprint
Updated by Radhika Chippada almost 8 years ago
- Target version changed from 2017-03-15 sprint to 2017-03-29 sprint
Updated by Tom Morris almost 8 years ago
- Target version changed from 2017-03-29 sprint to 2017-04-12 sprint
Updated by Tom Morris almost 8 years ago
- Target version changed from 2017-04-12 sprint to 2017-04-26 sprint
Updated by Tom Morris almost 8 years ago
- Target version changed from 2017-04-26 sprint to 2017-05-10 sprint
Updated by Tom Morris almost 8 years ago
- Target version changed from 2017-05-10 sprint to 2017-05-24 sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-05-24 sprint to 2017-06-07 sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-06-07 sprint to 2017-06-21 sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-06-21 sprint to 2017-07-05 sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-07-05 sprint to 2017-07-19 sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-07-19 sprint to 2017-08-02 sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-08-02 sprint to 2017-08-16 sprint
Updated by Tom Clegg over 7 years ago
- Subject changed from Reduce amount of parallelism in crunchstat-summary to [crunchstat-summary] Limit concurrency to keep memory use under control
- Assigned To changed from Tom Morris to Tom Clegg
- Target version changed from 2017-08-16 sprint to 2017-08-30 Sprint
- Story points changed from 0.5 to 0.0
Updated by Tom Clegg over 7 years ago
- Status changed from In Progress to Resolved
- Story points deleted (0.0)