Bug #10359

[crunchstat-summary] Limit concurrency to keep memory use under control

Added by Tom Morris over 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
10/26/2016
Due date:
% Done:
100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

Currently crunchstat-summary processes all components of a pipeline in parallel. This can mean hundreds of threads all competing for memory and cycles at the same time, leading to memory exhaustion in extreme cases.

We should dial this back to a reasonable number of threads for the machine and workload being processed.


Subtasks

Task #10379: Review 10359-crunchstat-summary-serial (Closed, Tom Morris)


Related issues

Related to Arvados - Story #11309: [Crunch2] crunchstat-summary --container UUID should summarize container logs (Resolved, 08/16/2017)

Related to Arvados - Bug #12196: [crunchstat-summary] avoid opening too many files at once when working on a large container tree (Resolved, 08/30/2017)

History

#1 Updated by Tom Morris over 4 years ago

  • Assigned To set to Tom Morris
  • Target version set to 2016-11-09 sprint

#2 Updated by Tom Morris over 4 years ago

  • Status changed from New to In Progress
  • Target version changed from 2016-11-09 sprint to 2016-11-23 sprint

#3 Updated by Tom Morris over 4 years ago

  • Story points set to 0.5

#4 Updated by Tom Morris over 4 years ago

  • Target version changed from 2016-11-23 sprint to 2016-12-14 sprint

#5 Updated by Tom Morris about 4 years ago

  • Target version changed from 2016-12-14 sprint to 2017-01-04 sprint

#6 Updated by Tom Morris about 4 years ago

  • Target version changed from 2017-01-04 sprint to 2017-01-18 sprint

#7 Updated by Peter Amstutz about 4 years ago

$ crunchstat-summary --format html --job 962eh-8i9sb-vrfiobkau7bilws > blah.html
Traceback (most recent call last):
  File "/home/peter/work/scripts/venv/bin/crunchstat-summary", line 4, in <module>
    __import__('pkg_resources').run_script('crunchstat-summary==0.1.20170105025304', 'crunchstat-summary')
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1494, in run_script
    exec(code, namespace, namespace)
  File "/home/peter/work/scripts/venv/lib/python2.7/site-packages/crunchstat_summary-0.1.20170105025304-py2.7.egg/EGG-INFO/scripts/crunchstat-summary", line 15, in <module>
    for r in cmd.report():
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/crunchstat_summary-0.1.20170105025304-py2.7.egg/crunchstat_summary/command.py", line 65, in report
    yield self.summer.html_header()
AttributeError: 'JobSummarizer' object has no attribute 'html_header'
$ crunchstat-summary --format text --job 962eh-8i9sb-vrfiobkau7bilws > blah.html
Traceback (most recent call last):
  File "/home/peter/work/scripts/venv/bin/crunchstat-summary", line 4, in <module>
    __import__('pkg_resources').run_script('crunchstat-summary==0.1.20170105025304', 'crunchstat-summary')
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1494, in run_script
    exec(code, namespace, namespace)
  File "/home/peter/work/scripts/venv/lib/python2.7/site-packages/crunchstat_summary-0.1.20170105025304-py2.7.egg/EGG-INFO/scripts/crunchstat-summary", line 15, in <module>
    for r in cmd.report():
  File "/home/peter/work/scripts/venv/local/lib/python2.7/site-packages/crunchstat_summary-0.1.20170105025304-py2.7.egg/crunchstat_summary/command.py", line 60, in report
    yield self.summer.text_header()
AttributeError: 'JobSummarizer' object has no attribute 'text_header'

This fits the story description only if we define a "reasonable number of threads" as N=1. Parallel processing with a thread pool would be better: the reason for having threads in the first place is that going through hundreds of jobs serially (at ~5 seconds per job) means crunchstat-summary takes 10 minutes or more to analyze a large workflow.
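For illustration, a bounded thread pool could look roughly like the sketch below. This is not the crunchstat-summary code: `summarize_all`, `summarize`, and `max_workers` are hypothetical names, the default cap of 8 is an arbitrary assumption, and the sketch uses Python 3's `concurrent.futures` although the traceback above shows Python 2.7.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def summarize_all(jobs, summarize, max_workers=None):
    """Yield summarize(job) for each job, in input order.

    At most max_workers calls to summarize() run at the same time.
    Both names are placeholders for whatever does the per-job work.
    """
    if max_workers is None:
        # Assumed default: scale with the machine, but cap at 8 threads.
        max_workers = min(8, os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() queues every job up front, but only max_workers of them
        # execute concurrently; results come back in input order.
        for result in pool.map(summarize, jobs):
            yield result
```

Note that finished results wait in memory until the caller consumes them, so this caps the number of concurrent workers rather than total memory use.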

#8 Updated by Peter Amstutz about 4 years ago

An easy solution might be something like:

  1. Take the next N jobs
  2. Spin them out to N threads, wait for all of them to complete (basically the existing logic)
  3. yield N results
  4. repeat until everything is processed
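The steps above might be sketched as follows. This is only an illustration under assumptions: `summarize_in_batches`, `summarize`, and `batch_size` are hypothetical names rather than anything in crunchstat-summary, and the sketch is written for Python 3.

```python
import threading
from itertools import islice


def summarize_in_batches(jobs, summarize, batch_size=4):
    """Process jobs N at a time, yielding results in input order."""
    jobs_iter = iter(jobs)
    while True:
        # 1. Take the next N jobs.
        batch = list(islice(jobs_iter, batch_size))
        if not batch:
            return  # 4. Everything has been processed.

        results = [None] * len(batch)

        def worker(index, job, out):
            # An exception here would leave None in the slot; real code
            # would need to capture and re-raise it.
            out[index] = summarize(job)

        # 2. Spin them out to N threads and wait for all to complete.
        threads = [
            threading.Thread(target=worker, args=(i, job, results))
            for i, job in enumerate(batch)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # 3. Yield N results.
        for result in results:
            yield result
```

One trade-off of this shape is that each batch waits for its slowest job before the next batch starts, so some threads sit idle near the end of every batch.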

#9 Updated by Tom Morris about 4 years ago

Thanks for the quick review. I'll look at the job failure, but the cluster you used isn't familiar and doesn't seem to be resolvable via *.arvadosapi.com. Where is it? I was mostly focused on pipeline instances, so it wouldn't surprise me if there were issues specific to jobs (although any bugs are likely to be in the other branch that this one depends on).

As for performance, reports for a pipeline with 370 jobs that runs 3 days and uses thousands of core hours take 11.5 minutes for text and 13.8 minutes for html, which is acceptable to me.

I have a branch with a capped number of threads, but decided the complexity wasn't warranted.

#10 Updated by Tom Morris about 4 years ago

  • Target version changed from 2017-01-18 sprint to 2017-02-01 sprint

#11 Updated by Tom Morris about 4 years ago

  • Target version changed from 2017-02-01 sprint to 2017-02-15 sprint

#12 Updated by Tom Morris about 4 years ago

  • Target version changed from 2017-02-15 sprint to 2017-03-01 sprint

#13 Updated by Tom Morris about 4 years ago

  • Target version changed from 2017-03-01 sprint to 2017-03-15 sprint

#14 Updated by Radhika Chippada almost 4 years ago

  • Target version changed from 2017-03-15 sprint to 2017-03-29 sprint

#15 Updated by Tom Morris almost 4 years ago

  • Target version changed from 2017-03-29 sprint to 2017-04-12 sprint

#16 Updated by Tom Morris almost 4 years ago

  • Target version changed from 2017-04-12 sprint to 2017-04-26 sprint

#17 Updated by Tom Morris almost 4 years ago

  • Target version changed from 2017-04-26 sprint to 2017-05-10 sprint

#18 Updated by Tom Morris almost 4 years ago

  • Target version changed from 2017-05-10 sprint to 2017-05-24 sprint

#19 Updated by Tom Morris almost 4 years ago

  • Target version changed from 2017-05-24 sprint to 2017-06-07 sprint

#20 Updated by Tom Morris over 3 years ago

  • Target version changed from 2017-06-07 sprint to 2017-06-21 sprint

#21 Updated by Tom Morris over 3 years ago

  • Target version changed from 2017-06-21 sprint to 2017-07-05 sprint

#22 Updated by Tom Morris over 3 years ago

  • Target version changed from 2017-07-05 sprint to 2017-07-19 sprint

#23 Updated by Tom Morris over 3 years ago

  • Target version changed from 2017-07-19 sprint to 2017-08-02 sprint

#24 Updated by Tom Morris over 3 years ago

  • Target version changed from 2017-08-02 sprint to 2017-08-16 sprint

#25 Updated by Tom Clegg over 3 years ago

  • Subject changed from Reduce amount of parallelism in crunchstat-summary to [crunchstat-summary] Limit concurrency to keep memory use under control
  • Assigned To changed from Tom Morris to Tom Clegg
  • Target version changed from 2017-08-16 sprint to 2017-08-30 Sprint
  • Story points changed from 0.5 to 0.0

#26 Updated by Tom Clegg over 3 years ago

  • Status changed from In Progress to Resolved
  • Story points deleted (0.0)
