Feature #19986 (Closed)
crunch-run tracks maximum usage of each crunchstat metric
Description
- record high-water mark for each crunchstat metric
- record high-water mark for memory usage at a higher rate (e.g. every second instead of every 10 seconds)
- report 1st time memory usage goes over 90%, 95% and 99%
- print out report of high water mark to crunch-run.txt when container ends in a form that is easy to parse
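As a rough illustration of the first bullet: a high-water mark is just the largest value seen so far for each metric. A minimal sketch in Go, with made-up names rather than the actual crunchstat code:

// Hypothetical illustration only -- not the actual crunchstat implementation.
package main

import "fmt"

// maxTracker remembers the highest value seen for each metric.
type maxTracker struct {
    max map[string]int64
}

// Sample records a reading and updates the high-water mark if it is higher.
func (t *maxTracker) Sample(stat string, value int64) {
    if value > t.max[stat] {
        t.max[stat] = value
    }
}

func main() {
    t := &maxTracker{max: map[string]int64{}}
    t.Sample("mem rss", 1024)
    t.Sample("mem rss", 4096)
    t.Sample("mem rss", 2048)
    fmt.Printf("max mem rss %d\n", t.max["mem rss"]) // prints: max mem rss 4096
}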
Updated by Peter Amstutz almost 2 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz almost 2 years ago
- Story points set to 2.0
- Status changed from In Progress to New
- Description updated (diff)
- Subject changed from crunch-run detects high memory usage condition and warns about it to crunch-run tracks maximum usage of each crunchstat metric
- Tracker changed from Bug to Feature
Updated by Peter Amstutz almost 2 years ago
- Target version changed from Future to To be scheduled
Updated by Peter Amstutz almost 2 years ago
- Blocks Feature #19975: Option to re-submit container with higher memory request if previous job was killed and crunchstat shows >90% memory usage added
Updated by Peter Amstutz almost 2 years ago
- Target version changed from To be scheduled to 2023-02-15 sprint
Updated by Brett Smith almost 2 years ago
Peter Amstutz wrote:
- record high-water mark for memory usage at a higher rate (e.g. every second instead of every 10 seconds)
Right now, if I'm following right, all stats are collected at the same interval, the one provided as crunch-run -crunchstat-interval. Is this asking to collect memory stats in a separate goroutine that ticks more often than the current all-collecting goroutine? That's the only way I can figure to interpret this that makes sense, but it seems like a fairly big change just by itself so I wanted to be sure.
If it is, is it okay for crunchstat to just report memory usage at the new interval? Or should it stick to the current reporting interval, and only report an extra time if RAM use increases? (I note there'll probably be a good number of jobs that have extended periods where RAM use just goes up and up, and there won't be much difference between those two options.) Or should it only report an extra time when the number passes one of the percentage thresholds?
Should there be a command line option to control the interval for collecting RAM stats? Does it need to take any particular form? Does the default need to follow this "more often than others" guideline, or can we bump that decision up a level to whatever's invoking crunch-run?
- report 1st time memory usage goes over 90%, 95% and 99%
For the host, the cgroup, or both? For these threshold percentages, what's on each side of the fraction? I'm guessing we most likely want rss/all available RAM but I don't feel super confident about that.
- print out report of high water mark to crunch-run.txt when container ends in a form that is easy to parse
Assuming we mean "easy for machines" and not "easy for people," I would probably just mimic crunchstat's current format and do something like:
<source> max <category> <number> <stat> [<number> <stat> …]
host max ram 123456 cache 123 pgmajfault 12345 rss
cgroup max cpu 1234.56 user 12.34 sys
Let me know if that doesn't meet our needs.
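(For illustration, a line in that shape could be assembled with something like the following; formatMaxLine is a made-up helper, not code from any branch.)

// Illustrative only.
package main

import (
    "fmt"
    "strings"
)

// formatMaxLine builds "<source> max <category> <number> <stat> ..." from an
// ordered list of value/stat tokens.
func formatMaxLine(source, category string, tokens ...string) string {
    return strings.Join(append([]string{source, "max", category}, tokens...), " ")
}

func main() {
    fmt.Println(formatMaxLine("cgroup", "mem", "123456", "rss", "123", "cache"))
    // prints: cgroup max mem 123456 rss 123 cache
}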
Updated by Tom Clegg almost 2 years ago
Brett Smith wrote in #note-8:
Right now, if I'm following right, all stats are collected at the same interval, the one provided as crunch-run -crunchstat-interval. Is this asking to collect memory stats in a separate goroutine that ticks more often than the current all-collecting goroutine? That's the only way I can figure to interpret this that makes sense, but it seems like a fairly big change just by itself so I wanted to be sure.
Yes, collecting memory stats in a separate goroutine at higher frequency sounds right to me. I'm hoping/expecting the "collect mem stats" part should be easy enough to extract from the existing "log mem stats and process sizes" func so it can be used for both.
If it is, is it okay for crunchstat to just report memory usage at the new interval? Or should it stick to the current reporting interval, and only report an extra time if RAM use increases? (I note there'll probably be a good number of jobs that have extended periods where RAM use just goes up and up, and there won't be much difference between those two options.) Or should it only report an extra time when the number passes one of the percentage thresholds?
"report an extra time when the number passes one of the percentage thresholds" sounds right to me.
Should there be a command line option to control the interval for collecting RAM stats? Does it need to take any particular form? Does the default need to follow this "more often than others" guideline, or can we bump that decision up a level to whatever's invoking crunch-run?
I think 1 Hz non-configurable would be fine. (Unlike the logging interval, it doesn't have potential to saturate other parts of the system when there are lots of containers running.)
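A rough sketch of what that dedicated 1 Hz sampler could look like, assuming the "collect mem stats" code gets extracted into a helper as discussed above (sampleRSS and the other names here are hypothetical):

// Hypothetical sketch -- not the actual crunchstat implementation.
package memsample

import (
    "sync/atomic"
    "time"
)

// sampleRSS stands in for the "collect mem stats" code extracted from the
// existing "log mem stats and process sizes" func.
func sampleRSS() int64 { return 0 }

// trackMaxRSS ticks once per second, independent of -crunchstat-interval,
// and keeps the highest RSS value seen.
func trackMaxRSS(done <-chan struct{}, maxRSS *int64) {
    ticker := time.NewTicker(time.Second) // fixed 1 Hz, not configurable
    defer ticker.Stop()
    for {
        select {
        case <-done:
            return
        case <-ticker.C:
            if rss := sampleRSS(); rss > atomic.LoadInt64(maxRSS) {
                atomic.StoreInt64(maxRSS, rss)
            }
        }
    }
}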
- report 1st time memory usage goes over 90%, 95% and 99%
For the host, the cgroup, or both? For these threshold percentages, what's on each side of the fraction? I'm guessing we most likely want rss/all available RAM but I don't feel super confident about that.
Just cgroup. Fraction is rss / RuntimeConstraints.RAM. I'm thinking crunchstat.Reporter gets a new field (RAMAllocated?), crunch-run populates it with RuntimeConstraints.RAM for the cgroup-only reporter, and crunchstat.Reporter enables the percentage-tracking/reporting if RAMAllocated > 0.
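Roughly, the percentage tracking could look something like this; RAMAllocated follows the field name floated above, and everything else (type, method, message text) is a made-up illustration:

// Hypothetical sketch -- names and message text are illustrative only.
package memalert

type thresholdTracker struct {
    RAMAllocated int64          // from RuntimeConstraints.RAM; 0 disables tracking
    thresholds   []int64        // e.g. 90, 95, 99 (percent)
    reported     map[int64]bool // thresholds already logged
}

// checkRSS logs each threshold once, the first time rss crosses that
// percentage of RAMAllocated.
func (t *thresholdTracker) checkRSS(rss int64, logf func(format string, args ...interface{})) {
    if t.RAMAllocated <= 0 {
        return
    }
    pct := rss * 100 / t.RAMAllocated
    for _, th := range t.thresholds {
        if pct >= th && !t.reported[th] {
            t.reported[th] = true
            logf("Container using over %d%% of allocated RAM (rss %d / %d bytes)", th, rss, t.RAMAllocated)
        }
    }
}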
- print out report of high water mark to crunch-run.txt when container ends in a form that is easy to parse
Assuming we mean "easy for machines" and not "easy for people," I would probably just mimic crunchstat's current format and do something like:
<source> max <category> <number> <stat> [<number> <stat> …]
host max ram 123456 cache 123 pgmajfault 12345 rss
cgroup max cpu 1234.56 user 12.34 sys
I think the intent was "easy for people", like "maximum memory usage was 12345 bytes, 10% of 123456 allocated".
I'm tempted to ask for both human- and machine-readable. Except: I like the separation between crunchstat.txt (machine-readable place to look for all the detailed data points) and crunch-run.txt (human-readable place to look for noteworthy events) and I don't want to creep scope.
The higher frequency is really about increasing the likelihood of catching rapidly-expanding memory before it goes OOM. I think the existing stats already suffice as a machine-readable way to find out max memory usage.
On that note, I think the intent was for the percentage warnings to be in crunch-run.txt rather than crunchstat.txt. Might be simpler to implement as a 3rd crunchstat.Reporter configured to only log the memory percentage thresholds to crunch-run.txt, rather than a crunchstat.Reporter feature that starts a 2nd goroutine?
If the percentage warnings could also trigger a log flush (in case OOM does happen) I think that might be a more valuable way to creep scope.
Updated by Brett Smith almost 2 years ago
19986-crunchstat-mem-thresholds @ 3e06e042d9111970d2b58e74f51222cdd8380c31 - developer-run-tests: #3485 - I am retesting for Workbench now
I have tried to organize this branch so each commit either does one of the bullet points in the story description, or builds internal scaffolding along the way. Reading it that way might be a smidge easier? Your call, just saying, I tried.
If the percentage warnings could also trigger a log flush (in case OOM does happen) I think that might be a more valuable way to creep scope.
I did not do this because I already spent way too much time on this branch, but I am cautiously optimistic it should be easy to add to ContainerRunner.logCrunchstatAlerts. If it is easy and you know how off-hand, feel free to add a commit to the branch; I agree it would be a nice improvement.
Updated by Brett Smith almost 2 years ago
- Status changed from New to In Progress
Updated by Brett Smith almost 2 years ago
Workbench retest developer-run-tests-apps-workbench-functionals: #3547
Updated by Tom Clegg almost 2 years ago
It looks like the crunchstat public interface has expanded a lot just so the pretty-printing of summary stats and alerts can happen in crunch-run instead of crunchstat. I'm not really sold on that: it adds to the already large number of things crunch-run does, and it gives crunchstat a sort of split personality. I think it would be better to keep all the pretty-printing in crunchstat -- add AlertLogger and SummaryLogger fields (crunch-run would set both to CrunchLog), and flush the summary stats automatically as part of the existing Stop() interface. (This would also avoid the "if thresholds are set, caller must consume alerts channel to avoid deadlocking everything" aspect of the current setup.)
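To illustrate the direction (not the real crunchstat API; only the AlertLogger/SummaryLogger names come from the suggestion above):

// Illustration of the suggested shape only.
package reportersketch

import "log"

type Reporter struct {
    Logger        *log.Logger // detailed samples, destined for crunchstat.txt
    AlertLogger   *log.Logger // threshold warnings, destined for crunch-run.txt
    SummaryLogger *log.Logger // high-water-mark summary, written by Stop()
}

// Stop shuts down polling and flushes the summary itself, so the caller never
// has to consume an alerts channel or request the summary separately.
func (r *Reporter) Stop() {
    // ... stop the polling goroutines ...
    if r.SummaryLogger != nil {
        r.SummaryLogger.Printf("Maximum container memory rss usage was ...")
    }
}

crunch-run would simply set both AlertLogger and SummaryLogger to CrunchLog.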
On the note of interfaces, unless I'm missing something, instead of asking the caller to call SetupRSSThresholds() separately, we could add struct fields (AlertRSSMax and AlertThresholds?) that the caller can optionally initialize at the same time as other options.
If that refactoring is going to happen then I suppose I should stop there, except a few style things:
Go channels... You can reduce this:
for {
    alert := <-runner.statReporter.ThresholdAlerts
    switch alert.StatName {
    case "mem rss":
        runner.CrunchLog.Printf(...)
    case "":
        return
    }
}
to this:
for alert := range runner.statReporter.ThresholdAlerts {
    runner.CrunchLog.Printf(...)
}
In testRSSThresholds, use a two-value receive instead of a zero check to detect a closed channel:
- case alert := <-rep.ThresholdAlerts:
-     c.Check(alert, DeepEquals, ThresholdAlert{})
+ case alert, ok := <-rep.ThresholdAlerts:
+     c.Check(ok, Equals, false, Commentf("oops, received %v", alert))
Line wrapping... the Go rule leaves a lot of room ("Go has no line length limit. Don't worry about overflowing a punched card. If a line feels too long, wrap it and indent with an extra tab.") but I don't feel ready to start a short-lines convention. Based on existing style in our code and stdlib, this sort of thing should be one line or, if you insist,
- runner.CrunchLog.Printf(
-     "Maximum %s memory %s usage was %d%%, %d/%d %s",
-     source, statName, percentage, value, limit, units,
- )
+ runner.CrunchLog.Printf("Maximum %s memory %s usage was %d%%, %d/%d %s",
+     source, statName, percentage, value, limit, units)
searchLogs() seems like it could be reduced to a couple of extra chars in the regexps, and then gocheck would handle the "string was x, did not match regexp y" stuff for us:
c.Check(logs, Matches, `(?ms).*pattern-goes-here\n.*`)
Updated by Peter Amstutz almost 2 years ago
- Target version changed from 2023-02-15 sprint to 2023-03-01 sprint
Updated by Brett Smith almost 2 years ago
Tom Clegg wrote in #note-13:
I think it would be better to keep all the pretty-printing in crunchstat -- add AlertLogger and SummaryLogger fields (crunch-run would set both to CrunchLog), and flush the summary stats automatically as part of the existing Stop() interface.
There is a new branch 19986-crunchstat-max-reporting @ 711185abcd86bed2a86e77ca24786d99323168a5 - developer-run-tests: #3521 - This is built on latest main, so no risk of conflicts with 19961.
I didn't roll summaries into Stop() directly because crunch-run wants different summaries from statReporter and hoststatReporter. I didn't see a good way to tell which one was which from the existing struct, and adding an arbitrary flag to an already large struct didn't feel like a good idea. So instead you get two summary methods that you can call after Stop(). They're pretty self-contained and straightforward, and crunch-run just calls the one it wants in different cases.
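For reference, the caller-side shape is roughly the sketch below; the method name here is invented for illustration and is not the one used in the branch.

// Illustration only -- method and type names differ in the actual branch.
package summarysketch

import "log"

type reporter struct {
    maxRSS int64
}

func (r *reporter) Stop() { /* stop polling goroutines */ }

// LogMaxima writes this reporter's high-water marks to the given logger;
// crunch-run calls whichever summary method it wants on statReporter vs.
// hoststatReporter after Stop().
func (r *reporter) LogMaxima(logger *log.Logger) {
    logger.Printf("Maximum memory rss usage was %d bytes", r.maxRSS)
}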
Updated by Tom Clegg almost 2 years ago
This looks good to me, thanks. I only found one thing to point out:
Just in case Statfs ever returns s.Blocks==0 for some reason, maybe it's worthwhile to avoid the possibility of a divide-by-zero here. Surely it's not supposed to happen, but still, it would be so annoying to panic and bring everything down if it does.
- if r.maxDiskSpaceSample.hasData {
+ if r.maxDiskSpaceSample.hasData && r.maxDiskSpaceSample.total > 0 {
Updated by Brett Smith almost 2 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|c1cbaf79eff608304dd4e9b91142e799a3cb60d6.