Feature #19986: crunch-run tracks maximum usage of each crunchstat metric - Arvados

Actions

Copy link

Feature #19986

closed

crunch-run tracks maximum usage of each crunchstat metric

Added by Peter Amstutz about 2 years ago. Updated about 2 years ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Brett Smith

Category:

Crunch

Target version:

2023-03-01 sprint

Story points:

2.0

Release:

Arvados 2.6.0

Release relationship:

Auto

Description

record high-water mark for each cruchstat metric
record high-water mark for memory usage at a higher rate (e.g. every second instead of every 10 seconds)
report 1st time memory usage goes over 90%, 95% and 99%
print out report of high water mark to crunch-run.txt when container ends in a form that is easy to parse

Subtasks 1 (1 open — 0 closed)

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Peter Amstutz about 2 years ago

Status changed from New to In Progress

Actions

Copy link

Updated by Peter Amstutz about 2 years ago

Story points set to 2.0
Status changed from In Progress to New
Description updated (diff)
Subject changed from crunch-run detects high memory usage condition and warns about it to crunch-run tracks maximum usage of each crunchstat metric
Tracker changed from Bug to Feature

Actions

Copy link

Updated by Peter Amstutz about 2 years ago

Target version changed from Future to To be scheduled

Actions

Copy link

Updated by Peter Amstutz about 2 years ago

Blocks Feature #19975: Option to re-submit container with higher memory request if previous job was killed and crunchstat shows >90% memory usage added

Actions

Copy link

Updated by Peter Amstutz about 2 years ago

Category set to Crunch

Actions

Copy link

Updated by Peter Amstutz about 2 years ago

Target version changed from To be scheduled to 2023-02-15 sprint

Actions

Copy link

Updated by Peter Amstutz about 2 years ago

Assigned To set to Brett Smith

Actions

Copy link

Updated by Brett Smith about 2 years ago

Peter Amstutz wrote:

record high-water mark for memory usage at a higher rate (e.g. every second instead of every 10 seconds)

Right now, if I'm following right, all stats are collected at the same interval, the one provided as crunch-run -crunchstat-interval. Is this asking to collect memory stats in a separate goroutine that ticks more often than the current all-collecting goroutine? That's the only way I can figure to interpret this that makes sense, but it seems like a fairly big change just by itself so I wanted to be sure.

If it is, is it okay for crunchstat to just report memory usage at the new interval? Or should it stick the current reporting interval, and only report an extra time if RAM use increases? (I note there'll probably be a good number of jobs that have extended periods where RAM use just goes up and up, and there won't be much difference between those two options.) Or should it only report an extra time when the number passes one of the percentage thresholds?

Should there be a command line option to control the interval for collecting RAM stats? Does it need to take any particular form? Does the default need to follow this "more often than others" guideline, or can we bump that decision up a level to whatever's invoking crunch-run?

report 1st time memory usage goes over 90%, 95% and 99%

For the host, the cgroup, or both? For these threshold percentages, what's on each side of the fraction? I'm guessing we most likely want rss/all available RAM but I don't feel super confident about that.

print out report of high water mark to crunch-run.txt when container ends in a form that is easy to parse

Assuming we mean "easy for machines" and not "easy for people," I would probably just mimic crunchstat's current format and do something like:

<source> max <category> <number> <stat> [<number> <stat> …]
host max ram 123456 cache 123 pgmajfault 12345 rss
cgroup max cpu 1234.56 user 12.34 sys

Let me know if that doesn't meet our needs.

Actions

Copy link

Updated by Tom Clegg about 2 years ago

Brett Smith wrote in #note-8:

Right now, if I'm following right, all stats are collected at the same interval, the one provided as crunch-run -crunchstat-interval. Is this asking to collect memory stats in a separate goroutine that ticks more often than the current all-collecting goroutine? That's the only way I can figure to interpret this that makes sense, but it seems like a fairly big change just by itself so I wanted to be sure.

Yes, collecting memory stats in a separate goroutine at higher frequency sounds right to me. I'm hoping/expecting the "collect mem stats" should be easy enough to extract from the "log mem stats and process sizes" func so it can be used for both.

If it is, is it okay for crunchstat to just report memory usage at the new interval? Or should it stick the current reporting interval, and only report an extra time if RAM use increases? (I note there'll probably be a good number of jobs that have extended periods where RAM use just goes up and up, and there won't be much difference between those two options.) Or should it only report an extra time when the number passes one of the percentage thresholds?

"report an extra time when the number passes one of the percentage thresholds" sounds right to me.

Should there be a command line option to control the interval for collecting RAM stats? Does it need to take any particular form? Does the default need to follow this "more often than others" guideline, or can we bump that decision up a level to whatever's invoking crunch-run?

I think 1 Hz non-configurable would be fine. (Unlike the logging interval, it doesn't have potential to saturate other parts of the system when there are lots of containers running.)

report 1st time memory usage goes over 90%, 95% and 99%

For the host, the cgroup, or both? For these threshold percentages, what's on each side of the fraction? I'm guessing we most likely want rss/all available RAM but I don't feel super confident about that.

Just cgroup. Fraction is rss / RuntimeConstraints.RAM. I'm thinking crunchstat.Reporter gets a new field (RAMAllocated?), crunch-run populates it with RuntimeConstraints.RAM for the cgroup-only reporter, and crunchstat.Reporter enables the percentage-tracking/reporting if RAMAllocated > 0.

print out report of high water mark to crunch-run.txt when container ends in a form that is easy to parse

Assuming we mean "easy for machines" and not "easy for people," I would probably just mimic crunchstat's current format and do something like:

<source> max <category> <number> <stat> [<number> <stat> …]
host max ram 123456 cache 123 pgmajfault 12345 rss
cgroup max cpu 1234.56 user 12.34 sys

I think the intent was "easy for people", like "maximum memory usage was 12345 bytes, 10% of 123456 allocated".

I'm tempted to ask for both human- and machine-readable. Except: I like the separation between crunchstat.txt (machine-readable place to look for all the detailed data points) and crunch-run.txt (human-readable place to look for noteworthy events) and I don't want to creep scope.

The higher frequency is really about increasing the likelihood of catching rapidly-expanding memory before it goes OOM. I think the existing stats already suffice as a machine-readable way to find out max memory usage.

On that note, I think the intent was for the percentage warnings to be in crunch-run.txt rather than crunchstat.txt. Might be simpler to implement as a 3rd crunchstat.Reporter configured to only log the memory percentage thresholds to crunch-run.txt, rather than a crunchstat.Reporter feature that starts a 2nd goroutine?

If the percentage warnings could also trigger a log flush (in case OOM does happen) I think that might be a more valuable way to creep scope.

Actions

Copy link

#10

Updated by Brett Smith about 2 years ago

19986-crunchstat-mem-thresholds @ 3e06e042d9111970d2b58e74f51222cdd8380c31 - developer-run-tests: #3485 - I am retesting for Workbench now

I have tried to organize this branch so each commit either does one of the bullet points in the story description, or builds internal scaffolding along the way. Reading it that way might be a smidge easier? Your call, just saying, I tried.

If the percentage warnings could also trigger a log flush (in case OOM does happen) I think that might be a more valuable way to creep scope.

I did not do this because I already spent way too much time on this branch but I am cautiously optimistic it should be easy to add to ContainerRunner.logCrunchstatAlerts. If it is easy and you know how off-hand feel free to add a commit to the branch, I agree it would be a nice improvement.

Actions

Copy link

#11

Updated by Brett Smith about 2 years ago

Status changed from New to In Progress

Actions

Copy link

#12

Updated by Brett Smith about 2 years ago

Workbench retest developer-run-tests-apps-workbench-functionals: #3547

Actions

Copy link

#13

Updated by Tom Clegg about 2 years ago

It looks like the crunchstat public interface has expanded a lot just so the pretty-printing of summary stats and alerts can happen in crunch-run instead of crunchstat. I'm not really sold on that: it adds to the already large number of things crunch-run does, and it gives crunchstat a sort of split personality. I think it would be better to keep all the pretty-printing in crunchstat -- add AlertLogger and SummaryLogger fields (crunch-run would set both to CrunchLog), and flush the summary stats automatically as part of the existing Stop() interface. (This would also avoid the "if thresholds are set, caller must consume alerts channel to avoid deadlocking everything" aspect of the current setup.)

On the note of interfaces, unless I'm missing something, instead of a asking the caller to call SetupRSSThresholds() separately, we could add struct fields (AlertRSSMax and AlertThresholds?) that the caller can optionally initialize at the same time as other options.

If that refactoring is going to happen then I suppose I should stop there, except a few style things:

Go channels... You can reduce this:

        for {
                alert := <-runner.statReporter.ThresholdAlerts
                switch alert.StatName {
                case "mem rss":
                        runner.CrunchLog.Printf(...)
                case "":
                        return
                }
        }

to this:

        for alert := range runner.statReporter.ThresholdAlerts {
                runner.CrunchLog.Printf(...)
        }

In testRSSThresholds, use a two-value receive instead of a zero check to detect a closed channel:

-        case alert := <-rep.ThresholdAlerts:
-                c.Check(alert, DeepEquals, ThresholdAlert{})
+        case alert, ok := <-rep.ThresholdAlerts:
+                c.Check(ok, Equals, false, Commentf("oops, received %v", alert))

Line wrapping... the Go rule leaves a lot of room ("Go has no line length limit. Don't worry about overflowing a punched card. If a line feels too long, wrap it and indent with an extra tab.") but I don't feel ready to start a short-lines convention. Based on existing style in our code and stdlib, this sort of thing should be one line or, if you insist,

-               runner.CrunchLog.Printf(
-                       "Maximum %s memory %s usage was %d%%, %d/%d %s",
-                       source, statName, percentage, value, limit, units,
-               )
+               runner.CrunchLog.Printf("Maximum %s memory %s usage was %d%%, %d/%d %s",
+                       source, statName, percentage, value, limit, units)

searchLogs() seems like it could be reduced to a couple of extra chars in the regexps, and then gocheck would handle the "string was x, did not match regexp y" stuff for us:

        c.Check(logs, Matches, `(?ms).*pattern-goes-here\n.*`)

Actions

Copy link

#14

Updated by Peter Amstutz about 2 years ago

Release set to 57

Actions

Copy link

#15

Updated by Peter Amstutz about 2 years ago

Target version changed from 2023-02-15 sprint to 2023-03-01 sprint

Actions

Copy link

#16

Updated by Brett Smith about 2 years ago

Tom Clegg wrote in #note-13:

I think it would be better to keep all the pretty-printing in crunchstat -- add AlertLogger and SummaryLogger fields (crunch-run would set both to CrunchLog), and flush the summary stats automatically as part of the existing Stop() interface.

There is a new branch 19986-crunchstat-max-reporting @ 711185abcd86bed2a86e77ca24786d99323168a5 - developer-run-tests: #3521 - This is built on latest main, so no risk of conflicts with 19961.

I didn't roll summaries into Stop() directly because crunch-run wants different summaries from statReporter and hoststatReporter. I didn't see a good way to tell which one was which from the existing struct, and adding an arbitrary flag to an already large struct didn't feel like a good idea. So instead you get two summary methods that you can call after Stop(). They're pretty self-contained and straightforward, and crunch-run just calls the one it wants in different cases.

Actions

Copy link

#17

Updated by Tom Clegg about 2 years ago

This looks good to me, thanks. I only found one thing to point out:

Just in case Statfs ever returns s.Blocks==0 for some reason, maybe it's worthwhile to avoid the possibility of a divide-by-zero here. Surely it's not supposed to happen, but still, it would be so annoying to panic and bring everything down if it does.

-       if r.maxDiskSpaceSample.hasData {
+       if r.maxDiskSpaceSample.hasData && r.maxDiskSpaceSample.total > 0 {

Actions

Copy link

#18

Updated by Brett Smith about 2 years ago

% Done changed from 0 to 100
Status changed from In Progress to Resolved

Applied in changeset arvados|c1cbaf79eff608304dd4e9b91142e799a3cb60d6.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Arvados

Custom queries

Feature #19986

crunch-run tracks maximum usage of each crunchstat metric

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Brett Smith about 2 years ago

Updated by Tom Clegg about 2 years ago

Updated by Brett Smith about 2 years ago

Updated by Brett Smith about 2 years ago

Updated by Brett Smith about 2 years ago

Updated by Tom Clegg about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Brett Smith about 2 years ago

Updated by Tom Clegg about 2 years ago

Updated by Brett Smith about 2 years ago