Story #7901

[Crunch] Script to report maximum resource utilization from a job log

Added by Brett Smith over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
12/02/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Use case: A bioinformatician is cost-optimizing a functional pipeline. They want to know whether the individual jobs in the pipeline are making efficient use of the compute node(s) available to it.

Provide a script in source:tools/crunchstat-summary/bin/crunchstat-summary to help answer that question. It should take a job UUID and/or job log filename as an argument, parse the crunchstat lines in the corresponding log, and report the maximum point-in-time reading for each of the following resources:

  • CPU used
    • Report a single top-style figure like "1600%" (meaning full utilization of 16 cores)
    • Don't bother with separate user/sys reporting; figuring out a good way to report those figures at maximum utilization will be a separate story if needed
  • RAM used
  • Swap used
  • net throughput for each net device mentioned
  • IO throughput for each blk device mentioned

Essentially, print the maximum value of every statistic ever printed by crunchstat. The max value before the "-- interval" delimiter will indicate maximum total usage by any one task. The max value of stats that come after the "-- interval" delimiter will indicate maximum throughput / CPU load.

When selecting the max value of an interval/throughput value, note the actual interval too. For example:
  • crunchstat: cpu 33.0700 user 3.9600 sys 8 cpus -- interval 10.0003 seconds 33.0700 user 3.9600 sys
    crunchstat: cpu 66.1500 user 7.9600 sys 8 cpus -- interval 11.0003 seconds 33.0800 user 4.0000 sys
    ---
    max cpu 66.1400 user 3.9600 sys 8 cpus
    max cpu interval 11.003 seconds 37.0800 user+sys = 337%
    

It's understood that the script can't report whether the underlying tool would run just as well with fewer resources. It's only reporting whether the tool is using the resources it has.

Some example crunchstat output:
  • crunchstat: mem 57286656 cache 0 swap 27 pgmajfault 1510322176 rss
    crunchstat: cpu 33.0700 user 3.9600 sys 8 cpus -- interval 10.0003 seconds 29.8600 user 3.4900 sys
    crunchstat: blkio:202:16 0 write 3665920 read -- interval 10.0003 seconds 0 write 0 read
    crunchstat: net:eth0 3301 tx 153030 rx -- interval 10.0003 seconds 0 tx 0 rx
    

Simplifying restrictions

No need (yet) to publish a package. Do add enough setup.py stuff to run tests, though.

If the intervals vary a lot, the max figure reported won't necessarily be the max rate. But we don't expect intervals to vary enough to matter, so we'll live with the discrepancy.


Subtasks

Task #7915: Add some test data filesResolvedTom Clegg

Task #7916: Write crunchstat-summaryResolvedTom Clegg

Task #7917: Review 7901-crunchstat-summaryResolvedTom Clegg

Task #8027: Review 7901-human-summaryResolvedTom Clegg

Associated revisions

Revision bbed885a
Added by Tom Clegg over 3 years ago

Merge branch '7901-crunchstat-summary' refs #7901

Revision bbed885a
Added by Tom Clegg over 3 years ago

Merge branch '7901-crunchstat-summary' refs #7901

Revision 242b1cc2
Added by Tom Clegg over 3 years ago

Merge branch '7901-crunchstat-summary' refs #7901

Revision 6248a5d1 (diff)
Added by Brett Smith over 3 years ago

7901: Fix variable name typo in crunchstat-summary.

Refs #7901.

Revision 0d66f5f5
Added by Tom Clegg over 3 years ago

Merge branch '7901-human-summary' closes #7901

History

#1 Updated by Tom Clegg over 3 years ago

  • Description updated (diff)

#2 Updated by Tom Clegg over 3 years ago

  • Description updated (diff)

#3 Updated by Tom Clegg over 3 years ago

  • Description updated (diff)

#4 Updated by Tom Clegg over 3 years ago

  • Description updated (diff)

#5 Updated by Brett Smith over 3 years ago

  • Story points set to 1.0

#6 Updated by Brett Smith over 3 years ago

  • Assigned To set to Tom Clegg

#7 Updated by Tom Clegg over 3 years ago

  • Status changed from New to In Progress

#8 Updated by Tom Clegg over 3 years ago

7901-crunchstat-summary @ 1d4f047 and arvados-dev|575f751

$ crunchstat-summary --job 4xphq-8i9sb-jq0ekny1xou3zoh | expand -t12
category    metric      max         max_rate
net:eth0    rx          1754364530  41658344.87
net:eth0    tx          38837956    920817.97
mem         rss         349814784   -
mem         cache       1678139392  -
mem         swap        0           -
mem         pgmajfault  0           -
keepcalls   put         0           0.00
keepcalls   get         0           0.00
keepcache   miss        0           0.00
keepcache   hit         0           0.00
blkio:0:0   write       0           0.00
blkio:0:0   read        0           0.00
net:keep0   rx          0           0.00
net:keep0   tx          0           0.00
cpu         sys         1.92        0.04
cpu         user        3.83        0.09
cpu         cpus        8           -
fuseops     write       0           0.00
fuseops     read        0           0.00

Or, equivalently,

$ zcat ~/arvados/tools/crunchstat-summary/tests/logfile_20151204190335.txt.gz | crunchstat-summary

or

$ crunchstat-summary --log-file ~/arvados/tools/crunchstat-summary/tests/logfile_20151204190335.txt.gz

#9 Updated by Tom Clegg over 3 years ago

another example

$ crunchstat-summary --job qr1hi-8i9sb-qwr5epuobcre09m | expand -t12
category    metric      max         max_rate
mem         cache       15702978560 -
mem         pgmajfault  315         -
mem         rss         10732777472 -
cpu         sys         152.38      0.11
cpu         user        29305.28    2.55
cpu         cpus        8           -
net:eth0    rx          44485200    649806.51
net:eth0    tx          6898753050  74613976.17

#10 Updated by Brett Smith over 3 years ago

Reviewing 1d4f047.

I assume the answer is no, but just being extra-safe: I don't recognize the collection.py Crunch script from anywhere. Should we be concerned about the PDHs that appear in those logs, privacy-wise? If so, we might need to aggressively garbage collect the commit from the Git server. I double-checked that all the API tokens have been expired, and they have, so that's reassuring.

In Summarizer._logdata, you slipped into Ruby mind and raise bare strings a few times. That's a bug:

>>> raise "foo" 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: exceptions must be old-style classes or derived from BaseException, not str

I suggest ValueError as the most Pythonic exception for both these cases, but I'm not wedded to it.

Nothing else is critical or would block a merge, but a few readability suggestions:

There a few single-letter variables lurking around: j and c in Summarizer._logdata, m in Summarizer.run, and s in the test method. They're all innocuous now, but code grows…

Suggest str.endswith('.ext') over str[-4:] == '.ext'. This comes up in Summarizer.run, Summarizer._logdata, and the test method. Not writing the length yourself is DRYer.

Anything you can do with printf-style strings, you can also do with format strings, if you like:

  • {!r} to write an object's repr, as in Summarizer.run.
  • {:.2f} to specify precision for a float.

There are some unused imports that could be cleaned up: itertools and os in summarizer.py. fnmatch in the tests. (Did you want to import glob and make your loop for fnm in glob.glob(os.path.join(dirname, '*.txt.gz'))?)

Props for for val, stat in zip(words[::2], words[1::2]), that was a fun line of code to read.

Thanks.

#11 Updated by Tom Clegg over 3 years ago

Brett Smith wrote:

I assume the answer is no, but just being extra-safe: I don't recognize the collection.py Crunch script from anywhere. Should we be concerned about the PDHs that appear in those logs, privacy-wise? If so, we might need to aggressively garbage collect the commit from the Git server. I double-checked that all the API tokens have been expired, and they have, so that's reassuring.

This is from the Keep performance tests on 4xphq, from #7780. The input is LobSTR reference data. (Sorry, probably could have saved you some time by mentioning where those examples came from.)

In Summarizer._logdata, you slipped into Ruby mind and raise bare strings a few times. That's a bug:

Oops, fixed. ValueError it is.

There a few single-letter variables lurking around: j and c in Summarizer._logdata, m in Summarizer.run, and s in the test method. They're all innocuous now, but code grows…

fixed → job, collection, summarizer.

Suggest str.endswith('.ext') over str[-4:] == '.ext'. This comes up in Summarizer.run, Summarizer._logdata, and the test method. Not writing the length yourself is DRYer.

Ah yes, that's better.

Anything you can do with printf-style strings, you can also do with format strings, if you like:

  • {!r} to write an object's repr, as in Summarizer.run.
  • {:.2f} to specify precision for a float.

Thanks. Updated these so we're just using format() everywhere.

There are some unused imports that could be cleaned up: itertools and os in summarizer.py. fnmatch in the tests. (Did you want to import glob and make your loop for fnm in glob.glob(os.path.join(dirname, '*.txt.gz'))?)

Yes! I started out hoping to use fnmatch and didn't go back and clean this up. glob() is better.

Props for for val, stat in zip(words[::2], words[1::2]), that was a fun line of code to read.

I merely selected it from various approaches on stackexchange... but yes, I agree. :)

rebased, now at 40158e8

#12 Updated by Tom Clegg over 3 years ago

7901-human-summary @ cf4c50a

category        metric          task_max        task_max_rate   job_total
blkio:0:0       read            0               0.00            0
blkio:0:0       write           0               0.00            0
cpu             cpus            8               -               -
cpu             sys             1.92            0.04            1.92
cpu             user            3.83            0.09            3.83
cpu             user+sys        5.75            0.13            5.75
fuseops         read            0               0.00            0
fuseops         write           0               0.00            0
keepcache       hit             0               0.00            0
keepcache       miss            0               0.00            0
keepcalls       get             0               0.00            0
keepcalls       put             0               0.00            0
mem             cache           1678139392      -               -
mem             pgmajfault      0               -               0
mem             rss             349814784       -               -
mem             swap            0               -               -
net:eth0        rx              1754364530      41658344.87     1754364530
net:eth0        tx              38837956        920817.97       38837956
net:eth0        tx+rx           1793202486      42579162.83     1793202486
net:keep0       rx              0               0.00            0
net:keep0       tx              0               0.00            0
net:keep0       tx+rx           0               0.00            0
time            elapsed         80              -               80
# Max CPU time spent by a single task: 5.75s
# Max CPU usage in a single interval: 13.00%
# Overall CPU usage: 7.19%
# Max memory used by a single task: 0.35GB
# Max network traffic in a single task: 1.79GB
# Max network speed in a single interval: 42.58MB/s

#13 Updated by Brett Smith over 3 years ago

Reviewing 7901-human-summary @ cf4c50a.

This is good to merge. One thing that might improve readability is using collections.defaultdict for some of the deep dictionaries. This can save you from doing lots of key in dict and dict.setdefault guarding. I think the two variable declarations that would pay off most (with the right imports) are:

task_stats = collections.defaultdict(functools.partial(collections.defaultdict, dict))
job_tot = collections.defaultdict(functools.partial(collections.defaultdict, int))

And then this is borderline trivial, but in the for args in… loop in _report_gen, using tuples for the args instead of lists would be a little more Pythonic. There's a lot of convention that tuples are good for fixed-length, heterogeneous sequences; where lists are good for variable-length, homogeneous sequences.

Thanks.

#14 Updated by Tom Clegg over 3 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:0d66f5f5c5173f0faad6318a9ac87d11964e5748.

Also available in: Atom PDF