Feature #8016

[Crunch2] Report statistics

Added by Peter Amstutz almost 6 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
06/30/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0
Release:
Release relationship:
Auto

Subtasks

Task #9481: Refactor crunchstat as a moduleResolvedTom Clegg

Task #9482: Use crunchstat in crunch-run to report container resource usageResolvedTom Clegg

Task #9484: Review 8016-crunchrun-crunchstatResolvedRadhika Chippada

Associated revisions

Revision 257d6025
Added by Tom Clegg about 5 years ago

Merge branch '8016-crunchrun-crunchstat'

closes #8016

Revision 6a73335d (diff)
Added by Tom Clegg about 5 years ago

8016: Rename Poll to PollPeriod. Amends c63c699aa9948f6a672536ba08e664471fb0d654.

refs #8016

History

#1 Updated by Peter Amstutz over 5 years ago

  • Assigned To set to Peter Amstutz

#2 Updated by Tom Clegg over 5 years ago

Refactor crunchstat as a module, so crunch-run can start up a goroutine that monitors the correct cgroup (based on what docker client library tells us) and writes the appropriate log events/collections.

#3 Updated by Tom Clegg over 5 years ago

  • Target version set to 2016-07-06 sprint

#4 Updated by Tom Clegg over 5 years ago

  • Assigned To changed from Peter Amstutz to Tom Clegg

#5 Updated by Tom Clegg over 5 years ago

  • Story points set to 1.0

#6 Updated by Tom Clegg about 5 years ago

  • Status changed from New to In Progress

#7 Updated by Tom Clegg about 5 years ago

8016-crunchrun-crunchstat @ c9e6477

#8 Updated by Radhika Chippada about 5 years ago

lib/crunchstat/crunchstat.go

  • line 314: missing err check
  • In doCPUStats, need any consideration for the case when getCPUCount return 0 (due to errors caught)?
  • How / where is lastCPUSample.hasData being set?
  • "If CID is empty, wait for it to appear in CIDFile. Return true if we get it before r.done indicates someone called Stop" => "If CID is empty, wait for it to appear in CIDFile. Return true if we get it before r.done which indicates someone called Stop"? Similar comment at line 409.
  • Should we be returning true if CIDFile is empty and CID is not available here?
    func (r *Reporter) waitForCIDFile() bool {
        if r.CID != "" || r.CIDFile == "" {
            return true
        }
    
  • In Reporter struct, can we call Poll as PollInterval?

services/crunch-run/crunchrun.go

  • "resource usage statistics reporting period" => can you please expand this a little? I am understadning this as resource usage stats will be reported at regular intervals given?

#9 Updated by Tom Clegg about 5 years ago

Radhika Chippada wrote:

lib/crunchstat/crunchstat.go

  • line 314: missing err check

Fixed. (The "return zero" outcome is the same but better to do it explicitly.)

  • In doCPUStats, need any consideration for the case when getCPUCount return 0 (due to errors caught)?

We'll report "0 cpus" and the downstream analysis will have to cope with it.

I think this is preferable to dropping the rest of the information for this sample period. It probably means either that method never works on this system, or something terrible has happened like we're running out of file descriptors (and the times when the worker host is having trouble are probably the times when we are most eager to save all the information we can).

  • How / where is lastCPUSample.hasData being set?

This was set by nextSample := cpuSample{true, ....}. I've fixed that to use named fields, so it's less mysterious.

-       nextSample := cpuSample{true, time.Now(), 0, 0, r.getCPUCount()}
        var userTicks, sysTicks int64
        fmt.Sscanf(string(b), "user %d\nsystem %d", &userTicks, &sysTicks)
        userHz := float64(C.sysconf(C._SC_CLK_TCK))
-       nextSample.user = float64(userTicks) / userHz
-       nextSample.sys = float64(sysTicks) / userHz
+       nextSample := cpuSample{
+               hasData:    true,
+               sampleTime: time.Now(),
+               user:       float64(userTicks) / userHz,
+               sys:        float64(sysTicks) / userHz,
+               cpus:       r.getCPUCount(),
+       }
  • "If CID is empty, wait for it to appear in CIDFile. Return true if we get it before r.done indicates someone called Stop" => "If CID is empty, wait for it to appear in CIDFile. Return true if we get it before r.done which indicates someone called Stop"? Similar comment at line 409.

This was supposed to read as "before r.done indicates (by being closed) that someone called Stop" but I see how it doesn't scan well. Updated to "...before we learn (via r.done) that someone called Stop".

  • Should we be returning true if CIDFile is empty and CID is not available here?
    [...]

Yes. If CIDFile is empty, there's no CIDFile to wait for (returning true just means nobody called Stop before we returned).

If both are empty, we're going to collect host stats instead of container stats.

  • In Reporter struct, can we call Poll as PollInterval?

Sure. I went with PollPeriod since it's more suggestive (in case anyone cares) that we're doing "poll once every X seconds", rather than "poll, then wait X seconds, then poll again".

services/crunch-run/crunchrun.go

  • "resource usage statistics reporting period" => can you please expand this a little? I am understadning this as resource usage stats will be reported at regular intervals given?

Changed to "sampling period for periodic resource usage reporting" ... better?

Now at cf1f493

#10 Updated by Radhika Chippada about 5 years ago

LGTM

#11 Updated by Tom Clegg about 5 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:257d60253246952b435cea23b1912af80ea2c6d6.

Also available in: Atom PDF