Feature #8016
closed[Crunch2] Report statistics
Updated by Tom Clegg over 8 years ago
Refactor crunchstat as a module, so crunch-run can start up a goroutine that monitors the correct cgroup (based on what docker client library tells us) and writes the appropriate log events/collections.
Updated by Tom Clegg over 8 years ago
- Assigned To changed from Peter Amstutz to Tom Clegg
Updated by Radhika Chippada over 8 years ago
lib/crunchstat/crunchstat.go
- line 314: missing err check
- In doCPUStats, need any consideration for the case when getCPUCount return 0 (due to errors caught)?
- How / where is lastCPUSample.hasData being set?
- "If CID is empty, wait for it to appear in CIDFile. Return true if we get it before r.done indicates someone called Stop" => "If CID is empty, wait for it to appear in CIDFile. Return true if we get it before r.done which indicates someone called Stop"? Similar comment at line 409.
- Should we be returning true if CIDFile is empty and CID is not available here?
func (r *Reporter) waitForCIDFile() bool { if r.CID != "" || r.CIDFile == "" { return true }
- In Reporter struct, can we call Poll as PollInterval?
services/crunch-run/crunchrun.go
- "resource usage statistics reporting period" => can you please expand this a little? I am understadning this as resource usage stats will be reported at regular intervals given?
Updated by Tom Clegg over 8 years ago
Radhika Chippada wrote:
lib/crunchstat/crunchstat.go
- line 314: missing err check
Fixed. (The "return zero" outcome is the same but better to do it explicitly.)
- In doCPUStats, need any consideration for the case when getCPUCount return 0 (due to errors caught)?
We'll report "0 cpus" and the downstream analysis will have to cope with it.
I think this is preferable to dropping the rest of the information for this sample period. It probably means either that method never works on this system, or something terrible has happened like we're running out of file descriptors (and the times when the worker host is having trouble are probably the times when we are most eager to save all the information we can).
- How / where is lastCPUSample.hasData being set?
This was set by nextSample := cpuSample{true, ....}
. I've fixed that to use named fields, so it's less mysterious.
- nextSample := cpuSample{true, time.Now(), 0, 0, r.getCPUCount()}
var userTicks, sysTicks int64
fmt.Sscanf(string(b), "user %d\nsystem %d", &userTicks, &sysTicks)
userHz := float64(C.sysconf(C._SC_CLK_TCK))
- nextSample.user = float64(userTicks) / userHz
- nextSample.sys = float64(sysTicks) / userHz
+ nextSample := cpuSample{
+ hasData: true,
+ sampleTime: time.Now(),
+ user: float64(userTicks) / userHz,
+ sys: float64(sysTicks) / userHz,
+ cpus: r.getCPUCount(),
+ }
- "If CID is empty, wait for it to appear in CIDFile. Return true if we get it before r.done indicates someone called Stop" => "If CID is empty, wait for it to appear in CIDFile. Return true if we get it before r.done which indicates someone called Stop"? Similar comment at line 409.
This was supposed to read as "before r.done indicates (by being closed) that someone called Stop" but I see how it doesn't scan well. Updated to "...before we learn (via r.done) that someone called Stop".
- Should we be returning true if CIDFile is empty and CID is not available here?
[...]
Yes. If CIDFile is empty, there's no CIDFile to wait for (returning true just means nobody called Stop before we returned).
If both are empty, we're going to collect host stats instead of container stats.
- In Reporter struct, can we call Poll as PollInterval?
Sure. I went with PollPeriod since it's more suggestive (in case anyone cares) that we're doing "poll once every X seconds", rather than "poll, then wait X seconds, then poll again".
services/crunch-run/crunchrun.go
- "resource usage statistics reporting period" => can you please expand this a little? I am understadning this as resource usage stats will be reported at regular intervals given?
Changed to "sampling period for periodic resource usage reporting" ... better?
Now at cf1f493
Updated by Tom Clegg over 8 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:257d60253246952b435cea23b1912af80ea2c6d6.