Feature #16838
closed[a-d-c] probe metrics
Description
As an indicator of how healthy our cloud is:
- avg runProbe duration by success/failed state (SummaryVec)
Related issues
Updated by Ward Vandewege about 4 years ago
- Target version changed from 2020-10-07 Sprint to 2020-09-23 Sprint
- Assigned To set to Ward Vandewege
- Status changed from New to In Progress
Ready for review in 799f8e333e7067cee0db0ee8bbcf45a56602d1f1 on branch 16838-probe-metrics
Updated by Tom Clegg about 4 years ago
TestProbeAndUpdate panics in WithLabelValues -- could solve this by calling pool.registerMetrics(prometheus.NewRegistry())
at worker_test.go L242
Not sure about calling Observe(0) in setup. Presumably the idea is to bring the success/fail metrics into existence early instead of waiting for the first success/failure, and this works well for gauges and counters where the initial value really is zero, but here it seems to add a fake "probe took 0 seconds" value, so metrics would always indicate that 1 probe succeeded and 1 probe failed even when nothing of the sort has happened, which seems unfortunate. I don't see a way around this, but I wonder if it would be better to drop it, and accept that prometheus will say "no data points" sometimes...?
Updated by Ward Vandewege about 4 years ago
Tom Clegg wrote:
TestProbeAndUpdate panics in WithLabelValues -- could solve this by calling
pool.registerMetrics(prometheus.NewRegistry())
at worker_test.go L242
Doh, I ran tests, but perhaps not in the correct git tree. Fixed as you suggested.
Not sure about calling Observe(0) in setup. Presumably the idea is to bring the success/fail metrics into existence early instead of waiting for the first success/failure, and this works well for gauges and counters where the initial value really is zero, but here it seems to add a fake "probe took 0 seconds" value, so metrics would always indicate that 1 probe succeeded and 1 probe failed even when nothing of the sort has happened, which seems unfortunate. I don't see a way around this, but I wonder if it would be better to drop it, and accept that prometheus will say "no data points" sometimes...?
That's fair. I've removed the Observe(0) call.
Changes at 126139084160563c2b4fe3969461c40ecbbf6951 on branch 16838-probe-metrics
Just to be sure, running all tests at developer-run-tests: #2107
Updated by Ward Vandewege about 4 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|dbaa58c1c806799435ca5741b3f61bc6b65b6569.
Updated by Ward Vandewege almost 4 years ago
- Related to Feature #16636: [arvados-dispatch-cloud] Add instance metrics added