Feature #16838

[a-d-c] probe metrics

Added by Ward Vandewege about 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
09/18/2020
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

As an indicator of how healthy our cloud is:

  • avg runProbe duration by success/failed state (SummaryVec)

Subtasks

Task #16845: review 16838-probe-metricsResolvedWard Vandewege


Related issues

Related to Arvados - Feature #16636: [arvados-dispatch-cloud] Add instance metricsResolved08/03/2020

Associated revisions

Revision dbaa58c1
Added by Ward Vandewege about 1 year ago

Merge branch '16838-probe-metrics' into master

closes #16838

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege <>

History

#1 Updated by Ward Vandewege about 1 year ago

  • Target version changed from 2020-10-07 Sprint to 2020-09-23 Sprint
  • Assigned To set to Ward Vandewege
  • Status changed from New to In Progress

Ready for review in 799f8e333e7067cee0db0ee8bbcf45a56602d1f1 on branch 16838-probe-metrics

#2 Updated by Tom Clegg about 1 year ago

TestProbeAndUpdate panics in WithLabelValues -- could solve this by calling pool.registerMetrics(prometheus.NewRegistry()) at worker_test.go L242

Not sure about calling Observe(0) in setup. Presumably the idea is to bring the success/fail metrics into existence early instead of waiting for the first success/failure, and this works well for gauges and counters where the initial value really is zero, but here it seems to add a fake "probe took 0 seconds" value, so metrics would always indicate that 1 probe succeeded and 1 probe failed even when nothing of the sort has happened, which seems unfortunate. I don't see a way around this, but I wonder if it would be better to drop it, and accept that prometheus will say "no data points" sometimes...?

#3 Updated by Ward Vandewege about 1 year ago

Tom Clegg wrote:

TestProbeAndUpdate panics in WithLabelValues -- could solve this by calling pool.registerMetrics(prometheus.NewRegistry()) at worker_test.go L242

Doh, I ran tests, but perhaps not in the correct git tree. Fixed as you suggested.

Not sure about calling Observe(0) in setup. Presumably the idea is to bring the success/fail metrics into existence early instead of waiting for the first success/failure, and this works well for gauges and counters where the initial value really is zero, but here it seems to add a fake "probe took 0 seconds" value, so metrics would always indicate that 1 probe succeeded and 1 probe failed even when nothing of the sort has happened, which seems unfortunate. I don't see a way around this, but I wonder if it would be better to drop it, and accept that prometheus will say "no data points" sometimes...?

That's fair. I've removed the Observe(0) call.

Changes at 126139084160563c2b4fe3969461c40ecbbf6951 on branch 16838-probe-metrics

Just to be sure, running all tests at https://ci.arvados.org/view/Developer/job/developer-run-tests/2107/

#4 Updated by Tom Clegg about 1 year ago

LGTM, thanks!

#5 Updated by Ward Vandewege about 1 year ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

#6 Updated by Peter Amstutz about 1 year ago

  • Release set to 25

#7 Updated by Ward Vandewege 11 months ago

  • Related to Feature #16636: [arvados-dispatch-cloud] Add instance metrics added

Also available in: Atom PDF