Feature #16838: [a-d-c] probe metrics - Arvados


Feature #16838

closed

[a-d-c] probe metrics

Added by Ward Vandewege over 4 years ago. Updated over 4 years ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Ward Vandewege

Category:

Target version:

2020-09-23 Sprint

Story points:

Release:

Arvados 2.1.0

Release relationship:

Auto

Description

As an indicator of how healthy our cloud is:

Average runProbe duration, broken down by success/failed state (exposed as a Prometheus SummaryVec)

Subtasks 1 (0 open — 1 closed)

Related issues 1 (0 open — 1 closed)


Updated by Ward Vandewege over 4 years ago

Target version changed from 2020-10-07 Sprint to 2020-09-23 Sprint
Assigned To set to Ward Vandewege
Status changed from New to In Progress

Ready for review in 799f8e333e7067cee0db0ee8bbcf45a56602d1f1 on branch 16838-probe-metrics


Updated by Tom Clegg over 4 years ago

TestProbeAndUpdate panics in WithLabelValues -- could solve this by calling pool.registerMetrics(prometheus.NewRegistry()) at worker_test.go L242

Not sure about calling Observe(0) in setup. Presumably the idea is to bring the success/fail metrics into existence early instead of waiting for the first success/failure, and this works well for gauges and counters where the initial value really is zero, but here it seems to add a fake "probe took 0 seconds" value, so metrics would always indicate that 1 probe succeeded and 1 probe failed even when nothing of the sort has happened, which seems unfortunate. I don't see a way around this, but I wonder if it would be better to drop it, and accept that prometheus will say "no data points" sometimes...?
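
To make the concern concrete, here is a small model of the arithmetic, using a simplified stand-in for one labeled child of a summary (it tracks only the count and sum of observations, which a real summary also exposes). The `summary` type and its `mean` method are illustrative assumptions, not the Prometheus client.

```go
package main

import "fmt"

// summary is a simplified stand-in for one label combination of a
// SummaryVec: it tracks only the observation count and sum.
type summary struct {
	count uint64
	sum   float64
}

// Observe records one value, as a real summary would.
func (s *summary) Observe(v float64) {
	s.count++
	s.sum += v
}

// mean returns sum/count, or ok=false when nothing has been observed
// (the "no data points" case).
func (s *summary) mean() (avg float64, ok bool) {
	if s.count == 0 {
		return 0, false
	}
	return s.sum / float64(s.count), true
}

func main() {
	primed := &summary{}
	primed.Observe(0) // the setup-time Observe(0) under discussion
	clean := &summary{}

	// Before any real probe: the primed metric already reports one probe.
	fmt.Println(primed.count, clean.count)

	// After one real probe that took 2 seconds, the averages differ.
	primed.Observe(2)
	clean.Observe(2)
	primedAvg, _ := primed.mean()
	cleanAvg, _ := clean.mean()
	fmt.Println(primedAvg, cleanAvg)
}
```

In this model the primed summary reports a count of 1 before any probe has run, and the fake zero keeps pulling the average down afterwards, whereas the unprimed one simply has no data until the first real observation.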


Updated by Ward Vandewege over 4 years ago

Tom Clegg wrote:

> TestProbeAndUpdate panics in WithLabelValues -- could solve this by calling pool.registerMetrics(prometheus.NewRegistry()) at worker_test.go L242

Doh, I ran tests, but perhaps not in the correct git tree. Fixed as you suggested.

> Not sure about calling Observe(0) in setup. Presumably the idea is to bring the success/fail metrics into existence early instead of waiting for the first success/failure, and this works well for gauges and counters where the initial value really is zero, but here it seems to add a fake "probe took 0 seconds" value, so metrics would always indicate that 1 probe succeeded and 1 probe failed even when nothing of the sort has happened, which seems unfortunate. I don't see a way around this, but I wonder if it would be better to drop it, and accept that prometheus will say "no data points" sometimes...?

That's fair. I've removed the Observe(0) call.

Changes at 126139084160563c2b4fe3969461c40ecbbf6951 on branch 16838-probe-metrics

Just to be sure, running all tests at developer-run-tests: #2107
