Story #13937

[keepstore] Move legacy metrics to prometheus /metrics endpoint

Added by Tom Clegg 7 months ago. Updated 7 days ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
02/13/2019
Due date:
% Done:

0%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Currently, keepstore exports some metrics in /status.json. These predate keepstore's prometheus /metrics endpoint.
  • BufferPool.BuffersAllocatedCumulative, .BuffersMax, .BuffersInUse
  • PullQueue.InProgress, .Queued
  • TrashQueue.InProgress, .Queued
  • RequestsCurrent
  • RequestsMax
  • Volumes[]
    • Label
    • Status.MountPoint, .DeviceNum, .BytesFree, .BytesUsed
    • VolumeStats.Errors, .Ops, .CompareOps, .GetOps, .PutOps, .TouchOps, .InBytes, .OutBytes
    • InternalStats.* (driver-dependent)

These should be exported as prometheus metrics instead.

There should be a transition period where metrics are exported at both /status.json and /metrics, so ops have an opportunity to migrate their monitoring/dashboard systems.


Subtasks

Task #14668: Review 13937-keepstore-prometheusIn ProgressLucas Di Pentima

History

#1 Updated by Peter Amstutz 7 months ago

http://doc.arvados.org/admin/metrics.html needs to be updated as well.

#2 Updated by Tom Morris 7 months ago

  • Tracker changed from Bug to Story
  • Target version set to To Be Groomed

#3 Updated by Tom Morris 2 months ago

  • Story points set to 2.0

#4 Updated by Tom Morris about 2 months ago

  • Target version changed from To Be Groomed to Arvados Future Sprints

#5 Updated by Tom Morris about 2 months ago

  • Target version changed from Arvados Future Sprints to 2019-01-16 Sprint

#6 Updated by Lucas Di Pentima about 2 months ago

  • Assigned To set to Lucas Di Pentima

#7 Updated by Lucas Di Pentima about 2 months ago

  • Status changed from New to In Progress

#8 Updated by Lucas Di Pentima about 1 month ago

  • Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint

#9 Updated by Lucas Di Pentima 23 days ago

  • Target version changed from 2019-01-30 Sprint to 2019-02-13 Sprint

#10 Updated by Lucas Di Pentima 11 days ago

Updates at f0553505e - branch 13937-keepstore-prometheus (WIP)

  • Exports almost all requested stats. Pending: InternalStats ErrorCodes (from statsTicker). Not sure how to implement this because I believe the number of keys would change at runtime and the metrics registry is populated once at setup time. Should I try to register all ErrorCodes keys whenever the /metrics endpoint is hit? Not sure if this is the correct approach.

Also pending: Docs updates & s3, azure driver stats

#11 Updated by Tom Clegg 11 days ago

I think CounterVec is what we're looking for here. TickErr() would look like this:

s.errCounters.WithLabelValues(errType).Add(1)

However, I'm not sure we can use this with the ConstLabels/NewGaugeFunc approach.

I'm not sure we should be using our existing ad hoc metrics as a shim here. I think it would be neater to create CounterVecs on the outside, curry them with the volume label, and pass them in to the volumes (perhaps an argument to Start()?) so the volumes can update the gauges/counters directly. Then we could delete the ad hoc metrics code entirely instead of keeping it around as an extra layer.

#12 Updated by Lucas Di Pentima 9 days ago

  • Target version changed from 2019-02-13 Sprint to 2019-02-27 Sprint

#13 Updated by Lucas Di Pentima 9 days ago

  • Story points changed from 2.0 to 1.0

#14 Updated by Lucas Di Pentima 9 days ago

Updates at f50aff88c

Refactored the previous approach to use CounterVec created within the theConfig.Start() call and passed with predefined labels to every volume .Start() call.
Also added a way for drivers to register their specific counters.

This runs ok on arvbox, but there're several tests that fail with segfaults. I would like to be sure that this approach is correct and if I should check for the metrics to be non-nil before trying to .Inc() them or some pointer on how I should be fixing those tests, as some of them don't call Start() on the volume being tested, so the metrics aren't initialized.

#15 Updated by Tom Clegg 7 days ago

This is better but I think we need to use CounterVec more to reduce complexity/repetitiveness.

For example, rather than implement our own curryWith() and getInternalCounter(), the router could have just three CounterVecs for volume metrics:
  • CounterVec "volume_operations" with labels {device_id, operation}
  • CounterVec "volume_errors" with labels {device_id, error_type}
  • CounterVec "volume_io_bytes" with labels {device_id, direction}

When starting a volume, instead of passing the prometheus registry, we would only need to pass those two CounterVecs. (We could almost curry the first label and pass a one-label vector to Start() -- however, at least in Azure's case, it looks like DeviceID() doesn't work correctly before Start() is called.)

Then, Start() would look something like this:

func (v *vol) Start(opsCounters, errCounters, ioCounters *prometheus.CounterVec) error {
        // ...
        lbls := prometheus.Labels{"device_id": v.DeviceID()}
        v.opsCounters = opsCounters.MustCurryWith(lbls)
        v.errCounters = errCounters.MustCurryWith(lbls)
        v.ioCounters = ioCounters.MustCurryWith(lbls)

Then, to use the counters:

        v.errCounters.With(prometheus.Labels{"type": errType}).Inc()
        v.ioCounters.With(prometheus.Labels{"direction": "in"}).Add(n)

I think we should just drop the "bytes used/free" gauges. Prometheus already includes a node_exporter that provides those metrics for local filesystems, and we only report dummy numbers for cloud storage volumes.

The metrics should count backend volume operations, not volume func calls. For example, instead of counting "Compare" calls, we should count "read object" and "write object". In the case of (*UnixVolume)Get(), the call stack goes Get → getWithPipe → ReadBlock → getFunc → (*osWithStats)Open → os.Open -- and then ...getFunc → countingReadWriter. We only need to count os.Open calls and I/O traffic. Rather than add new func call metrics, we just need to update the points where we already count. Something like

-        return fn(NewCountingReader(ioutil.NopCloser(f), v.os.stats.TickInBytes))
+        return fn(NewCountingReader(ioutil.NopCloser(f), v.ioCounters.With(prometheus.Labels{"direction": "in"}).Add))
-        o.stats.Tick(&o.stats.OpenOps)
+        o.opsCounters.With(prometheus.Labels{"operation": "open"}).Inc()

I don't think we need to add vector dimensions/labels for device number, volume name, etc.

Also available in: Atom PDF