Idea #13937 (closed)
[keepstore] Move legacy metrics to prometheus /metrics endpoint
Description
keepstore currently reports the following stats at /status.json rather than at the prometheus /metrics endpoint:
- BufferPool.BuffersAllocatedCumulative, .BuffersMax, .BuffersInUse
- PullQueue.InProgress, .Queued
- TrashQueue.InProgress, .Queued
- RequestsCurrent
- RequestsMax
- Volumes[]
- Label
- Status.MountPoint, .DeviceNum, .BytesFree, .BytesUsed
- VolumeStats.Errors, .Ops, .CompareOps, .GetOps, .PutOps, .TouchOps, .InBytes, .OutBytes
- InternalStats.* (driver-dependent)
These should be exported as prometheus metrics instead.
There should be a transition period where metrics are exported at both /status.json and /metrics, so ops have an opportunity to migrate their monitoring/dashboard systems.
Updated by Peter Amstutz over 6 years ago
http://doc.arvados.org/admin/metrics.html needs to be updated as well.
Updated by Tom Morris over 6 years ago
- Tracker changed from Bug to Idea
- Target version set to To Be Groomed
Updated by Tom Morris almost 6 years ago
- Target version changed from To Be Groomed to Arvados Future Sprints
Updated by Tom Morris almost 6 years ago
- Target version changed from Arvados Future Sprints to 2019-01-16 Sprint
Updated by Lucas Di Pentima almost 6 years ago
- Assigned To set to Lucas Di Pentima
Updated by Lucas Di Pentima almost 6 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima almost 6 years ago
- Target version changed from 2019-01-16 Sprint to 2019-01-30 Sprint
Updated by Lucas Di Pentima almost 6 years ago
- Target version changed from 2019-01-30 Sprint to 2019-02-13 Sprint
Updated by Lucas Di Pentima almost 6 years ago
Updates at f0553505e - branch 13937-keepstore-prometheus (WIP)
- Exports almost all requested stats. Pending: InternalStats ErrorCodes (from statsTicker). Not sure how to implement this because I believe the number of keys would change at runtime and the metrics registry is populated once at setup time. Should I try to register all ErrorCodes keys whenever the /metrics endpoint is hit? Not sure if this is the correct approach.
Also pending: Docs updates & s3, azure driver stats
Updated by Tom Clegg almost 6 years ago
I think CounterVec is what we're looking for here. TickErr() would look like this:
s.errCounters.WithLabelValues(errType).Add(1)
However, I'm not sure we can use this with the ConstLabels/NewGaugeFunc approach.
I'm not sure we should be using our existing ad hoc metrics as a shim here. I think it would be neater to create CounterVecs on the outside, curry them with the volume label, and pass them in to the volumes (perhaps an argument to Start()?) so the volumes can update the gauges/counters directly. Then we could delete the ad hoc metrics code entirely instead of keeping it around as an extra layer.
Updated by Lucas Di Pentima almost 6 years ago
- Target version changed from 2019-02-13 Sprint to 2019-02-27 Sprint
Updated by Lucas Di Pentima almost 6 years ago
- Story points changed from 2.0 to 1.0
Updated by Lucas Di Pentima almost 6 years ago
Updates at f50aff88c
Refactored the previous approach to use CounterVec, created within the theConfig.Start() call and passed with predefined labels to every volume's Start() call.
Also added a way for drivers to register their specific counters.
This runs OK on arvbox, but several tests fail with segfaults. I'd like to confirm that this approach is correct, and whether I should check that the metrics are non-nil before calling .Inc() on them -- or get some pointers on how to fix those tests, since some of them don't call Start() on the volume being tested, so the metrics aren't initialized.
Updated by Tom Clegg almost 6 years ago
This is better but I think we need to use CounterVec more to reduce complexity/repetitiveness.
For example, rather than implement our own curryWith() and getInternalCounter(), the router could have just three CounterVecs for volume metrics:
- CounterVec "volume_operations" with labels {device_id, operation}
- CounterVec "volume_errors" with labels {device_id, error_type}
- CounterVec "volume_io_bytes" with labels {device_id, direction}
When starting a volume, instead of passing the prometheus registry, we would only need to pass those three CounterVecs. (We could almost curry the first label and pass a one-label vector to Start() -- however, at least in Azure's case, it looks like DeviceID() doesn't work correctly before Start() is called.)
Then, Start() would look something like this:
func (v *vol) Start(opsCounters, errCounters, ioCounters *prometheus.CounterVec) error {
	lbls := prometheus.Labels{"device_id": v.DeviceID()}
	v.opsCounters = opsCounters.MustCurryWith(lbls)
	v.errCounters = errCounters.MustCurryWith(lbls)
	v.ioCounters = ioCounters.MustCurryWith(lbls)
	// ...
}
Then, to use the counters:
v.errCounters.With(prometheus.Labels{"type": errType}).Inc()
v.ioCounters.With(prometheus.Labels{"direction": "in"}).Add(n)
I think we should just drop the "bytes used/free" gauges. Prometheus already includes a node_exporter that provides those metrics for local filesystems, and we only report dummy numbers for cloud storage volumes.
The metrics should count backend volume operations, not volume func calls. For example, instead of counting "Compare" calls, we should count "read object" and "write object". In the case of (*UnixVolume)Get(), the call stack goes Get → getWithPipe → ReadBlock → getFunc → (*osWithStats)Open → os.Open -- and then ...getFunc → countingReadWriter. We only need to count os.Open calls and I/O traffic. Rather than add new func call metrics, we just need to update the points where we already count. Something like
- return fn(NewCountingReader(ioutil.NopCloser(f), v.os.stats.TickInBytes))
+ return fn(NewCountingReader(ioutil.NopCloser(f), v.ioCounters.With(prometheus.Labels{"direction": "in"}).Add))
- o.stats.Tick(&o.stats.OpenOps)
+ o.opsCounters.With(prometheus.Labels{"operation": "open"}).Inc()
I don't think we need to add vector dimensions/labels for device number, volume name, etc.
Updated by Lucas Di Pentima almost 6 years ago
- Target version changed from 2019-02-27 Sprint to 2019-03-13 Sprint
Updated by Lucas Di Pentima almost 6 years ago
Updates at 92d1b6e05
Test run: https://ci.curoverse.com/job/developer-run-tests/1093/
Following the above suggestions, I completed the implementation on the unix, s3 & azure volume drivers.
If it looks good, I'll continue with tests.
Updated by Tom Clegg almost 6 years ago
Yes, I think this is looking good, thanks.
Nit: it looks like we could pass around a single volumeMetricsVecs object in Start() calls (and the statsTicker struct) instead of repeating the list of errCounters, opsCounters, ioBytes.
Prometheus recommends naming metrics with the unit at the end, so maybe:
- bufferpool_bytes_allocated → bufferpool_allocated_bytes
- bufferpool_buffers_max → bufferpool_max_buffers
- bufferpool_buffers_in_use → bufferpool_inuse_buffers
- %s_queue_in_progress → %s_queue_inprogress_entries
- %s_queue_queued → %s_queue_pending_entries
- requests_current → concurrent_requests
- requests_max → max_concurrent_requests
(I'm looking at the node_exporter metrics for examples, but not sure how much time we really want to spend tweaking names...)
Updated by Lucas Di Pentima almost 6 years ago
Updates at 8660244c5
Test run: https://ci.curoverse.com/job/developer-run-tests/1111/
- Addressed simplification & naming suggestions above.
- Added tests for opsCounters & ioBytes on the generic volume tests suite.
Updated by Tom Clegg almost 6 years ago
I'm not sure about the "any" counter. It causes sum(ops) to double-count the total backend ops. Even though that's not necessarily a meaningful number, I wonder if we can solve the testing problem without affecting the real metrics at all. CounterVec doesn't let us enumerate label values directly, but perhaps we can export all metrics from a registry and parse it just enough to get the list of counters the driver is using -- something like https://godoc.org/github.com/prometheus/client_golang/prometheus#example-SummaryVec ?
It seems like GetMetricsVecs() isn't really needed -- shouldn't testMetrics() be testing that the volume is updating vm.ioBytes using {"device_id": v.DeviceID()}, instead of asking the volume which CounterVecs it's updating?
Updated by Lucas Di Pentima almost 6 years ago
Updates at f4b7e7bf5
Test run: https://ci.curoverse.com/job/developer-run-tests/1114/
- Don't pass around data unnecessarily
- Don't contaminate the real metrics; get read/write operation label values from TestableVolume-specific funcs.
Updated by Lucas Di Pentima almost 6 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|53843c61d600a0911e9f4f512b0d5c34b18f9d33.