https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422018-08-01T13:27:10ZArvadosArvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=651672018-08-01T13:27:10ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p><a class="external" href="http://doc.arvados.org/admin/metrics.html">http://doc.arvados.org/admin/metrics.html</a> needs to be updated as well.</p> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=651992018-08-01T15:51:21ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Tracker</strong> changed from <i>Bug</i> to <i>Idea</i></li><li><strong>Target version</strong> set to <i>To Be Groomed</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=700392018-12-19T15:17:48ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Story points</strong> set to <i>2.0</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=701492019-01-02T16:01:43ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> changed from <i>To Be Groomed</i> to <i>Arvados Future Sprints</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=701912019-01-02T17:16:09ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> changed from <i>Arvados Future Sprints</i> to <i>2019-01-16 Sprint</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=701922019-01-02T17:16:19ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Assigned To</strong> set to <i>Lucas Di Pentima</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=703422019-01-07T15:47:47ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=706872019-01-16T16:12:22ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Target version</strong> changed from <i>2019-01-16 Sprint</i> to <i>2019-01-30 Sprint</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=709932019-01-30T14:25:11ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Target version</strong> changed from <i>2019-01-30 Sprint</i> to <i>2019-02-13 Sprint</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=712222019-02-11T15:33:18ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Updates at <a class="changeset" title="13937: Export stats as prometheus metrics. (WIP) Includes common metrics and driver-specific (un..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f0553505e32ee00999d1d680da14260a9a0f6b99">f0553505e</a> - branch <code>13937-keepstore-prometheus</code> (WIP)</p>
<ul>
<li>Exports almost all requested stats. Pending: InternalStats ErrorCodes (from statsTicker). Not sure how to implement this because I believe the number of keys would change at runtime and the metrics registry is populated once at setup time. Should I try to register all ErrorCodes keys whenever the <code>/metrics</code> endpoint is hit? Not sure if this is the correct approach.</li>
</ul>
<p>Also pending: Docs updates & s3, azure driver stats</p> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=712332019-02-11T16:44:37ZTom Cleggtom@curii.com
<ul></ul><p>I think <a href="https://godoc.org/github.com/prometheus/client_golang/prometheus#CounterVec.With" class="external">CounterVec</a> is what we're looking for here. TickErr() would look like this:</p>
<pre>
s.errCounters.WithLabelValues(errType).Add(1)
</pre>
<p>However, I'm not sure we can use this with the ConstLabels/NewGaugeFunc approach.</p>
<p>I'm not sure we should be using our existing ad hoc metrics as a shim here. I think it would be neater to create CounterVecs on the outside, curry them with the volume label, and pass them in to the volumes (perhaps an argument to Start()?) so the volumes can update the gauges/counters directly. Then we could delete the ad hoc metrics code entirely instead of keeping it around as an extra layer.</p> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=712862019-02-13T14:22:10ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Target version</strong> changed from <i>2019-02-13 Sprint</i> to <i>2019-02-27 Sprint</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=713062019-02-13T16:17:32ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Story points</strong> changed from <i>2.0</i> to <i>1.0</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=714272019-02-13T23:07:02ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Updates at <a class="changeset" title="13937: Adds facility for drivers to register their own counters. Used on unix_volume. All the ot..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f50aff88ccf1ce6e590a3fe98689eabef4ad292a">f50aff88c</a></p>
<p>Refactored the previous approach to use <code>CounterVec</code> created within the <code>theConfig.Start()</code> call and passed with predefined labels to every volume <code>.Start()</code> call.<br />Also added a way for drivers to register their specific counters.</p>
<p>This runs ok on <code>arvbox</code>, but there're several tests that fail with segfaults. I would like to be sure that this approach is correct and if I should check for the metrics to be non-nil before trying to <code>.Inc()</code> them or some pointer on how I should be fixing those tests, as some of them don't call <code>Start()</code> on the volume being tested, so the metrics aren't initialized.</p> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=714872019-02-15T19:49:05ZTom Cleggtom@curii.com
<ul></ul><p>This is better but I think we need to use CounterVec more to reduce complexity/repetitiveness.</p>
For example, rather than implement our own curryWith() and getInternalCounter(), the router could have just three CounterVecs for volume metrics:
<ul>
<li>CounterVec "volume_operations" with labels {device_id, operation}</li>
<li>CounterVec "volume_errors" with labels {device_id, error_type}</li>
<li>CounterVec "volume_io_bytes" with labels {device_id, direction}</li>
</ul>
<p>When starting a volume, instead of passing the prometheus registry, we would only need to pass those two CounterVecs. (We could almost curry the first label and pass a one-label vector to Start() -- however, at least in Azure's case, it looks like DeviceID() doesn't work correctly before Start() is called.)</p>
<p>Then, Start() would look something like this:</p>
<pre><code class="go syntaxhl"><span class="k">func</span> <span class="p">(</span><span class="n">v</span> <span class="o">*</span><span class="n">vol</span><span class="p">)</span> <span class="n">Start</span><span class="p">(</span><span class="n">opsCounters</span><span class="p">,</span> <span class="n">errCounters</span><span class="p">,</span> <span class="n">ioCounters</span> <span class="o">*</span><span class="n">prometheus</span><span class="o">.</span><span class="n">CounterVec</span><span class="p">)</span> <span class="kt">error</span> <span class="p">{</span>
<span class="c">// ...</span>
<span class="n">lbls</span> <span class="o">:=</span> <span class="n">prometheus</span><span class="o">.</span><span class="n">Labels</span><span class="p">{</span><span class="s">"device_id"</span><span class="o">:</span> <span class="n">v</span><span class="o">.</span><span class="n">DeviceID</span><span class="p">()}</span>
<span class="n">v</span><span class="o">.</span><span class="n">opsCounters</span> <span class="o">=</span> <span class="n">opsCounters</span><span class="o">.</span><span class="n">MustCurryWith</span><span class="p">(</span><span class="n">lbls</span><span class="p">)</span>
<span class="n">v</span><span class="o">.</span><span class="n">errCounters</span> <span class="o">=</span> <span class="n">errCounters</span><span class="o">.</span><span class="n">MustCurryWith</span><span class="p">(</span><span class="n">lbls</span><span class="p">)</span>
<span class="n">v</span><span class="o">.</span><span class="n">ioCounters</span> <span class="o">=</span> <span class="n">ioCounters</span><span class="o">.</span><span class="n">MustCurryWith</span><span class="p">(</span><span class="n">lbls</span><span class="p">)</span>
</code></pre>
<p>Then, to use the counters:</p>
<pre><code class="go syntaxhl"> <span class="n">v</span><span class="o">.</span><span class="n">errCounters</span><span class="o">.</span><span class="n">With</span><span class="p">(</span><span class="n">prometheus</span><span class="o">.</span><span class="n">Labels</span><span class="p">{</span><span class="s">"type"</span><span class="o">:</span> <span class="n">errType</span><span class="p">})</span><span class="o">.</span><span class="n">Inc</span><span class="p">()</span>
<span class="n">v</span><span class="o">.</span><span class="n">ioCounters</span><span class="o">.</span><span class="n">With</span><span class="p">(</span><span class="n">prometheus</span><span class="o">.</span><span class="n">Labels</span><span class="p">{</span><span class="s">"direction"</span><span class="o">:</span> <span class="s">"in"</span><span class="p">})</span><span class="o">.</span><span class="n">Add</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
</code></pre>
<p>I think we should just drop the "bytes used/free" gauges. Prometheus already includes a node_exporter that provides those metrics for local filesystems, and we only report dummy numbers for cloud storage volumes.</p>
<p>The metrics should count backend volume operations, not volume func calls. For example, instead of counting "Compare" calls, we should count "read object" and "write object". In the case of (*UnixVolume)Get(), the call stack goes Get → getWithPipe → ReadBlock → getFunc → (*osWithStats)Open → os.Open -- and then ...getFunc → countingReadWriter. We only need to count os.Open calls and I/O traffic. Rather than add new func call metrics, we just need to update the points where we already count. Something like</p>
<pre><code class="diff syntaxhl"><span class="gd">- return fn(NewCountingReader(ioutil.NopCloser(f), v.os.stats.TickInBytes))
</span><span class="gi">+ return fn(NewCountingReader(ioutil.NopCloser(f), v.ioCounters.With(prometheus.Labels{"direction": "in"}).Add))
</span></code></pre>
<pre><code class="diff syntaxhl"><span class="gd">- o.stats.Tick(&o.stats.OpenOps)
</span><span class="gi">+ o.opsCounters.With(prometheus.Labels{"operation": "open"}).Inc()
</span></code></pre>
<p>I don't think we need to add vector dimensions/labels for device number, volume name, etc.</p> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=717722019-02-27T16:09:46ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Target version</strong> changed from <i>2019-02-27 Sprint</i> to <i>2019-03-13 Sprint</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=718462019-02-28T19:04:38ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Updates at <a class="changeset" title="13937: Adds counters to azure driver. Also, generalizes counters usage and updates unix & s3 vol..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/92d1b6e05e042a0781070e7287b1ceb3e094e852">92d1b6e05</a><br />Test run: <a class="external" href="https://ci.curoverse.com/job/developer-run-tests/1093/">https://ci.curoverse.com/job/developer-run-tests/1093/</a></p>
<p>Following above suggestions I completed the implementations on unix, s3 & azure volume drivers.<br />If it looks good, I'll continue with tests.</p> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=719332019-03-01T19:33:54ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Release</strong> set to <i>15</i></li></ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=720852019-03-06T19:59:02ZTom Cleggtom@curii.com
<ul></ul><p>Yes, I think this is looking good, thanks.</p>
<p>Nit: it looks like we could pass around a single volumeMetricsVecs object in Start() calls (and the statsTicker struct) instead of repeating the list of errCounters, opsCounters, ioBytes.</p>
Prometheus recommends naming metrics with the unit at the end, so maybe
<ul>
<li>bufferpool_bytes_allocated → bufferpool_allocated_bytes</li>
<li>bufferpool_buffers_max → bufferpool_max_buffers</li>
<li>bufferpool_buffers_in_use → bufferpool_inuse_buffers</li>
<li>%s_queue_in_progress → %s_queue_inprogress_entries</li>
<li>%s_queue_queued → %s_queue_pending_entries</li>
<li>requests_current → concurrent_requests</li>
<li>requests_max → max_concurrent_requests</li>
</ul>
<p>(I'm looking at the node_exporter metrics for examples, but not sure how much time we really want to spend tweaking names...)</p> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=721702019-03-08T21:15:50ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Updates at <a class="changeset" title="13937: Adds tests for operation & I/O counters. Added {"operation": "any"} to opsCounters so tha..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/8660244c570cecdb5b4c25a3809cfe01fafebada">8660244c5</a><br />Test run: <a class="external" href="https://ci.curoverse.com/job/developer-run-tests/1111/">https://ci.curoverse.com/job/developer-run-tests/1111/</a></p>
<ul>
<li>Addressed simplification & naming suggestions above.</li>
<li>Added tests for <code>opsCounters</code> & <code>ioBytes</code> on the generic volume tests suite.</li>
</ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=721982019-03-11T19:52:05ZTom Cleggtom@curii.com
<ul></ul><p>I'm not sure about the "any" counter. It causes <code>sum(ops)</code> to double-count the total backend ops. Even though that's not necessarily a meaningful number, I wonder if we can solve the testing problem without affecting the real metrics at all. CounterVec doesn't let us enumerate label values directly, but perhaps we can export all metrics from a registry and parse it just enough to get the list of counters the driver is using -- something like <a class="external" href="https://godoc.org/github.com/prometheus/client_golang/prometheus#example-SummaryVec">https://godoc.org/github.com/prometheus/client_golang/prometheus#example-SummaryVec</a> ?</p>
<p>It seems like <code>GetMetricsVecs()</code> isn't really needed -- shouldn't testMetrics() be testing that the volume is updating <code>vm.ioBytes</code> using <code>{"device_id": v.DeviceID()}</code>, instead of asking the volume which CounterVecs it's updating?</p> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=722352019-03-12T18:22:11ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Updates at <a class="changeset" title="13937: Removes the 'any' operation type for a better way of testing driver labels Arvados-DCO-1...." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f4b7e7bf5b1a875db95a42781937c2a690062dd3">f4b7e7bf5</a><br />Test run: <a class="external" href="https://ci.curoverse.com/job/developer-run-tests/1114/">https://ci.curoverse.com/job/developer-run-tests/1114/</a></p>
Addressed above suggestions:
<ul>
<li>Don't pass around data unnecessarily</li>
<li>Don't contaminate the real metrics & get read/write operation label values from TestableVolume specific funcs.</li>
</ul> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=722412019-03-12T20:30:09ZTom Cleggtom@curii.com
<ul></ul><p>LGTM, thanks!</p> Arvados - Idea #13937: [keepstore] Move legacy metrics to prometheus /metrics endpointhttps://dev.arvados.org/issues/13937?journal_id=722432019-03-12T20:41:54ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>Applied in changeset <a class="changeset" title="Merge branch '13937-keepstore-prometheus' Closes #13937 Arvados-DCO-1.1-Signed-off-by: Lucas Di ..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/53843c61d600a0911e9f4f512b0d5c34b18f9d33">arvados|53843c61d600a0911e9f4f512b0d5c34b18f9d33</a>.</p>