Project

General

Profile

Actions

Feature #15317

closed

[Keep] expose backend latency stats in the Prometheus metrics

Added by Ward Vandewege over 5 years ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Story points:
2.0
Release:
Release relationship:
Auto

Description

keepstore

  • bucketed by block size (1 MiB < 4 MiB < 8 MiB < 16 MiB < 32 MiB < 64 MiB)
  • time to first byte from backend
  • transfer speed from backend (start of request to last byte)
  • transfer speed from backend (first byte to last byte)
  • time between first byte from backend and first byte sent to client

keep-web

  • bucketed by block size (1 MiB < 4 MiB < 8 MiB < 16 MiB < 32 MiB < 64 MiB)
    • time to first byte of a block from keepstore
    • transfer speed of a block from keepstore (start of request to last byte)
    • transfer speed of a block from keepstore (first byte to last byte)
    • transfer speed of a block (from disk cache)
  • bucketed by file size (1 MiB < 10 MiB < 100 MiB < 1000 MiB)
    • time from start of request to first byte on file (cached & uncached manifest)
    • transfer speed of a whole file (start of request to last byte)
    • transfer speed of a whole file (first byte to last byte)
    • time between first byte from keepstore and first byte sent to client

Files


Subtasks 1 (0 open1 closed)

Task #21345: Review 15317-metricsResolvedTom Clegg03/12/2024Actions
Actions #1

Updated by Ward Vandewege over 5 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Ward Vandewege over 5 years ago

  • Status changed from In Progress to New
Actions #3

Updated by Tom Morris over 5 years ago

  • Description updated (diff)
  • Story points set to 2.0
Actions #4

Updated by Tom Morris over 5 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
Actions #5

Updated by Peter Amstutz over 3 years ago

  • Target version deleted (Arvados Future Sprints)
Actions #6

Updated by Peter Amstutz almost 2 years ago

  • Release set to 60
Actions #7

Updated by Peter Amstutz about 1 year ago

  • Target version set to Development 2024-01-03 sprint
Actions #8

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-01-03 sprint to Development 2023-11-08 sprint
Actions #9

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-11-08 sprint to Development 2023-11-29 sprint
Actions #10

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-11-29 sprint to Development 2024-01-03 sprint
Actions #11

Updated by Peter Amstutz about 1 year ago

  • Release deleted (60)
  • Description updated (diff)
Actions #12

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-01-03 sprint to Development 2024-01-17 sprint
Actions #13

Updated by Peter Amstutz 12 months ago

  • Assigned To set to Tom Clegg
Actions #14

Updated by Peter Amstutz 11 months ago

  • Status changed from New to In Progress
Actions #15

Updated by Tom Clegg 11 months ago

  • File download-backend-speed-example.png added

Median upload sync latency (time between receiving last byte of data from client and sending response header -- i.e., time to flush the last data block(s) and update the collection record) for files 100M and larger:

histogram_quantile(0.5, sum(rate(arvados_keepweb_upload_sync_delay_seconds_bucket{size_range="100M"}[5m])) by (le))

Limiting backend speed when downloading files: I'm not sure this is the ideal name for it, but this statistic represents the speed of transferring data from keep/cache into keep-web, as computed by subtracting the time spent waiting to send data to the client. When the client is slow, this "effective speed" can be higher than the maximum achievable transfer speed and is more or less meaningless. But when the client is receiving data fast enough that the keep/cache backend can't keep up, this stat should approach the actual bottleneck speed.

The gist is that when this statistic is low (say 50 MB/s) it's always the fault of the software/network on the Arvados side. Unlike a normal download-speed metric, low readings can never be attributed to slow clients.

histogram_quantile(0.5, sum(rate(arvados_keepweb_download_limiting_backend_speed_bucket{size_range="100M"}[5m])) by (le))
Example after some repeated downloads, all over 100M:
  1. downloading to localhost, fully cached -- median limiting backend speed 1 GB/s
  2. downloading to localhost, with cache misses (downloading a file that's bigger than the entire disk cache) -- median limiting backend speed around 80 MB/s

Example after downloading small files (cloning a git repo) from a well packed collection to a remote client, showing
  • 5th percentile limiting backend speed ~50 KB/s (the slowest responses were cache misses)
  • median limiting backend speed ~500 KB/s (most responses were cache hits)

Actions #16

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Actions #17

Updated by Tom Clegg 11 months ago

  • File deleted (download-backend-speed-example.png)
Actions #18

Updated by Tom Clegg 11 months ago

Actions #19

Updated by Tom Clegg 11 months ago

  • File deleted (download-limiting-backend-speed-0.png)
Actions #21

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Actions #22

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Actions #23

Updated by Tom Clegg 10 months ago

15317-metrics @ 6d868844fae9f52455b2e1954382bf9e4bb66c22 -- developer-run-tests: #4060

  • All agreed upon points are implemented / addressed.
    • Adds metrics for the limiting backend speed for file content retrieval, as viewed by keep-web, bucketed by file size (see note above).
    • I expect the forthcoming streaming/prefetch improvements we've been working on (#2960, #18961) will show up on the new "limiting backend speed" metrics.
    • Adds overall upload/download speed metrics bucketed by file size, which may help us focus subsequent investigations/optimizations based on real-world performance
    • Adds upload sync delay metric (not bucketed by file size because it's expected to be a function of manifest size more than file size)
    • These aren't exactly the metrics mentioned in the description, which as discussed offline is ok.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • I think we want to leave this issue open to add keepstore backend metrics like upload/download speed bucketed by block size. This should surely wait until #2960 is merged.
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • Automated tests ✅. Manual tests, see note above.
  • Documentation has been updated.
    • New prometheus metrics have help text.
  • Behaves appropriately at the intended scale (describe intended scale).
    • Shouldn't affect scale. Tracking code adds some minimal overhead per chunk uploaded/downloaded.
  • Considered backwards and forwards compatibility issues between client and server.
    • N/A
  • Follows our coding standards and GUI style guidelines.
Actions #24

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Actions #25

Updated by Brett Smith 10 months ago

  • Spitballing the name, what do you think about "apparent" or "relative," like:
    Name: "download_apparent_backend_speed",
    Help: "Apparent download speed from the backend (bytes per second) when serving file downloads to the client. Low values suggest client download speeds could be improved if keep-web could download data from the backend faster."

    I'm still not in love with it. The main thing I'm trying to get at is to more explicitly hint at the caveat in your comments, that the number can be affected by client transfer speed too.
    I kind of wonder if the fact that this is hard to describe suggests that this isn't the most helpful metric to report. I understand it gives some useful data, but that data is hard to interpret: you need additional information (which we apparently don't track) to understand whether or not the number is meaningful. Is there a clearer way we could present this?
  • In sizeRange, should the comparisons be < rather than <=? Honest question, I don't know the answer, I think it kind of depends on reporting conventions in both Prometheus and our own metrics.
  • Nit, no blank line after the func newDownloadTracker defintion, I don't know if we have a style rule about this but it's at least unusual.
  • Nit, the bottom of TestMetrics has the comment // Dump entire metrics output in test logs written twice. It feels like it should be there at most once, and I wonder if you could avoid the need for it entirely by using a more descriptive variable name than buf. I also wonder whether we have any conventions about whether to log data like this before running checks on it, or afterwards.
Actions #26

Updated by Tom Clegg 9 months ago

Brett Smith wrote in #note-25:

  • Spitballing the name, what do you think about "apparent" or "relative," like:[...]

I like "apparent" better than limiting, so I changed to that.

Is there a clearer way we could present this?

I've updated the help text to link to #note-25 above, which should at least discourage from trying to interpret the data just based on the metric name/help.

  • In sizeRange, should the comparisons be < rather than <=? Honest question, I don't know the answer, I think it kind of depends on reporting conventions in both Prometheus and our own metrics.

Yes, I think it makes more sense to have an exactly 1M file counted in the 1M category. Fixed.

  • Nit, no blank line after the func newDownloadTracker defintion, I don't know if we have a style rule about this but it's at least unusual.

Oops, gofmt usually fixes things like this, but it allows either 0 or 1 blank lines between funcs, so this is mismatched for no reason. Fixed.

  • Nit, the bottom of TestMetrics has the comment // Dump entire metrics output in test logs written twice. It feels like it should be there at most once, and I wonder if you could avoid the need for it entirely by using a more descriptive variable name than buf. I also wonder whether we have any conventions about whether to log data like this before running checks on it, or afterwards.

Improved variable name and deleted comments.

15317-metrics @ ae7a179649bbe19761c49de65b10127592ef646b -- developer-run-tests: #4072

Actions #27

Updated by Brett Smith 9 months ago

Tom Clegg wrote in #note-26:

15317-metrics @ ae7a179649bbe19761c49de65b10127592ef646b -- developer-run-tests: #4072

This basically looks good to me except:

I've updated the help text to link to #note-25 above

-1 to this. That comment was never intended to be documentation, and the information people are looking for is like one or two sentences in one bullet in the middle of a larger conversation with a lot of irrelevant stuff. Expecting anyone to read it seems user-hostile.

What if we started expanding doc/admin/metrics.html.textile.liquid? We could have a section for each service with a /metrics endpoint, and then a subsection to document each metric in more detail. I wouldn't expect this branch to backfill that whole thing, just start enough structure to document this endpoint, and establish the pattern for future documentation.

Failing that: #note-15 would be a better link. It is mostly about this and goes into more detail about how to interpret the number, with examples. Maybe go back and edit it to reflect the new name?

Thanks.

Actions #28

Updated by Tom Clegg 9 months ago

Oops, I meant to link to #note-15, note #note-25, but much of that criticism applies anyway. Updated to link to WebDAV performance metrics instead, and put a slightly updated version of the note-15 explanation there.

Actions #29

Updated by Tom Clegg 9 months ago

  • Status changed from In Progress to Resolved
Actions #30

Updated by Peter Amstutz 8 months ago

  • Release set to 70
Actions

Also available in: Atom PDF