Feature #8409


[crunchstat-summary] Look for signs of cache thrashing and suggest increasing keep_cache_mb_per_task

Added by Tom Clegg about 8 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Story points:
-

Description

arv-mount logs #bytes read from it via the FUSE filesystem (blkio:0:0 read), as well as #bytes it received from keepstore in order to service those reads (net:keep0 rx). This is just my (TC) best guess so far about how to estimate cache effectiveness -- other suggestions are welcome!

Ideally, any time arv-mount retrieves a block from Keep, it's 100% useful: a user process reads every byte of the data, and that same block never gets re-fetched after expiring from the cache. In crunchstat terms, this would yield a net:keep0÷blkio:0:0 ratio of 1.

In practice, some workflows don't get anywhere near ideal: for example, if arv-mount uses a cache big enough to hold 4 blocks (which is the default), an 8-way merge sort can easily generate a sequence of 4KiB read operations that each cause a 64 MiB block to be ejected from the cache, and a new 64 MiB block (recently ejected from the cache itself) to be re-fetched. In crunchstat terms, this would yield a net:keep0÷blkio:0:0 ratio of 16384.

crunchstat-summary should compute this ratio -- perhaps just once per task, using the last sample -- and emit a warning/suggestion to increase keep_cache_mb_per_task if the ratio is too high.

The definition of "too high" is TBD. Suggest we always print the ratio, and start learning what the "normal" range looks like; we can tweak the warning threshold as we go.

Implementation

Suggest adding a line to _text_report_gen(), and a _recommend_keep_cache() function, to Summarizer in source:tools/crunchstat-summary/crunchstat_summary/summarizer.py
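As a starting point, here is a minimal sketch of what such a check might look like, assuming report lines are yielded the same way _text_report_gen() produces them and that job totals are available as self.job_tot[category][stat] (the structure used in the miss_rate line quoted later in this thread). The 'net:keep0'/'blkio:0:0' keys simply mirror the crunchstat labels, and the 2.0 threshold is a placeholder until we learn what "normal" looks like:

def _recommend_keep_cache(self):
    # Ratio of bytes fetched from keepstore to bytes served via FUSE;
    # 1.0 is ideal, much higher suggests cache thrashing.
    fs_read = float(self.job_tot['blkio:0:0']['read'])
    net_rx = float(self.job_tot['net:keep0']['rx'])
    if fs_read == 0:
        return  # nothing read through FUSE; avoid dividing by zero
    ratio = net_rx / fs_read
    yield '# Keep download ratio (net:keep0 rx ÷ blkio:0:0 read): {:.2f}'.format(ratio)
    if ratio > 2.0:  # placeholder; the "too high" threshold is still TBD
        yield '# Recommendation: consider increasing keep_cache_mb_per_task'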

Actions #1

Updated by Tom Clegg about 8 years ago

  • Description updated (diff)

The distinction doesn't seem as clear as I expected in the examples I dug up.

Thrashing... 361620009773÷264432681773=1.37 (note the 2.77x write ratio too though, perhaps this isn't a good example?)

2016-02-04_02:07:26 wx7k5-8i9sb-c6s9lg0egqn8bn1 3321 2 stderr crunchstat: keepcalls 3920 put 4039090 get -- interval 10.0000 seconds 0 put 0 get
2016-02-04_02:07:27 wx7k5-8i9sb-c6s9lg0egqn8bn1 3321 2 stderr crunchstat: net:keep0 728702653454 tx 361620009773 rx -- interval 10.0000 seconds 0 tx 0 rx
2016-02-04_02:07:27 wx7k5-8i9sb-c6s9lg0egqn8bn1 3321 2 stderr crunchstat: keepcache 4033694 hit 5396 miss -- interval 10.0000 seconds 0 hit 0 miss
2016-02-04_02:07:27 wx7k5-8i9sb-c6s9lg0egqn8bn1 3321 2 stderr crunchstat: fuseops 3421144 write 2017478 read -- interval 10.0000 seconds 0 write 0 read
2016-02-04_02:07:27 wx7k5-8i9sb-c6s9lg0egqn8bn1 3321 2 stderr crunchstat: blkio:0:0 262817178823 write 264432681773 read -- interval 10.0000 seconds 0 write 0 read

Same program as above, but thrashing much less (finished much faster): 343041065773÷264432681773=1.30

2016-01-10_23:28:14 wx7k5-8i9sb-my7sih9uxiziotc 31285 2 stderr crunchstat: keepcalls 3920 put 4039090 get -- interval 10.0000 seconds 1 put 0 get
2016-01-10_23:28:14 wx7k5-8i9sb-my7sih9uxiziotc 31285 2 stderr crunchstat: net:keep0 681551781191 tx 343041065773 rx -- interval 10.0000 seconds 218492231 tx 0 rx
2016-01-10_23:28:14 wx7k5-8i9sb-my7sih9uxiziotc 31285 2 stderr crunchstat: keepcache 4033971 hit 5119 miss -- interval 10.0000 seconds 0 hit 0 miss
2016-01-10_23:28:14 wx7k5-8i9sb-my7sih9uxiziotc 31285 2 stderr crunchstat: fuseops 3421148 write 2017478 read -- interval 10.0000 seconds 0 write 0 read
2016-01-10_23:28:14 wx7k5-8i9sb-my7sih9uxiziotc 31285 2 stderr crunchstat: blkio:0:0 262817178823 write 264432681773 read -- interval 10.0000 seconds 0 write 0 read

Not sure whether this job would have benefited from more than the 4GB it had, but it is known to be hard on the cache. 446211034191÷374595757156=1.19

2016-02-10_04:51:03 wx7k5-8i9sb-tpagyit3petp82w 16329 1 stderr crunchstat: keepcalls 4 put 5722680 get -- interval 10.0000 seconds 0 put 0 get
2016-02-10_04:51:03 wx7k5-8i9sb-tpagyit3petp82w 16329 1 stderr crunchstat: net:keep0 643677536 tx 446211034191 rx -- interval 10.0000 seconds 0 tx 0 rx
2016-02-10_04:51:03 wx7k5-8i9sb-tpagyit3petp82w 16329 1 stderr crunchstat: keepcache 5716027 hit 6653 miss -- interval 10.0000 seconds 0 hit 0 miss
2016-02-10_04:51:03 wx7k5-8i9sb-tpagyit3petp82w 16329 1 stderr crunchstat: fuseops 31095 write 2859452 read -- interval 10.0000 seconds 0 write 0 read
2016-02-10_04:51:03 wx7k5-8i9sb-tpagyit3petp82w 16329 1 stderr crunchstat: blkio:0:0 254729904 write 374595757156 read -- interval 10.0000 seconds 0 write 0 read

Not thrashing, 9125166075÷7573424175=1.20

2016-02-10_12:50:00 wx7k5-8i9sb-r9b27gij3vvsss6 28500 22 stderr crunchstat: keepcalls 0 put 115557 get -- interval 10.0000 seconds 0 put 104 get
2016-02-10_12:50:00 wx7k5-8i9sb-r9b27gij3vvsss6 28500 22 stderr crunchstat: net:keep0 0 tx 9125166075 rx -- interval 10.0000 seconds 0 tx 0 rx
2016-02-10_12:50:01 wx7k5-8i9sb-r9b27gij3vvsss6 28500 22 stderr crunchstat: keepcache 115417 hit 140 miss -- interval 10.0000 seconds 104 hit 0 miss
2016-02-10_12:50:01 wx7k5-8i9sb-r9b27gij3vvsss6 28500 22 stderr crunchstat: fuseops 0 write 57801 read -- interval 10.0000 seconds 0 write 52 read
2016-02-10_12:50:01 wx7k5-8i9sb-r9b27gij3vvsss6 28500 22 stderr crunchstat: blkio:0:0 0 write 7573424175 read -- interval 10.0000 seconds 0 write 6815744 read
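For anyone reproducing these ratios by hand, here is a standalone sketch (not part of crunchstat-summary) that pulls the last cumulative sample of each counter out of raw crunchstat stderr lines like the ones above; the regex only handles the two-counter format shown here:

import re

# e.g. "crunchstat: net:keep0 728702653454 tx 361620009773 rx -- interval ..."
STAT = re.compile(r'crunchstat: (\S+) (\d+) (\S+) (\d+) (\S+) -- interval')

def last_sample_ratio(log_lines):
    # Keep the last cumulative value seen for each (category, counter) pair.
    totals = {}
    for line in log_lines:
        m = STAT.search(line)
        if m:
            cat, v1, n1, v2, n2 = m.groups()
            totals[(cat, n1)] = int(v1)
            totals[(cat, n2)] = int(v2)
    read = totals.get(('blkio:0:0', 'read'))
    rx = totals.get(('net:keep0', 'rx'))
    if not read or rx is None:
        return None
    return float(rx) / read

Feeding it the first excerpt gives 361620009773 ÷ 264432681773 ≈ 1.37, the figure quoted above.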

Actions #2

Updated by Tom Clegg about 8 years ago

  • Description updated (diff)
Actions #3

Updated by Sarah Guthrie about 8 years ago

This job is very likely to have experienced severe keep cache thrashing: https://workbench.qr2hi.arvadosapi.com/jobs/qr2hi-8i9sb-62os0hczxpduflv

Actions #4

Updated by Tom Clegg about 8 years ago

8409-report-keep-cache

Fix tests ("ZeroDivisionError: float division by zero"), then LGTM

Actions #5

Updated by Peter Amstutz about 8 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:292d026aecbc6b7e1bd12a6c5db8cc905318a992.

Actions #6

Updated by Tom Clegg about 8 years ago

  • Status changed from Resolved to In Progress

We might want to stick with filesystem÷network -- hits÷misses seems to vary a lot for unrelated reasons.

These jobs all have utilization >89% but miss rate varies from 0.07% to 0.36%.

align (9tee4-8i9sb-oseoqsp65i75fuq)
# Keep cache miss rate 0.16%
# Keep cache utilization 89.80%
merge (9tee4-8i9sb-ciqfax9rvphlres)
# Keep cache miss rate 0.07%
# Keep cache utilization 100.01%
combine-annotate (9tee4-8i9sb-1sq6iltk7ubglac)
# Keep cache miss rate 0.36%
# Keep cache utilization 90.10%
filter (9tee4-8i9sb-bbyfogytnepffwa)
# Keep cache miss rate 0.08%
# Keep cache utilization 94.66%

I suspect this one has low utilization because it reads only selected segments of indexed files (i.e., it's inefficient because we don't do Range requests, not because the cache is too small):

call (9tee4-8i9sb-cnpljxoiuqbu69o)
# Keep cache miss rate 0.91%
# Keep cache utilization 11.26%

When caching is perfect, hit÷miss tracks the size of the filesystem/fuse reads relative to the 64MiB Keep block size, right?
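For reference, the utilization figures above correspond to the filesystem÷network ratio expressed as a percentage. Written in the same style as the miss_rate line quoted below, and assuming the same job_tot field names, it would look roughly like:

utilization = float(self.job_tot['blkio:0:0']['read']) / float(self.job_tot['net:keep0']['rx']) * 100.0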

Actions #7

Updated by Peter Amstutz about 8 years ago

  • Status changed from In Progress to New
  • % Done changed from 100 to 0

The miss rate tells you the proportion of get() operations that resulted in a fetch, which affects latency and throughput.

Both numbers are highly dependent on access patterns. It's possible to get 100% or even above 100% cache utilization (if a block is read through fuse multiple times), but it is also possible to get very low cache efficiency with a low miss rate if it is fetching large blocks and only reading small portions of them.

It's also possible that a lot of random access results in a lot of cache misses but good network efficiency.

In your last example, the network efficiency would suggest it is thrashing terribly, but the miss rate suggests the poor efficiency is due to the access pattern and lack of support for range requests.

Incidentally, here's the results from the job in note-3, which is a job that was definitely thrashing.

# Keep cache miss rate 7.85%
# Keep cache utilization 0.46%

I suggest tweaking the recommendation to put a higher ceiling on the miss rate (maybe 0.5%) and also include a check on the efficiency (if the efficiency is > 90%, the miss rate is probably due to access patterns).
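A sketch of that combined rule (the 0.5% and 90% figures are just the values floated above, not settled thresholds):

def cache_warning(miss_rate_pct, utilization_pct):
    # High utilization means remaining misses are probably just the access pattern.
    if utilization_pct > 90.0:
        return None
    if miss_rate_pct > 0.5:
        return '# Recommendation: consider increasing keep_cache_mb_per_task'
    return None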

Actions #8

Updated by Peter Amstutz about 8 years ago

Tom Clegg wrote:

We might want to stick with filesystem÷network -- hits÷misses seems to vary a lot for unrelated reasons.

Also, it's not hits÷misses, it's hits÷gets (all calls internally to get() which may result in a cache hit or not).

Actions #9

Updated by Tom Clegg about 8 years ago

Peter Amstutz wrote:

Also, it's not hits÷misses, it's hits÷gets (all calls internally to get() which may result in a cache hit or not).

According to the code it's misses÷gets, aka misses÷(hits+misses)

miss_rate = float(self.job_tot['keepcache']['miss']) / float(self.job_tot['keepcalls']['get']) * 100.0

Anyway, the issue is still that in the typical case where a process reads (long segments of) a large file, the "miss rate" is dominated by the number of times FUSE calls read() for each 64MiB data block, which is irrelevant: with perfect caching, the miss rate is either 1% or 0.1% or 0.01% depending on whether read() is getting 640KiB or 64KiB or 6.4KiB at a time, which has little discernible effect on performance (compared to unneeded network transfers). OTOH, 50% utilization always means half of the network activity was wasted (either the cache is too small, or it's just that we always read 64MiB chunks even when the caller doesn't want all of those bytes).
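A quick check of that arithmetic, assuming perfect caching and 64 MiB blocks:

BLOCK = 64.0 * 1024 * 1024
for read_size in (640 * 1024, 64 * 1024, 6.4 * 1024):
    # best case: one miss per block, i.e. miss rate = read size ÷ block size
    print('{:7.1f} KiB reads -> miss rate {:.2%}'.format(read_size / 1024.0, read_size / BLOCK))
# prints 0.98%, 0.10%, 0.01% -- the miss rate tracks read size, not cache health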

Given the following examples from above notes:

            miss_rate   utilization
fine        0.07%       100.01%
fine        0.16%       89.80%
fine        0.36%       90.10%
suffering   7.85%       0.46%

...I'm going to update the reporting threshold to 80% utilization.

(Both measures would false-positive the "call" example.)

Actions #10

Updated by Peter Amstutz about 8 years ago

  • Status changed from New to In Progress

You're right about the miss rate being affected by the read size.

On further consideration, I think the metric we want is "sum of sizes of unique blocks requested" vs "total bytes transferred". Then we could determine cache effectiveness (with infinite cache, every block would only need to be requested once) somewhat separately from the question of whether we're over-fetching due to lack of range requests (although #8228 only kicks in after a block has been requested once).

However, since it requires additional instrumentation of the Python Keep client, maybe that should be a separate story?
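A sketch of what that extra instrumentation might track (nothing like this exists in the Python Keep client today; the class and method names are purely illustrative):

class BlockFetchStats(object):
    def __init__(self):
        self.unique_blocks = {}    # locator -> block size in bytes
        self.bytes_transferred = 0

    def record_fetch(self, locator, size):
        # Call once per block actually fetched from keepstore.
        self.unique_blocks[locator] = size
        self.bytes_transferred += size

    def effectiveness(self):
        # Sum of sizes of unique blocks requested ÷ total bytes transferred;
        # 1.0 means no block was ever fetched twice.
        if not self.bytes_transferred:
            return None
        return float(sum(self.unique_blocks.values())) / self.bytes_transferred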

Actions #11

Updated by Tom Clegg about 8 years ago

Yes, I agree we need better instrumentation in order to get reliable indicators. No opinion on whether it should be part of this issue or a new one.

Before we get too excited about collecting precise numbers we should consider Nico's point about time. For example, if network requests have high latency and high throughput, you might well get better performance by doing larger transfers even though that reduces cache utilization.

Possible questions
  • How much time did callers spend waiting for data? (This divided by #bytes seems like a good indicator of how well the whole system performs.)
  • How much time did callers spend waiting for data that had already been fetched by the same process? (This would indicate whether the cache is big/smart enough, and establish an upper bound for the performance gain that would result from increasing the cache.)
  • What cache size would have resulted in 50%/90%/95%/99%/100% of optimal effectiveness, where "optimal" means no block is ever fetched twice? (Optimal might be 200% effective, for example, if each file is read twice.) It might not be too hard/expensive to calculate this, as long as we're willing to keep the list of all hashes that have ever been in the cache. A rough simulation is sketched after this list.
  • What cache/Range block size would have resulted in optimal effectiveness for the given cache size? (I don't see offhand how to calculate this easily, or without keeping an awful lot of detailed logs saved up to analyze when exiting -- but we could track how many, say, 64KiB "sectors" had actually been used to serve callers whenever we eject a block from the cache, and come up with a formula like "median + 10%", as a reasonable suggestion...)
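The cache-size question could be answered offline by replaying a trace of block accesses against LRU caches of increasing size. A rough simulation, assuming new instrumentation records the trace of block locators in access order:

from collections import OrderedDict

def fetches_with_lru(trace, cache_slots):
    # Count keepstore fetches for a sequence of block locators under an LRU cache.
    cache = OrderedDict()
    fetches = 0
    for locator in trace:
        if locator in cache:
            cache.pop(locator)              # refresh: re-insert as most recent
        else:
            fetches += 1
            if len(cache) >= cache_slots:
                cache.popitem(last=False)   # evict the least recently used block
        cache[locator] = True
    return fetches

def smallest_cache_for(trace, target=1.0):
    # Smallest slot count whose effectiveness reaches `target` of optimal,
    # where optimal means each unique block is fetched exactly once.
    optimal = len(set(trace))
    for slots in range(1, optimal + 1):
        if optimal >= target * fetches_with_lru(trace, slots):
            return slots
    return optimal

For example, smallest_cache_for(trace, 0.9) gives the fewest 64 MiB slots needed to stay within 90% of the no-refetch optimum.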
Actions #12

Updated by Brett Smith about 8 years ago

  • Subject changed from crunchstat-summary should look for signs of cache thrashing and suggest increasing keep_cache_mb_per_task to [crunchstat-summary] Look for signs of cache thrashing and suggest increasing keep_cache_mb_per_task
Actions #13

Updated by Brett Smith about 8 years ago

Tom Clegg wrote:

Yes, I agree we need better instrumentation in order to get reliable indicators. No opinion on whether it should be part of this issue or a new one.

I'm happy to leave this open for the crunchstat-summary end (seems nice to keep the history), but can someone please make a separate story to do the necessary instrumentation to get the numbers we want?

Actions #14

Updated by Tom Clegg almost 7 years ago

  • Status changed from In Progress to Resolved