Bug #21639

closed

Python client keep cache is inefficient

Added by Peter Amstutz about 1 month ago. Updated 6 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Story points:
-
Release relationship:
Auto

Description

I was reviewing the code for possible bottlenecks and noticed that the keep cache is just a plain list which has to be iterated over every time. As cache sizes get to 100s of gigabytes, this list will have 1000s of entries.
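
For illustration, here is the difference in miniature (hypothetical names, not the actual KeepBlockCache code): a list-backed cache has to scan every slot to find a block, while a dict keyed by locator is a single lookup.

    # Hypothetical sketch, not the real KeepBlockCache implementation.
    # List-backed cache: every lookup scans all slots, O(n).
    def find_slot_list(cache_slots, locator):
        for slot in cache_slots:
            if slot.locator == locator:
                return slot
        return None

    # Dict keyed by locator: average-case O(1) lookup.
    def find_slot_dict(cache_slots, locator):
        return cache_slots.get(locator)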


Subtasks 2 (0 open, 2 closed)

Task #21643: Review 21639-keep-cache-dict - Resolved - Peter Amstutz - 04/04/2024
Task #21656: Review 21639-prefetch - Resolved - Peter Amstutz - 04/09/2024
Actions #1

Updated by Peter Amstutz about 1 month ago

  • Description updated (diff)
Actions #2

Updated by Brett Smith about 1 month ago

This may or may not be helpful, just spitballing: the stdlib includes a heapq module to access a list as a priority queue. If you're able to model the list elements in the way the module needs (see the Implementation Notes), this would let you improve search and pop performance without dramatically changing the data structure.
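
A minimal sketch of how the heapq idea could look (hypothetical names, not the SDK API): keep (last_access, locator) tuples in a heap so the least-recently-used entry can be popped in O(log n) instead of scanning the whole list. A real cache would also need to skip stale heap entries when a block is touched again.

    import heapq
    import itertools
    import time

    _tiebreak = itertools.count()   # keeps tuple comparison well-defined
    heap = []

    def record_access(locator):
        # Push a (timestamp, counter, locator) entry for this access.
        heapq.heappush(heap, (time.monotonic(), next(_tiebreak), locator))

    def pop_least_recent():
        # Pop the entry with the oldest recorded access, or None if empty.
        if heap:
            return heapq.heappop(heap)[2]
        return None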

Actions #3

Updated by Peter Amstutz about 1 month ago

  • Status changed from New to In Progress
Actions #4

Updated by Peter Amstutz about 1 month ago

On main, with a warm cache (all blocks are available on disk):

$ dd if=~/work/tmp/keep/by_id/db8dbff526d908903f75787c0d4e98e9+36204/ERR2122553_1.fastq.gz of=/dev/null bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 27.899 s, 385 MB/s

On 21639-keep-cache-dict:

$ dd if=~/work/tmp/keep/by_id/db8dbff526d908903f75787c0d4e98e9+36204/ERR2122553_1.fastq.gz of=/dev/null bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 12.6507 s, 849 MB/s
Actions #5

Updated by Peter Amstutz about 1 month ago

21639-keep-cache-dict @ bd04725444b7ab15ea81647c519a580ca50a94de

developer-run-tests: #4115

Improve the read() fast path when the block is in cache.

For every read that triggers a block fetch, there are 500+ reads from a cached block.

Any unnecessary work done in those 500+ reads significantly impacts throughput.

I used profiling to determine where it was spending its time, and found the following bottlenecks.

This improves throughput the following ways:

  • It turned out using ".tobytes()" on the memoryview slice was creating a copy. Returning the slice itself makes it zero-copy.
  • The overhead of determining whether to prefetch blocks on every read was significant. The fast path is faster with prefetching disabled.
  • The overhead of recomputing cache usage on every read was significant. Now it keeps a running total of cache usage so it only has to compare a single number.
  • Now uses collections.OrderedDict instead of a list to implement the LRU cache. This means amortized constant time instead of linear time for search, insertion, and deletion in the cache. (A rough sketch of the combined approach follows below.)
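
A rough sketch of the combined approach (illustrative only, not the actual KeepBlockCache code):

    import collections

    class LRUBlockCache:
        def __init__(self, cache_max):
            self.cache_max = cache_max
            self._cache = collections.OrderedDict()   # locator -> bytes-like block
            self.cache_total = 0                       # running total of cache usage

        def get(self, locator):
            block = self._cache.get(locator)
            if block is not None:
                self._cache.move_to_end(locator)       # mark most recently used
            return block

        def set(self, locator, data):
            self._cache[locator] = data
            self._cache.move_to_end(locator)
            self.cache_total += len(data)
            # Compare the running total against the limit; evict LRU entries as needed.
            while self.cache_total > self.cache_max and len(self._cache) > 1:
                _, evicted = self._cache.popitem(last=False)
                self.cache_total -= len(evicted)

    def read_from_block(block, offset, size):
        # Return a memoryview slice; calling .tobytes() here would copy the data.
        return memoryview(block)[offset:offset + size]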
Actions #6

Updated by Peter Amstutz about 1 month ago

md5sum out of Keep, fully cached:

real    0m53.382s
user    0m42.649s
sys    0m8.848s
d2d8fc2f43ad11ed17641f79941cae34  /home/peter/work/tmp/keep/by_id/db8dbff526d908903f75787c0d4e98e9+36204/ERR2122553_1.fastq.gz

md5sum on local file system (laptop SSD):

$ time md5sum /home/peter/work/tmp/ERR2122553_1.fastq.gz
d2d8fc2f43ad11ed17641f79941cae34  /home/peter/work/tmp/ERR2122553_1.fastq.gz

real    0m43.361s
user    0m33.827s
sys    0m7.807s

md5sum seems to be CPU-limited.

reading file out of Keep with dd, fully cached:

$ dd if=~/work/tmp/keep/by_id/db8dbff526d908903f75787c0d4e98e9+36204/ERR2122553_1.fastq.gz of=/dev/null bs=1M count=30240
24907+1 records in
24907+1 records out
26117417855 bytes (26 GB, 24 GiB) copied, 41.736 s, 626 MB/s

reading file from local filesystem (laptop SSD):

$ dd if=~/work/tmp/ERR2122553_1.fastq.gz of=/dev/null bs=1M count=30240
24907+1 records in
24907+1 records out
26117417855 bytes (26 GB, 24 GiB) copied, 25.0461 s, 1.0 GB/s
Actions #7

Updated by Peter Amstutz about 1 month ago

Another run from Keep; I think a lot more data had to be mapped back in this time:

$ dd if=~/work/tmp/keep/by_id/db8dbff526d908903f75787c0d4e98e9+36204/ERR2122553_1.fastq.gz of=~/work/tmp/ERR2122553_1.fastq.gz bs=1M count=30240
24907+1 records in
24907+1 records out
26117417855 bytes (26 GB, 24 GiB) copied, 174.424 s, 150 MB/s

From local filesystem again.

$ dd if=~/work/tmp/ERR2122553_1.fastq.gz of=/dev/null bs=1M count=30240
24907+1 records in
24907+1 records out
26117417855 bytes (26 GB, 24 GiB) copied, 23.6229 s, 1.1 GB/s

Actions #8

Updated by Peter Amstutz about 1 month ago

Adding madvise(mmap.MADV_WILLNEED): this tells the kernel we're planning to use the whole block, so we get fewer page faults:

$ dd if=~/work/tmp/keep/by_id/db8dbff526d908903f75787c0d4e98e9+36204/ERR2122553_1.fastq.gz of=~/work/tmp/ERR2122553_1.fastq.gz bs=1M count=30240
24907+1 records in
24907+1 records out
26117417855 bytes (26 GB, 24 GiB) copied, 64.5269 s, 405 MB/s

EDIT: I messed this up; this is writing to a file instead of /dev/null. When I send the output to /dev/null it is more like 550-650 MB/s.
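
A minimal sketch of the madvise change (hypothetical helper; the real code maps the cached block file in the disk cache slot):

    import mmap

    def map_block(path):
        # Map the cached block file read-only and hint that the whole mapping
        # will be needed soon, so the kernel reads it in ahead of the page faults.
        with open(path, "rb") as f:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        if hasattr(mmap, "MADV_WILLNEED"):    # madvise needs Python 3.8+ and OS support
            m.madvise(mmap.MADV_WILLNEED)
        return m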

Actions #9

Updated by Peter Amstutz about 1 month ago

21639-keep-cache-dict @ 8215e5931aeed29d82e94b5129a685e0b2f084c3

developer-run-tests: #4116

  • Added madvise(mmap.MADV_WILLNEED); this seems to significantly improve performance by advising the kernel to map in the whole block, reducing page faults.
Actions #10

Updated by Peter Amstutz about 1 month ago

21639-keep-cache-dict @ 10564bc0d78545b71a5815c3332403f2e8a0d562

developer-run-tests: #4118

  • All agreed upon points are implemented / addressed.
    • measurably improves performance when the source data is primarily in the local disk cache
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • passes tests, see above comments for benchmarks
  • Documentation has been updated.
    • n/a
  • Behaves appropriately at the intended scale (describe intended scale).
    • improves scale by delivering data to client applications faster
  • Considered backwards and forwards compatibility issues between client and server.
    • n/a
  • Follows our coding standards and GUI style guidelines.
    • yes

Improve the read() fast path when the block is in cache.

For every read that triggers a block fetch, there are 500+ reads from a cached block.

Any unnecessary work done in those 500+ reads significantly impacts throughput.

I used profiling to determine where it was spending its time.

This improves throughput the following ways:

  • It turned out using ".tobytes()" on the memoryview slice was creating a copy. Returning the slice itself makes it zero-copy.
  • The overhead of determining whether to prefetch blocks on every read was significant. The fast path is faster with prefetching disabled.
  • The overhead of recomputing cache usage on every read was significant. Now it keeps a running total of cache usage so it only has to compare a single number.
  • Now uses collections.OrderedDict instead of a list to implement the LRU cache. This means amortized constant time instead of linear time for search, insertion, and deletion in the cache.
  • Added madvise(mmap.MADV_WILLNEED); this seems to significantly improve performance by advising the kernel to map in the whole block, reducing page faults.

From benchmarking:

  • delivers data at 500-600 MB/s when blocks need to be re-read from SSD
  • delivers data at 800 MB/s when blocks are mostly already in RAM, although I've gotten up to 1.7 GiB/s over multiple trials when the kernel starts returning cached file data before even asking FUSE for it.
Actions #12

Updated by Brett Smith about 1 month ago

Peter Amstutz wrote in #note-10:

21639-keep-cache-dict @ 10564bc0d78545b71a5815c3332403f2e8a0d562

  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • passes tests, see above comments for benchmarks

No, the tests are not passing. Please address that.

In KeepClient.__init__:

+        if num_prefetch_threads is not None:
+            self.num_prefetch_threads = num_prefetch_threads
+        else:
+            self.num_prefetch_threads = 2

Question for discussion: should we just make 0 the default? I grant it's not immediately obvious we should: clients like arv-get are gonna have different access patterns than FUSE. But like you said at standup, we're seeing a pattern that prefetch is generally not worth the trouble. If we have no evidence that prefetch helps arv-get either, I think we should at least consider making the change at the SDK level.

If you don't feel comfortable doing it now, that's fine, but maybe we should have a follow-up ticket to investigate and discuss?

At the bottom of DiskCacheSlot.set:

+        return False

As far as I can tell this is unreachable. Either the try block will return True, or it'll raise an exception. Since there are no except blocks, we have no way to get to this point.

If I'm following right, I'd suggest removing this, mainly as a readability thing to avoid the red herring for future readers. You might consider moving the return True out of the try block to here to reinforce the point, but that's more subjective and I don't feel strongly about it either way.
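
Purely as an illustration of the shape being suggested (not the actual DiskCacheSlot.set body, just the structure, assuming a try/finally):

    # Before: the trailing return False can never be reached, because the
    # try block either returns True or raises.
    def set(self, value):
        try:
            ...            # write the block into the cache slot
            return True
        finally:
            ...            # bookkeeping
        return False       # unreachable

    # After: drop the dead return, optionally moving return True out of the try.
    def set(self, value):
        try:
            ...            # write the block into the cache slot
        finally:
            ...            # bookkeeping
        return True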

In KeepBlockCache.set:

+                    sm = sum([st.size() for st in self._cache.values()])

Since you're touching this line anyway you might as well remove the brackets and just pass the raw generator expression here. If we're aiming to improve performance, no need to build a list comprehension just to immediately throw away the list.
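
That is, the suggested form would be:

    sm = sum(st.size() for st in self._cache.values())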

Same PEP 8 comment about empty sequences I made about the FUSE branch applies here too. In DiskCacheSlot.evict, you can just write this as if not self.content:

+        if self.content is None or len(self.content) == 0:
+            return

In KeepBlockCache._resize_cache, the first half of the condition:

+        while len(_evict_candidates) > 0 and (self.cache_total > cache_max or len(self._cache) > max_slots):
Actions #14

Updated by Peter Amstutz about 1 month ago

Brett Smith wrote in #note-12:

Peter Amstutz wrote in #note-10:

21639-keep-cache-dict @ 10564bc0d78545b71a5815c3332403f2e8a0d562

  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • passes tests, see above comments for benchmarks

No, the tests are not passing. Please address that.

Should be passing now.

If you don't feel comfortable doing it now, that's fine, but maybe we should have a follow-up ticket to investigate and discuss?

Yea, I don't want to mess with it without additional investigation. Added #21647

If I'm following right, I'd suggest removing this, mainly as a readability thing to avoid the red herring for future readers. You might consider moving the return True out of the try block to here to reinforce the point, but that's more subjective and I don't feel strongly about it either way.

I removed it.

Since you're touching this line anyway you might as well remove the brackets and just pass the raw generator expression here. If we're aiming to improve performance, no need to build a list comprehension just to immediately throw away the list.

Done.

21639-keep-cache-dict @ 8090c845fea8d9532be5c4347f06dcf9fa7c868c

developer-run-tests: #4128

Actions #15

Updated by Brett Smith about 1 month ago

Peter Amstutz wrote in #note-14:

21639-keep-cache-dict @ 8090c845fea8d9532be5c4347f06dcf9fa7c868c

developer-run-tests: #4128

This lgtm, thanks.

I have one sort of general long-term maintenance worry; it's not really specific to this branch and I don't have a solution at hand. Since our code still contains a lot of older-Python compatibility stuff, it would be easy for new readers to guess that every instance of collections.OrderedDict is there just to get ordering, basically like I did in the last review. I wish we had an easy way to flag "this is a recent, intentional choice." Sometimes using move_to_end will catch it in tests, but not always, like in DiskCacheSlot.init_cache here. And I don't want you to write the justification comment you did in the last branch every single time; that's not reasonable.

Just as one idea, what if we gave it a type hint? e.g.,

cachelist: collections.OrderedDict = collections.OrderedDict()

That obviously never ran in Python 2.7, and we're asserting that it must specifically be an OrderedDict, not just a dict or Mapping or anything looser. It's still not foolproof, but it's a simple signal and forward-looking.

Actions #16

Updated by Peter Amstutz about 1 month ago

  • Status changed from In Progress to Resolved
Actions #17

Updated by Peter Amstutz about 1 month ago

  • Status changed from Resolved to Feedback

2.7.2~rc3 https://workbench.pirca.arvadosapi.com/processes/pirca-xvhdp-zyv7bm0tl3lm2nv

2.7.1~rc3 https://workbench.pirca.arvadosapi.com/processes/pirca-xvhdp-6sncuiyzzoaa5r2

There is a small, but I think real, performance regression here. If you compare the individual steps side-by-side, the 2.7.2~rc3 run is fairly consistently some seconds slower than the 2.7.1~rc3 run (although the overall workflow runtime is almost exactly the same!)

Even more interesting, if you go back and look at previous runs: the 2.5.x - 2.7.1 times are pretty consistent (since disk caching was enabled); the 2.4.x times are terrible (ostensibly this was when prefetching was bug-fixed, and it did measurably improve other workflows, but it seems to have actually made this particular workflow worse); and the times for 2.3.3~rc4 (https://workbench.pirca.arvadosapi.com/processes/pirca-xvhdp-f0syhaoqy7jfqg2) are the most similar to 2.7.2~rc3. In other words, the times are most consistent with a previous version of arv-mount where prefetching was known not to be working as intended.

The reason I disabled prefetching was that it measurably impacted performance on the high end when blocks were already in cache. However, I think I've come up with a strategy that minimizes overhead while still getting the benefits of prefetching:

21639-prefetch @ 14088c5bd88ede15698ba8f5b8efbf08e18374ce

Indeed, this may even be a bit faster on the high end as well by using prefetch to encourage the kernel to map in on-disk blocks that are about to be used.

developer-run-tests: #4135

One thing that's annoying about this is that a full benchmark integration test requires updating arv-mount on the compute node.

Actions #18

Updated by Peter Amstutz about 1 month ago

21639-prefetch @ 3832c780144d8491403582d9d1142c150fde6d1f

developer-run-tests: #4137

Wrap the read counter around, with slightly less indirection. Add a comment about the "every 128 read ops" logic.
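
A rough sketch of the counter logic (hypothetical names; the real change is in the ArvadosFile read path):

    PREFETCH_CHECK_INTERVAL = 128   # with FUSE's 128 KiB max_read, ~16 MiB per check

    class ReadState:
        def __init__(self, prefetch_enabled):
            self.prefetch_enabled = prefetch_enabled
            self._read_counter = 0

        def should_check_prefetch(self):
            # True on the first read and then every 128th read; the counter
            # wraps around instead of growing without bound.
            if not self.prefetch_enabled:
                return False
            counter = self._read_counter
            self._read_counter = (counter + 1) % PREFETCH_CHECK_INTERVAL
            return counter == 0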

Actions #19

Updated by Peter Amstutz about 1 month ago

ok, in order to compare performance I cherry picked these commits on to 2.7.2~rc4 and then updated the config on pirca to run an init script that upgrades arv-mount (so it would get 2.7.2~rc4 arv-mount without needing to build a new compute image). I'll unwind the cheat when the full release is ready.

In the meantime, here is the exact same workflow, but arv-mount should have the "prefetch every 128 reads" behavior.

https://workbench.pirca.arvadosapi.com/processes/pirca-xvhdp-lhq2wcebm53s32k

Actions #20

Updated by Peter Amstutz about 1 month ago

Comparing bwa mem side by side, the 2.7.2~rc4 build now almost exactly matches 2.7.1, both of which are better than 2.7.2~rc3.

I also looked at crunchstat-summary for these three builds, and while 2.7.1 and 2.7.2~rc4 have a nice smooth curve when loading data (ramps up, plateaus around 150 MB/s, then ramps down), the 2.7.2~rc3 curve has humps (goes up, then down, then up again, then down) and also doesn't peak as high.

So this data shows that arv-mount does measurably benefit from prefetching.

Now, based on our experience with keep-web, it is entirely possible that when arv-mount is paired with the new streaming keepstore, prefetching will be less important.

Actions #21

Updated by Brett Smith about 1 month ago

Peter Amstutz wrote in #note-18:

21639-prefetch @ 3832c780144d8491403582d9d1142c150fde6d1f

So my main question is, where do the numbers come from? The comments use an example of 128KiB reads leading to prefetch every 16MiB, which makes sense and is fine as far as it goes, but I'm left wondering: is there some reason to believe 128KiB is a typical or average read size? And even assuming the case we're optimizing for is reading a bunch of 64MiB blocks in sequence (which sure, seems reasonable), is there a specific reason to believe prefetching four blocks per block read is a good rate?

If we had unlimited resources and time it seems like we'd ideally instrument a bunch of common bioinformatics tools and try them with different moduli and look for a sweet spot. Obviously we don't have that. But what do we have informing these numbers? I'm guessing there's at least a little intuition, and that's fine; the logic seems clear enough that I don't think we need a full empirical study to make a change. But I think for future development it would help to have a fuller understanding of what informed these numbers, even when that source is intuition, so we know where there might be room for improvement in the future.

I think automated tests like this would be good:

  • Make sure ArvadosFile doesn't try to prefetch when the Keep client isn't configured for it.
  • Make sure ArvadosFile prefetches a block on the first read.
  • Given a sufficiently large number of reads, assert 1 < prefetched_blocks_count < num_reads.

I think you could write these flexibly enough that they don't need to care about exactly how often we prefetch. You can check and assert the behavior is there without specific numbers.
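
For example, a rough shape for those assertions, written against a tiny stand-in reader so it stays self-contained (the real tests would drive ArvadosFile with a mocked block manager):

    import unittest

    class FakeReader:
        def __init__(self, prefetch_enabled, check_interval=128):
            self.prefetch_enabled = prefetch_enabled
            self.check_interval = check_interval
            self.reads = 0
            self.prefetches = 0

        def read(self):
            if self.prefetch_enabled and self.reads % self.check_interval == 0:
                self.prefetches += 1
            self.reads += 1

    class PrefetchBehaviorTest(unittest.TestCase):
        def test_no_prefetch_when_disabled(self):
            r = FakeReader(prefetch_enabled=False)
            for _ in range(1000):
                r.read()
            self.assertEqual(r.prefetches, 0)

        def test_prefetch_on_first_read(self):
            r = FakeReader(prefetch_enabled=True)
            r.read()
            self.assertEqual(r.prefetches, 1)

        def test_prefetch_count_between_one_and_read_count(self):
            r = FakeReader(prefetch_enabled=True)
            for _ in range(1000):
                r.read()
            self.assertTrue(1 < r.prefetches < r.reads)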

It would be a nice readability improvement if _read_counter was an iterator that tracked its own state rather than a raw number needing to be manipulated in methods. You could also take advantage of this to do some checks once in __init__ rather than in every call to readfrom. e.g., ArvadosFile.__init__ could say:

if self.parent._my_block_manager().prefetch_lookahead:
    self._read_counter = itertools.cycle(range(128))
else:
    self._read_counter = itertools.repeat(1)

and then the new block in ArvadosFile.readfrom gets simplified to:

if next(self._read_counter) == 0:
    prefetch = locators_and_ranges(...)
else:
    prefetch = None

Thanks.

Actions #22

Updated by Peter Amstutz about 1 month ago

Brett Smith wrote in #note-21:

Peter Amstutz wrote in #note-18:

21639-prefetch @ 3832c780144d8491403582d9d1142c150fde6d1f

So my main question is, where do the numbers come from? The comments use an example of 128KiB reads leading to prefetch every 16MiB, which makes sense and is fine as far as it goes, but I'm left wondering: is there some reason to believe 128KiB is a typical or average read size?

    def _fuse_options(self):
        """FUSE mount options; see mount.fuse(8)""" 
        opts = [optname for optname in ['allow_other', 'debug']
                if getattr(self.args, optname)]
        # Increase default read/write size from 4KiB to 128KiB
        opts += ["big_writes", "max_read=131072"]
        if self.args.subtype:
            opts += ["subtype="+self.args.subtype]
        return opts

https://manpages.debian.org/testing/fuse/mount.fuse.8.en.html#max_read=N

"With this option the maximum size of read operations can be set. The default is infinite. Note that the size of read requests is limited anyway to 32 pages (which is 128kbyte on i386)."

And even assuming the case we're optimizing for is reading a bunch of 64MiB blocks in sequence (which sure, seems reasonable), is there a specific reason to believe prefetching four blocks per block read is a good rate?

It's not prefetching four blocks, it's checking for readahead four times per block. The default is only to read ahead two blocks (prefetch_lookahead = self._keep.num_prefetch_threads, and the default for num_prefetch_threads is 2), so what would normally happen is that the 1st read also triggers the prefetch and the next 3 prefetch checks don't do anything, because the next two blocks are already in flight or in cache. However, I'm trying to balance this against the possibility of smaller blocks, larger reads, seeks, etc.

If we had unlimited resources and time it seems like we'd ideally instrument a bunch of common bioinformatics tools and try them with different moduli and look for a sweet spot. Obviously we don't have that. But what do we have informing these numbers? I'm guessing there's at least a little intuition, and that's fine; the logic seems clear enough that I don't think we need a full empirical study to make a change. But I think for future development it would help to have a fuller understanding of what informed these numbers, even when that source is intuition, so we know where there might be room for improvement in the future.

As noted above, this is mainly motivated by the size of the buffers used to communicate with the kernel, not so much by the behavior of individual bioinformatics tools.
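
Putting the numbers together (assuming the default 64 MiB block size):

    reads_per_check = 128                              # prefetch check interval
    read_size = 128 * 1024                             # FUSE max_read, in bytes
    bytes_per_check = reads_per_check * read_size      # 16 MiB between checks
    checks_per_block = (64 * 1024 * 1024) // bytes_per_check   # = 4 checks per block
    # Each check reads ahead at most num_prefetch_threads = 2 blocks.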

I think automated tests like this would be good:

  • Make sure ArvadosFile doesn't try to prefetch when the Keep client isn't configured for it.
  • Make sure ArvadosFile prefetches a block on the first read.
  • Given a sufficiently large number of reads, assert 1 < prefetched_blocks_count < num_reads.

I think you could write these flexibly enough that they don't need to care about exactly how often we prefetch. You can check and assert the behavior is there without specific numbers.

There's ArvadosFileReaderTestCase.test_prefetch; let me see how much work it is to expand the tests a bit.

It would be a nice readability improvement if _read_counter was an iterator that tracked its own state rather than a raw number needing to be manipulated in methods. You could also take advantage of this to do some checks once in __init__ rather than in every call to readfrom. e.g., ArvadosFile.__init__ could say:

I am disinclined to do this because the original reason for the change in behavior is optimization, and using itertools seems slightly more complex (in terms of internal implementation, the iterator protocol would involve multiple method calls) -- it's hard to get any lighter weight than two integer operations.
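
If anyone does want to measure it, a quick sketch of the comparison with timeit (illustrative only):

    import itertools
    import timeit

    def modulo_counter(n=1_000_000, interval=128):
        counter = hits = 0
        for _ in range(n):
            if counter == 0:
                hits += 1
            counter = (counter + 1) % interval
        return hits

    def cycle_counter(n=1_000_000, interval=128):
        it = itertools.cycle(range(interval))
        hits = 0
        for _ in range(n):
            if next(it) == 0:
                hits += 1
        return hits

    print("modulo:", timeit.timeit(modulo_counter, number=5))
    print("cycle: ", timeit.timeit(cycle_counter, number=5))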

Actions #23

Updated by Peter Amstutz about 1 month ago

21639-prefetch @ 4ca460797b3b2c80141524b34399171749b4d482

Added some additional prefetch tests

developer-run-tests: #4143

Actions #24

Updated by Brett Smith about 1 month ago

Test rerun developer-run-tests: #4144

The explanation of the numbers makes sense. It would be great to feed some of that back into the comment in the code. Even just pointing out that 128KiB is the max_read size we set elsewhere, and therefore a 128-read cycle gives us a 2:1 "check for prefetch:do prefetch" ratio, would be a lot of helpful context.

The new tests look good, thanks for that. Unfortunately there are other test failures that need to be addressed:

======================================================================
FAIL: test_inode_cleared (tests.test_inodes.InodeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/workspace/developer-run-tests-doc-pysdk-api-fuse/services/fuse/tests/test_inodes.py", line 87, in test_inode_cleared
    self.assertEqual(600, cache.total())
AssertionError: 600 != 1100

----------------------------------------------------------------------

I am disinclined to do this because the original reason for the change in behavior is optimization, and using itertools seems slightly more complex (in terms of internal implementation, the iterator protocol would involve multiple method calls) -- it's hard to get any lighter weight than two integer operations.

Fine. I'll just say, even thinking only about optimization and no other considerations, I don't think it's obvious that "pure" math on integers will necessarily be faster than iteration. But it's not a big enough deal for me to do the microbenchmark.

Actions #26

Updated by Peter Amstutz 29 days ago

  • Status changed from Feedback to Resolved
Actions #27

Updated by Peter Amstutz 6 days ago

  • Release set to 69