Feature #18842


Local disk keep cache for Python SDK/arv-mount

Added by Peter Amstutz about 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: SDKs
Target version: 2022-12-07 Sprint
Story points: -
Release: 47
Release relationship: Auto

Description

Option to cache blocks on a local/high-performance file system for situations where a large amount of data will be read multiple times or read with random access, and storing a significant portion of the blocks in memory is infeasible.


Subtasks 2 (0 open, 2 closed)

Task #19450: Review 18842-arv-mount-disk-cache (Resolved, Peter Amstutz, 10/21/2022)
Task #19638: Review 18842-arv-mount-disk-config (Resolved, Peter Amstutz, 10/21/2022)

Related issues

Related to Arvados - Bug #19981: Containers that used an old DefaultKeepCacheRAM no longer get reused after a configuration change (New, 03/05/2023)
Is duplicate of Arvados - Idea #3640: [SDKs] Add runtime option to SDKs (esp Python and arv-mount) to use a filesystem directory block cache as an alternative to RAM cache. (Closed)
#2

Updated by Peter Amstutz about 2 years ago

  • Description updated (diff)
#3

Updated by Peter Amstutz about 2 years ago

  • Is duplicate of Idea #3640: [SDKs] Add runtime option to SDKs (esp Python and arv-mount) to use a filesystem directory block cache as an alternative to RAM cache. added
#4

Updated by Peter Amstutz over 1 year ago

  • Target version set to 2022-09-14 sprint
#5

Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Peter Amstutz
#6

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-09-14 sprint to 2022-10-12 sprint
#7

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-10-12 sprint to 2022-10-26 sprint
#8

Updated by Peter Amstutz over 1 year ago

18842-arv-mount-disk-cache @ 67e56f190b9a78e3c45cc7d90510fc631e0d04b6

  • Add DiskCacheSlot, which takes a block, writes it to disk, then returns a memory-mapped object
  • This "just works" with the existing zero-copy approach based on memoryview
  • Uses atomic write/rename (sketched below) and organizes blocks into directories the same way keepstore does
  • If a block already exists on disk (but not in the list of cached/mmapped blocks), it will read it and check the MD5
  • Sets its own cache limits based on the process's open file descriptor limit and available disk space
  • From a warm start, where the blocks exist on disk but arv-mount is freshly restarted, data is served very quickly (and blocks are not re-downloaded)
  • I originally thought of a 4 GiB default limit; I bumped it up to 8 GiB, but I'm still undecided on the right default
  • cap_cache deletes blocks from disk that it knows about when the cache exceeds its limits
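
As a rough illustration of the write/rename/mmap sequence described above (a minimal sketch, not the actual arvados.keep code -- the names cache_block and CACHE_EXT are invented here):

import mmap
import os
import tempfile

CACHE_EXT = ".keepcacheblock"  # hypothetical extension

def cache_block(cachedir, locator_md5, data):
    # Shard blocks into subdirectories by hash prefix, keepstore-style.
    blockdir = os.path.join(cachedir, locator_md5[0:3])
    os.makedirs(blockdir, exist_ok=True)
    final = os.path.join(blockdir, locator_md5 + CACHE_EXT)

    # Write to a temp file in the same directory, then rename into
    # place, so a concurrent reader never sees a partial block.
    fd, tmpname = tempfile.mkstemp(dir=blockdir)
    try:
        os.write(fd, data)
        # Map before renaming so the fd can be closed either way; a
        # memoryview over the mmap plugs into the zero-copy read path.
        content = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        os.rename(tmpname, final)
    except BaseException:
        os.unlink(tmpname)
        raise
    finally:
        os.close(fd)
    return memoryview(content)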

A couple of details I haven't quite decided on:

  • The cache directory is shared between processes. This lets them share each other's blocks, but since each process manages space independently, they could also delete each other's blocks. That won't crash anything, but it could be inefficient.
  • Right now it also doesn't do anything with the cache directory when arv-mount starts or stops; it just leaves the blocks in place, so if you access different data every time, the cache could still grow without bound.

Mulling this over: since I need to keep the files open anyhow, I can do something with file locks, along the lines sketched below.
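
A sketch of what that could look like with advisory flock (hypothetical helpers; this is not what the branch does yet):

import fcntl
import os

def open_and_pin(path):
    # Hold a shared lock while the block is open/mapped, marking it
    # "in use" to other processes sharing the cache directory.
    fd = os.open(path, os.O_RDONLY)
    fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
    return fd

def try_evict(path):
    # Delete a block only if no process holds a lock on it.
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return False  # still pinned by some arv-mount process
    else:
        os.unlink(path)
        return True
    finally:
        os.close(fd)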

I'll follow up with some rough benchmarks.

developer-run-tests: #3326

#9

Updated by Peter Amstutz over 1 year ago

18842-arv-mount-disk-cache @ 356fa1d4d17c402470a6b49ae943e7ec278b1a46

  • Maps in everything at start and prunes any excess blocks.

developer-run-tests: #3329

TODO: figure out which config.yml options need to be added and make crunch-run use them.

#10

Updated by Peter Amstutz over 1 year ago

  • Status changed from New to In Progress
#11

Updated by Peter Amstutz over 1 year ago

18842-arv-mount-disk-config @ 858a8ce2394b57540f95fa794e894e3315508c8b

  • Adds a keep_cache_disk option to runtime_constraints (example request below), and DefaultKeepCacheRAM and DefaultKeepCacheDisk configurations
  • crunch-run uses the disk cache if keep_cache_disk > 0; otherwise you get the RAM cache
    • This maintains backwards compatibility with older versions of arvados-cwl-runner: if disk cache isn't specified in the request, you get the old behavior
    • Changed the config.yml default to use the disk cache
  • Incorporates the disk cache in the disk space estimate for the request
  • Updated tests
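
For example, a container request opting into the disk cache might be submitted like this with the Python SDK (illustrative values; only the keep_cache_disk field comes from this branch):

import arvados

api = arvados.api('v1')
api.container_requests().create(body={
    'container_request': {
        'name': 'disk cache example',
        'container_image': 'arvados/jobs',
        'command': ['my-analysis'],
        'output_path': '/out',
        'runtime_constraints': {
            'vcpus': 1,
            'ram': 1 << 30,
            # keep_cache_disk > 0 selects the disk cache; leaving it
            # unset (or 0) preserves the old RAM-cache behavior.
            'keep_cache_disk': 8 << 30,
        },
    },
}).execute()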

developer-run-tests: #3342

#12

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-10-26 sprint to 2022-11-09 sprint
#13

Updated by Tom Clegg over 1 year ago

Should we make it a config error to have ram>0 and disk>0? Currently in that case we reserve both, but don't use the RAM.

The upgrade instructions don't explicitly say to update the python-arvados-fuse etc. packages on your slurm/lsf nodes -- is this worth adding? (We do mention upgrading the compute image for cloud.)

I'm not sure the parameterized tests are really working.

First, this block_cache class variable in mount_test_base.py should really be disk_cache, so it's controlled by the parameterized tests with {"disk_cache": True}, right?

class MountTestBase(unittest.TestCase):
    block_cache = False
...
make_block_cache(self.block_cache)

Even with this change, disk_cache=True doesn't seem to propagate everywhere like I expected. I tried showing the block_cache arg and traceback.print_stack() in KeepClient.__init__ and found that, for example, the collection in FuseCreateFileTest.runTest() seems to be setting up its own KeepClient that uses a memory cache:

KeepClient.__init__: block_cache = None
...
  File "/home/tom/arvados/services/fuse/tests/test_mount.py", line 431, in runTest
    collection.save_new()
  File "/home/tom/arvados/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/arvfile.py", line 469, in must_be_writable_wrapper
    return orig_func(self, *args, **kwargs)
  File "/home/tom/arvados/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/arvfile.py", line 270, in synchronized_wrapper
    return orig_func(self, *args, **kwargs)
  File "/home/tom/arvados/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/retry.py", line 177, in num_retries_setter
    return orig_func(self, *args, **kwargs)
  File "/home/tom/arvados/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/collection.py", line 1732, in save_new
    self._my_block_manager().commit_all()
  File "/home/tom/arvados/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/arvfile.py", line 270, in synchronized_wrapper
    return orig_func(self, *args, **kwargs)
  File "/home/tom/arvados/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/collection.py", line 1429, in _my_block_manager
    self._block_manager = _BlockManager(self._my_keep(),
  File "/home/tom/arvados/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/arvfile.py", line 270, in synchronized_wrapper
    return orig_func(self, *args, **kwargs)
  File "/home/tom/arvados/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/collection.py", line 1420, in _my_keep
    self._keep_client = KeepClient(api_client=self._api_client)
  File "/home/tom/arvados/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/keep.py", line 897, in __init__
    traceback.print_stack()
KeepBlockCache: self._disk_cache = False

I haven't tried to figure out whether the propagation happens properly in the non-testing code, although it seems like we'd already have memory explosion problems if the regular fuse code paths caused multiple block caches to be created. Is it worth adding something like "error out if some code path instantiates the wrong kind of block cache" to the testing setup?

#14

Updated by Tom Clegg over 1 year ago

I see a couple of these. Is this really okay? The one in DiskCacheSlot.set() looks suspicious -- won't that result in a cache slot that behaves as if everything is OK but returns empty content?

        except Exception as e:
            traceback.print_exc()
We should probably handle errors like these at least gracefully enough to suggest switching to memory cache:
  • ENODEV The underlying filesystem of the specified file does not support memory mapping.
  • ENOMEM The process's maximum number of mappings would have been exceeded.
It looks like ENOSPC will cause DiskCacheSlot.set() to start leaving partially-written temp files in the cache dir (as well as printing tracebacks), and the temp files won't get cleaned up until a future init_cache() hits max slots and starts deleting old files. Perhaps we could:
  • in set(), delete the temp file if we hit an exception before renaming it into place
  • in init_cache(), delete temp files that are > 5 seconds old and aren't flocked

We should use a recognizable naming pattern for tempfiles, and check that a filename matches hash or tempfile pattern before deleting it -- currently I suspect a misconfiguration like --cache-dir=$HOME could be silently disastrous.
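
Something along these lines, perhaps (a sketch; the extension and temp-file patterns are invented here, and the branch may choose differently):

import fcntl
import os
import re
import time

# Only ever touch names matching known patterns, so a misconfigured
# --cache-dir pointed at $HOME can't delete unrelated files. Block
# files would be checked against BLOCK_RE by the eviction path (not
# shown here).
BLOCK_RE = re.compile(r'^[0-9a-f]{32}\.keepcacheblock$')
TMP_RE = re.compile(r'^tmp.*\.keepcacheblock$')

def clean_tempfiles(blockdir, max_age=5):
    for name in os.listdir(blockdir):
        if not TMP_RE.match(name):
            continue
        path = os.path.join(blockdir, name)
        try:
            if time.time() - os.path.getmtime(path) < max_age:
                continue  # a writer may still be filling it in
            fd = os.open(path, os.O_RDONLY)
        except FileNotFoundError:
            continue      # another process cleaned it up first
        try:
            # Stale and not flocked by a live writer: safe to delete.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            os.unlink(path)
        except BlockingIOError:
            pass
        finally:
            os.close(fd)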

Defaulting to 1/2 available disk space up to 64 GiB seems excessive.
  • on a system with several users, 64 GiB per user is an awful lot of space for "things that went through arv-get or arv-put some time ago" (arv-get/put will use this, right?)
  • successive invocations would use smaller and smaller caches (for the single-user case we could fix this by adding the existing cache size when calculating free space)

Perhaps min(1G, 0.1*total_disk_size) would be a reasonable default?

#15

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-11-09 sprint to 2022-11-23 sprint
#16

Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote in #note-13:

Should we make it a config error to have ram>0 and disk>0? Currently in that case we reserve both, but don't use the RAM.

It's sort of harmless, but yeah, we could make it a warning or error.

The upgrade instructions don't explicitly say to update the python-arvados-fuse etc. packages on your slurm/lsf nodes -- is this worth adding? (We do mention upgrading the compute image for cloud.)

I'm not sure the parameterized tests are really working.

First, this block_cache class variable in mount_test_base.py should really be disk_cache, so it's controlled by the parameterized tests with {"disk_cache": True}, right?

Yep, that was a mistake. Fixed.

[...]

Even with this change, disk_cache=True doesn't seem to propagate everywhere like I expected. I tried showing the block_cache arg and traceback.print_stack() in KeepClient.__init__ and found that, for example, the collection in FuseCreateFileTest.runTest() seems to be setting up its own KeepClient that uses a memory cache:

[...]

I think those cases were all in places where it was creating a Collection object to write files, so the disk cache flag didn't really matter. What I did do was add a check to the constructor to look for ThreadsafeApiCache objects and use the KeepClient from there, instead of creating a new one. So I think this ensures it is actually using the desired KeepClient object with the disk cache in those cases.

I haven't tried to figure out whether the propagation happens properly in the non-testing code, although it seems like we'd already have memory explosion problems if the regular fuse code paths caused multiple block caches to be created. Is it worth adding something like "error out if some code path instantiates the wrong kind of block cache" to the testing setup?

I'd have to set up some testing-only global flag that it would have to check against; I don't know if that's really necessary.

Tom Clegg wrote in #note-14:

I see a couple of these. Is this really okay? The one in DiskCacheSlot.set() looks suspicious -- won't that result in a cache slot that behaves as if everything is OK but returns empty content?

[...]

When ready is set and content is None, that means "a cache slot was reserved for this, but no value was provided, so the read must have failed" -- in that case it will go try to read the block from a different keepstore.

We should probably handle errors like these at least gracefully enough to suggest switching to memory cache:
  • ENODEV The underlying filesystem of the specified file does not support memory mapping.
  • ENOMEM The process's maximum number of mappings would have been exceeded.

It now catches those errors and gives appropriate messages. In the case of a failure, it just falls back directly to RAM caching behavior (it sets self.content = value).

It looks like ENOSPC will cause DiskCacheSlot.set() to start leaving partially-written temp files in the cache dir (as well as printing tracebacks), and the temp files won't get cleaned up until a future init_cache() hits max slots and starts deleting old files. Perhaps we could:
  • in set(), delete the temp file if we hit an exception before renaming it into place
  • in init_cache(), delete temp files that are > 5 seconds old and aren't flocked

Done.

We should use a recognizable naming pattern for tempfiles, and check that a filename matches hash or tempfile pattern before deleting it -- currently I suspect a misconfiguration like --cache-dir=$HOME could be silently disastrous.

Good idea. Blocks get an extension and now it won't mess with any files that don't have that extension.

Defaulting to 1/2 available disk space up to 64 GiB seems excessive.
  • on a system with several users, 64 GiB per user is an awful lot of space for "things that went through arv-get or arv-put some time ago" (arv-get/put will use this, right?)
  • successive invocations would use smaller and smaller caches (for the single-user case we could fix this by adding the existing cache size when calculating free space)

Perhaps min(1G, 0.1*total_disk_size) would be a reasonable default?

The calculation is now the smallest of (sketched below):

  • 10% of total disk size
  • 25% of available space
  • max_slots * 64 MiB
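
In rough Python terms (a sketch, not the branch's exact code):

import shutil

def disk_cache_limit(cachedir, max_slots):
    usage = shutil.disk_usage(cachedir)
    return min(usage.total // 10,             # 10% of total disk size
               usage.free // 4,               # 25% of available space
               max_slots * 64 * 1024 * 1024)  # max_slots 64 MiB blocks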

I added a few new tests specifically for disk caching behavior.

18842-arv-mount-disk-config @ 2cddd3c4e531e37c6d960c08da91a552bf75a1cc

developer-run-tests: #3391

#17

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-11-23 sprint to 2022-12-07 Sprint
#18

Updated by Tom Clegg over 1 year ago

If we handle errors by caching the block in memory, there seem to be a couple of not-too-unlikely scenarios (can't create/write cache dir, disk full) where we end up caching in memory but using a size limit that was chosen based on disk space -- either specified by caller or a default based on total/available disk space -- which will often be unrealistic for a memory cache. I'm not sure of the best way to handle this (or how close to "best" we really need to get). Thoughts:
  • If the cache dir is completely unusable at startup (can't even create/replace a one-byte file in DiskCacheSlot.__init__()) maybe it's better to error out right away? (if so, this seems easy)
  • If things are working but then we hit ENOSPC when we're at 12G of our 25G limit, maybe it's better to reduce the limit to 10G, evict some old blocks to get down to the new limit, and retry? (this seems a little less easy)
  • Is it feasible to set a smaller limit for the fall-back-to-memory cache -- 4 blocks, a percentage of RAM, or ...? I don't see a convenient way to do this with the current code structure but maybe you do?

When disk cache fails, there's at least some chance you have enough memory to succeed with the current code, so giving up too easily would be unfortunate. OTOH, automatically escalating a variety of disk errors into OOM errors would also be unfortunate.

Rest LGTM.

#19

Updated by Tom Clegg over 1 year ago

The calculation is now the smallest of:

  • 10% of total disk size
  • 25% of available space
  • max_slots * 64 MiB

I think this means we need to add some stuff like needScratch = max(needScratch, ctr.RuntimeConstraints.KeepCacheDisk*10) to lib/dispatchcloud/node_size.go.

I think we also need to add the pre-existing cache usage to "avail" before clamping to that limit.

comment typo "256 GiB in RAM" should be 256 MiB (phew)

#20

Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote in #note-18:

  • If the cache dir is completely unusable at startup (can't even create/replace a one-byte file in DiskCacheSlot.__init__()) maybe it's better to error out right away? (if so, this seems easy)

Done; it now writes a one-byte block at startup to catch early errors.

  • If things are working but then we hit ENOSPC when we're at 12G of our 25G limit, maybe it's better to reduce the limit to 10G, evict some old blocks to get down to the new limit, and retry? (this seems a little less easy)

Done. When it gets ENOSPC or ENOMEM, it adjusts the limit down and tries again.
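
Roughly like this (a sketch of the shrink-and-retry idea; cap_cache is mentioned earlier in this ticket, while cache_max and cache_total are assumed names):

import errno

def set_with_retry(cache, slot, blob):
    try:
        slot.set(blob)
    except OSError as e:
        if e.errno not in (errno.ENOSPC, errno.ENOMEM):
            raise
        # Lower the limit below current usage, evict down to it, and
        # try once more; a second failure propagates to the caller.
        cache.cache_max = max(cache.cache_total() - len(blob),
                              64 * 1024 * 1024)
        cache.cap_cache()
        slot.set(blob)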

  • Is it feasible to set a smaller limit for the fall-back-to-memory cache -- 4 blocks, a percentage of RAM, or ...? I don't see a convenient way to do this with the current code structure but maybe you do?

When disk cache fails, there's at least some chance you have enough memory to succeed with the current code, so giving up too easily would be unfortunate. OTOH, automatically escalating a variety of disk errors into OOM errors would also be unfortunate.

As discussed at standup, the choice is either to fall back to RAM cache behavior or raise errors. I think we decided that if things have really gone screwy, it's better to surface the error than to hide it, so that is what it does now.

I think this means we need to add some stuff like needScratch = max(needScratch, ctr.RuntimeConstraints.KeepCacheDisk*10) to lib/dispatchcloud/node_size.go.

Ah, no. The calculation I referred to is the default cache size if unspecified. If you provide an explicit size (which is what crunch-run does), then it just uses exactly that size. It already adds KeepCacheDisk to needScratch in EstimateScratchSpace().

I think we also need to add the pre-existing cache usage to "avail" before clamping to that limit.

Good idea. Done.

comment typo "256 GiB in RAM" should be 256 MiB (phew)

Fixed.

18842-arv-mount-disk-config @ 650265966e83fca4ce8e9a416e9b5e358e82be98

developer-run-tests: #3393

#21

Updated by Tom Clegg over 1 year ago

In DiskCacheSlot.evict() would it be a good idea to have an explicit self.content.close() and self.filehandle.close() instead of just assigning None?

In DiskCacheSlot.set(), don't we need to call self.ready.set() after assigning content = mmap() so get() doesn't wait forever? (The memory cache does so, OTOH none of the tests deadlock, so maybe I'm missing something.)

Rest LGTM.

#22

Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote in #note-21:

In DiskCacheSlot.evict() would it be a good idea to have an explicit self.content.close() and self.filehandle.close() instead of just assigning None?

            # The mmap region might be in use when we decided to evict
            # it.  This can happen if the cache is too small.
            #
            # If we call close() now, it'll throw an error if
            # something tries to access it.
            #
            # However, we don't need to explicitly call mmap.close()
            #
            # I confirmed in mmapmodule.c that both close
            # and deallocate do the same thing:
            #
            # a) close the file descriptor
            # b) unmap the memory range
            #
            # So we can forget it in the cache and delete the file on
            # disk, and it will tear it down after any other
            # lingering Python references to the mapped memory are
            # gone.

In DiskCacheSlot.set(), don't we need to call self.ready.set() after assigning content = mmap() so get() doesn't wait forever? (The memory cache does so, OTOH none of the tests deadlock, so maybe I'm missing something.)

This is done by KeepBlockCache.set() because it now re-tries if there is a failure the first time.

        try:
            if tryagain:
                # There was an error.  Evict some slots and try again.
                self.cap_cache()
                slot.set(blob)
        finally:
            # Signal that we are done with the cache
            # slot, one way or another.
            slot.ready.set()

The logic is slightly silly: if tryagain is False it won't do anything, but it will still process the finally block.

I cleaned up the control flow to hopefully be less confusing.

Rest LGTM.

I will merge if this passes.

18842-arv-mount-disk-config @ 31f36e0c8aca5afe43cbe6273ecb8509fa41600b

developer-run-tests: #3397

#23

Updated by Tom Clegg over 1 year ago

Aha. So (memory-backed) CacheSlot.set() doesn't need to call self.ready.set() any more?

#24

Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote in #note-23:

Aha. So (memory-backed) CacheSlot.set() doesn't need to call self.ready.set() any more?

I refactored it. Now both the memory cache and the disk cache call self.ready.set() when the cache slot is set successfully. The KeepBlockCache.set() method will call slot.set(None) in the case where it repeatedly failed to set the cache slot, which also calls self.ready.set(). The resulting flow is outlined below.
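
In outline (a simplified sketch of the flow just described; internals assumed):

import threading

class CacheSlot:
    # The memory- and disk-backed slot variants both follow this shape.
    def __init__(self, locator):
        self.locator = locator
        self.content = None
        self.ready = threading.Event()

    def set(self, value):
        self.content = value
        self.ready.set()     # signal readers on success (or on None)

    def get(self):
        self.ready.wait()
        return self.content  # None means the read ultimately failed

class KeepBlockCache:
    def set(self, slot, blob):
        try:
            slot.set(blob)
        except OSError:
            self.cap_cache()          # evict some slots and retry
            try:
                slot.set(blob)
            except OSError:
                slot.set(None)        # give up, but unblock readers

    def cap_cache(self):
        pass  # eviction elided in this sketch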

18842-arv-mount-disk-config @ 1f2223927141f11055c740c9c047ba007ab81e14

developer-run-tests: #3400

#25

Updated by Peter Amstutz over 1 year ago

  • Status changed from In Progress to Resolved
#27

Updated by Peter Amstutz over 1 year ago

  • Release set to 47
#28

Updated by Brett Smith about 1 year ago

  • Related to Bug #19981: Containers that used an old DefaultKeepCacheRAM no longer get reused after a configuration change added