Bug #21541 (closed)

arv-mount KeyError during cap_cache - Seemingly lost track of parent inode

Added by Brett Smith 2 months ago. Updated 26 days ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: FUSE
Story points: -
Release relationship: Auto

Description

User's arv-mount process crashed with this traceback. Afterward trying to list files in the mount root returned EIO.

2024-02-23 23:36:17 arvados.arvados_fuse[2803055] ERROR: Unhandled exception during FUSE operation
Traceback (most recent call last):
  File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 327, in catch_exceptions_wrapper
    return orig_func(self, *args, **kwargs)
  File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 570, in lookup
    self.inodes.touch(p)
  File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 276, in touch
    self.inode_cache.touch(entry)
  File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 234, in touch
    self.manage(obj)
  File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 228, in manage
    self.cap_cache()
  File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 212, in cap_cache
    self._remove(ent, True)
  File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 186, in _remove
    obj.kernel_invalidate()
  File "venv/lib/python3.10/site-packages/arvados_fuse/fusedir.py", line 220, in kernel_invalidate
    parent = self.inodes[self.parent_inode]
  File "venv/lib/python3.10/site-packages/arvados_fuse/__init__.py", line 260, in __getitem__
    return self._entries[item]
KeyError: 865

This exact same traceback appeared seven times in one second. It's not clear whether that's multiple threads running into the same issue, or the error recurring because of different accesses.

Note this mount is intentionally accessible to multiple users on the host, so you can assume there was concurrent access. Unfortunately, for the same reason, it's hard to know whether a specific operation caused the error.


Files

arv-mount-stress-test.py (5.44 KB) - Brett Smith, 03/01/2024 11:54 PM
arv-mount-stress-test.py, version 3 (5.93 KB) - Brett Smith, 03/22/2024 03:13 PM

Subtasks 2 (0 open, 2 closed)

Task #21555: Review 21541-arv-mount-keyerror-rebase (Resolved, Peter Amstutz, 04/02/2024)
Task #21596: Review https://github.com/arvados/python-llfuse (Resolved, Peter Amstutz, 03/18/2024)

Related issues

Related to Arvados - Bug #21568: arv-mount double free or corruption with many concurrent accesses (Resolved, Peter Amstutz)
Actions #2

Updated by Brett Smith 2 months ago

  • Subject changed from arv-mount KeyError in cap_cache - Seemingly lost track of parent inode to arv-mount KeyError during cap_cache - Seemingly lost track of parent inode
Actions #3

Updated by Peter Amstutz 2 months ago

Do we have the latest command line they are using?

This vaguely looks like cache management is causing the lookup to evict its own parent, which shouldn't happen if the cache is adequately sized, but also it should be properly pinned so it can't be evicted while it is part of a request.

Actions #4

Updated by Brett Smith 2 months ago

Peter Amstutz wrote in #note-3:

Do we have the latest command line they are using?

Edited the numbers for readability; I promise I copied and pasted their arguments into the units tool to do the conversion, to prevent human math mistakes:

ExecStart=VENV_DIR/bin/arv-mount --allow-other --foreground --read-only --directory-cache=[1GB] --disk-cache-dir=CACHE_DIR --file-cache=[100GB] --project=UUID MOUNT_DIR
Actions #5

Updated by Peter Amstutz 2 months ago

  • Target version changed from Future to Development 2024-03-13 sprint
Actions #6

Updated by Peter Amstutz 2 months ago

  • Assigned To set to Peter Amstutz
Actions #7

Updated by Brett Smith about 2 months ago

Successfully reproduced using the attached script with --jobs=6, running arv-mount configured with a pirca admin token.

Mar 01 18:22:50 systemd[2070]: Started arv-mount-stress-nyzb4m3r.service - /opt/arvados/pyclients/bin/arv-mount --foreground --read-only --shared --directory-cache=2097152 -- /run/user/1000/arv-mount-stress-nyzb4m3r.
Mar 01 18:41:23 arv-mount[1147804]: 2024-03-01 18:41:23 arvados.arvados_fuse[1147804] ERROR: Unhandled exception during FUSE operation
Mar 01 18:41:23 arv-mount[1147804]: Traceback (most recent call last):
Mar 01 18:41:23 arv-mount[1147804]:   File "/opt/arvados/pyclients/lib/python3.8/site-packages/arvados_fuse/__init__.py", line 327, in catch_exceptions_wrapper
Mar 01 18:41:23 arv-mount[1147804]:     return orig_func(self, *args, **kwargs)
Mar 01 18:41:23 arv-mount[1147804]:   File "/opt/arvados/pyclients/lib/python3.8/site-packages/arvados_fuse/__init__.py", line 570, in lookup
Mar 01 18:41:23 arv-mount[1147804]:     self.inodes.touch(p)
Mar 01 18:41:23 arv-mount[1147804]:   File "/opt/arvados/pyclients/lib/python3.8/site-packages/arvados_fuse/__init__.py", line 276, in touch
Mar 01 18:41:23 arv-mount[1147804]:     self.inode_cache.touch(entry)
Mar 01 18:41:23 arv-mount[1147804]:   File "/opt/arvados/pyclients/lib/python3.8/site-packages/arvados_fuse/__init__.py", line 234, in touch
Mar 01 18:41:23 arv-mount[1147804]:     self.manage(obj)
Mar 01 18:41:23 arv-mount[1147804]:   File "/opt/arvados/pyclients/lib/python3.8/site-packages/arvados_fuse/__init__.py", line 228, in manage
Mar 01 18:41:23 arv-mount[1147804]:     self.cap_cache()
Mar 01 18:41:23 arv-mount[1147804]:   File "/opt/arvados/pyclients/lib/python3.8/site-packages/arvados_fuse/__init__.py", line 212, in cap_cache
Mar 01 18:41:23 arv-mount[1147804]:     self._remove(ent, True)
Mar 01 18:41:23 arv-mount[1147804]:   File "/opt/arvados/pyclients/lib/python3.8/site-packages/arvados_fuse/__init__.py", line 186, in _remove
Mar 01 18:41:23 arv-mount[1147804]:     obj.kernel_invalidate()
Mar 01 18:41:23 arv-mount[1147804]:   File "/opt/arvados/pyclients/lib/python3.8/site-packages/arvados_fuse/fusedir.py", line 233, in kernel_invalidate
Mar 01 18:41:23 arv-mount[1147804]:     parent = self.inodes[self.parent_inode]
Mar 01 18:41:23 arv-mount[1147804]:   File "/opt/arvados/pyclients/lib/python3.8/site-packages/arvados_fuse/__init__.py", line 260, in __getitem__
Mar 01 18:41:23 arv-mount[1147804]:     return self._entries[item]
Mar 01 18:41:23 arv-mount[1147804]: KeyError: 188

When this happened, I got two stack traces each for inodes 188 and 617, as well as one each for 195 and 196. Note that the total of six matches the number of parallel processes running.

Literally all you have to do is list directories in parallel. You don't need to access Keep at all, just the Arvados API.

If the directory cache size matters, the connection is not obvious.

With a single-digit number of jobs, expect a run to take about ten to fifteen minutes over a remote wifi connection outside the cloud.
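As a rough sketch of that kind of parallel listing (not the attached arv-mount-stress-test.py; the walker, mount path, and job count are placeholders):

#!/usr/bin/env python3
# Several worker processes walking the mount's directory tree at once.
import multiprocessing
import os
import sys

def walk(top):
    seen = 0
    for dirpath, dirnames, filenames in os.walk(top):
        seen += len(dirnames)
    return top, seen

def main(mount_root, jobs=6):
    # One worker per top-level directory in the mount.
    tops = [e.path for e in os.scandir(mount_root) if e.is_dir()]
    with multiprocessing.Pool(jobs) as pool:
        for top, seen in pool.imap_unordered(walk, tops):
            print(f"{top}: {seen} directories")

if __name__ == '__main__':
    main(sys.argv[1])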

Actions #8

Updated by Brett Smith about 2 months ago

  • Related to Bug #21568: arv-mount double free or corruption with many concurrent accesses added
Actions #9

Updated by Peter Amstutz about 2 months ago

21541-arv-mount-keyerror @ 32e56d60fd965260662b5c8d8aaafc0d793e33bd

developer-run-tests: #4069

The most significant change is that opendir() marks all of the directory entries as "in use", so they are not candidates to be cleared from cache until releasedir() is called. This is because entries returned by readdir() are not considered to increment the kernel reference.

Relatedly, when updating the collection contents, it also marks directory entries as "in use", again to avoid removing entries from the cache that were just added.

With these changes, I have not seen the original KeyError.
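As a rough illustration of the pinning idea (hypothetical names, not the actual arvados_fuse code): entries carry an in-use count that opendir() increments and releasedir() decrements, and the eviction pass skips anything still pinned.

class Entry:
    def __init__(self, name):
        self.name = name
        self.in_use = 0              # pinned while > 0

class Directory:
    def __init__(self, entries):
        self.entries = entries

    def opendir(self):
        # Pin every entry handed to readdir(); readdir() itself does
        # not increment the kernel reference count.
        for ent in self.entries:
            ent.in_use += 1
        return self.entries

    def releasedir(self):
        # Unpin; entries become eviction candidates again.
        for ent in self.entries:
            ent.in_use -= 1

def eviction_candidates(cache_entries):
    return [ent for ent in cache_entries if ent.in_use == 0]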

Other changes:

  • Directory objects now compute a default object size so they contribute to the total cache usage for memory management purposes
  • Updates the cache computation when directory contents change
  • Some stylistic tweaks

Regarding the segmentation fault:

Tentatively it seems that we can avoid this if we run llfuse.main(workers=1). This means only one handler runs at a time.
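For reference, a minimal sketch of a single-threaded llfuse main loop (the Operations subclass, mountpoint, and mount options here are placeholders):

import llfuse

class MinimalOperations(llfuse.Operations):
    pass  # real handlers live in arvados_fuse.Operations

operations = MinimalOperations()
llfuse.init(operations, '/mnt/example', set(['ro', 'fsname=example']))
try:
    # workers=1 means only one request handler runs at a time, which is
    # the tentative workaround for the segfault described above.
    llfuse.main(workers=1)
finally:
    llfuse.close()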

Regarding deadlocks:

I also haven't seen any deadlocks with llfuse.main(workers=1). However I don't know if that's because there's no lock contention, or because the deadlocks are a symptom of the crashing behavior.

Actions #11

Updated by Peter Amstutz about 2 months ago

du -h --total /run/user/1000/arv-mount-stress-3_fdlb_q

several hours later...

I wonder if it's finished yet?

/run/user/1000/arv-mount-stress-3_fdlb_q/Brett Smith/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/Filter Group/WGS Run #20497/Container log for request pirca-xvhdp-5kt9js1fehbdr1q/log for container pirca-dz642-dbe7myld42l3xmu

Now I know why the stress test never finished.

Actions #12

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Actions #13

Updated by Peter Amstutz about 2 months ago

  • Release set to 69
Actions #14

Updated by Peter Amstutz about 2 months ago

21541-arv-mount-keyerror @ e74e038e90759c03a03e76021d0c5b9ab7a9a8ca

(I still have to rebase and clean up this branch to get it ready for review)

$ arvswitch pirca
$ arv-mount --shared /tmp/mountpt1/
$ du -h --total /tmp/mountpt1/
(probably took 6-8 hours, not sure exactly, left it running overnight and I didn't run it with 'time')
1.3P    total

$ top
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 620129 peter     20   0 2050084   1.3g  12028 S   0.0   5.7      9,07 arv-mount 

This shows a couple important things:

  • It was able to run for hours and hours doing a meaningful task without errors or deadlocks
  • The memory growth is very modest -- although it would have been even better if it had hit a ceiling and held completely steady, peaking at 1.3g after 16 hours is a huge improvement over what it was doing before (which was bloating to 2+ GB after just 5-10 minutes and getting to 8-9 GB after an hour or two)

I still have to run the stress test just to hammer on it with concurrent access, I'll probably only leave that running for 20-30 minutes because enumerating the entire 'shared' directory would probably take another 8 hours.

Actions #15

Updated by Peter Amstutz about 2 months ago

Also:

2024-03-14 10:29:19,999 arv-mount-stress[814948]: DEBUG: starting mount service arv-mount-stress-mqtnq7k7.service
2024-03-14 10:29:20,054 arv-mount-stress[814948]: DEBUG: waiting for mount
2024-03-14 10:29:25,693 arv-mount-stress[814948]: DEBUG: starting stress test #1
2024-03-14 11:23:16,526 arv-mount-stress[814948]: DEBUG: completed stress test #1

For whatever reason, this seems to have completed much faster than 'du'. I think that is probably because it is only looking at directories instead of getting metadata for every file like 'du'.

Actions #16

Updated by Peter Amstutz about 1 month ago

https://github.com/arvados/python-llfuse @ 6809896f69d821d5320183711a122fef5709e795 (master branch)

I've changed various text and scripts to reference a new package called arvados-llfuse.

I went through the steps in developer-notes/release-process.md (except the final twine upload).

I'd appreciate a sanity check before uploading the package to PyPI.

I intend to make a pull request to https://github.com/python-llfuse/python-llfuse but we can't block on a 3rd party choosing to merge and release or not.

Actions #17

Updated by Peter Amstutz about 1 month ago

  • Status changed from New to In Progress
Actions #18

Updated by Brett Smith about 1 month ago

Peter Amstutz wrote in #note-16:

https://github.com/arvados/python-llfuse @ 6809896f69d821d5320183711a122fef5709e795 (master branch)

This looks good to me. I was able to build the fork myself, install it in my test environment, and successfully run arv-mount tests with it.

I intend to make a pull request to https://github.com/python-llfuse/python-llfuse but we can't block on a 3rd party choosing to merge and release or not.

I don't object to this plan, but this framing is kind of begging the question, since you've already prepared the entire fork before even bothering to file an issue or submit a PR.

Actions #22

Updated by Peter Amstutz about 1 month ago

21541-arv-mount-keyerror-rebase @ 6dfef004d33b147cbe80bbb5ecc6922ac25f156d

developer-run-tests: #4093

  • All agreed upon points are implemented / addressed.
    • bugs described in #21541 and #21568 are resolved.
    • fixed a deadlock condition (llfuse notify queue filled up) that came up in testing
    • significantly reduced growth in memory usage
    • filter groups won't result in infinitely recursive directories, because that breaks programs that do directory traversal (this is the same behavior that was implemented in keep-web for the same reason).
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • filed #21607 because memory usage still creeps up over time
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • tested using arv-mount-stress-test.py
    • tested by running "du --total" on the entire contents of "shared" for pirca
    • passed the arv-mount test suite
  • Documentation has been updated.
    • n/a, no new behavior, a hidden command line option was added for testing but it is not intended to be exposed to users
  • Behaves appropriately at the intended scale (describe intended scale).
    • improvements to memory management mean better scaling when accessing many projects and collections
  • Considered backwards and forwards compatibility issues between client and server.
    • no changes to how arv-mount communicates with the API server
  • Follows our coding standards and GUI style guidelines.
    • I moved some of the comments into docstrings but the FUSE API as a whole isn't intended to be user facing.

Discussion from the commit message:

  • Fixes a segfault on startup due to multiple threads fetching the cluster config using the same http2 object, which is not thread-safe. Now fetches the relevant configuration item once (ForwardSlashNameSubstitution), and stores it where all the threads can access it. (bug #21568)
  • Fixes KeyError thrown where a parent inode is removed from the inodes table before its children.
  • In the process of testing, re-discovered a bug where, if the llfuse _notify_queue fills up, the entire mount process deadlocks.
    The previous fix worked by monkey-patching llfuse to replace a limited-length queue with an unlimited-length queue; however, changes in subsequent llfuse versions caused that variable to be hidden from Python (so the monkey patch didn't fail, but it no longer had any effect either). The solution is to introduce an additional unlimited-size queue in between the operation handlers and the limited-size kernel notification queue (a rough sketch of this arrangement follows this list).
  • Because cache management and inode cleanup interact with kernel notifications (which were moved into a separate thread), I decided they should also be asynchronous from the operation handlers, so they are now part of the same thread that processes kernel notifications.
  • Attempting to remove an inode that is in use will now at minimum send a kernel invalidation, which will sometimes nudge the kernel to forget the inode, enabling us to remove it.
  • Filter groups now check if the filter group contains itself so it doesn't create an infinite path loop that breaks filesystem traversal tools.
  • In the process of testing, found that llfuse didn't wait for the _notify_queue to drain before closing the FUSE channel, resulting in a segfault if the _notify_loop thread tried to process any events after shutdown started. This bug cannot be worked around on the Arvados side, so I have prepared an arvados-llfuse fork with a bug fix.
  • Testing with arv-mount-stress-test (which creates 8 subprocesses that all traverse the filesystem at the same time) now passes with no filesystem errors, no deadlocks, no segfaults, and modest memory usage.
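A rough sketch of the queue-in-between arrangement mentioned above (illustrative names only, not the actual arvados_fuse code): handlers put invalidation requests on an unbounded queue.Queue, and a single thread drains it into llfuse's bounded kernel-notification queue, so the handlers themselves never block on it.

import queue
import threading

import llfuse

class KernelNotifier:
    """Unbounded buffer between FUSE handlers and llfuse's bounded
    kernel-notification queue (sketch; names are made up)."""

    def __init__(self):
        self._q = queue.Queue()      # unbounded: producers never block
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def invalidate_inode(self, inode):
        self._q.put(('inode', inode, None))

    def invalidate_entry(self, parent_inode, name):
        self._q.put(('entry', parent_inode, name))   # name is bytes

    def stop(self):
        self._q.put(None)
        self._thread.join()

    def _run(self):
        while True:
            item = self._q.get()
            if item is None:
                return
            kind, inode, name = item
            # Only this thread can block on the limited-size llfuse queue.
            if kind == 'inode':
                llfuse.invalidate_inode(inode)
            else:
                llfuse.invalidate_entry(inode, name)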
Actions #23

Updated by Brett Smith about 1 month ago

I am still looking at this on my end, but I am seeing an issue where the mount mysteriously disappears. I don't know yet if this is happening because systemd decides for itself that it should make the mount go away, or if it's reacting to something that happens inside the process.

Attached is version 3 of the stress test script. I originally wrote this version to solve the infinite recursion issue, but it also makes the walk a little more efficient by not recursing into directories that are not really relevant to the test.

$ python3 ~/Curii/Notes/arv-mount-stress-test.py 
2024-03-22 10:56:26,651 arv-mount-stress[164142]: DEBUG: starting mount service arv-mount-stress-1mgrui64.service
2024-03-22 10:56:26,683 arv-mount-stress[164142]: DEBUG: waiting for mount
2024-03-22 10:56:30,094 arv-mount-stress[164142]: DEBUG: starting stress test #1
multiprocessing.pool.RemoteTraceback: 
""" 
Traceback (most recent call last):
  File "/opt/arvados/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/brett/Curii/Notes/arv-mount-stress-test.py", line 91, in walk_dir
    child_paths = [path for path in root_path.iterdir() if path.is_dir()]
  File "/home/brett/Curii/Notes/arv-mount-stress-test.py", line 91, in <listcomp>
    child_paths = [path for path in root_path.iterdir() if path.is_dir()]
  File "/opt/arvados/lib/python3.8/pathlib.py", line 1122, in iterdir
    for name in self._accessor.listdir(self):
FileNotFoundError: [Errno 2] No such file or directory: '/run/user/1000/arv-mount-stress-1mgrui64/User Name/my WGS Processing Workflow/Container log for container pirca-dz642-i59m06gtm8ddbi0'
""" 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/brett/Curii/Notes/arv-mount-stress-test.py", line 183, in <module>
    sys.exit(main())
  File "/home/brett/Curii/Notes/arv-mount-stress-test.py", line 162, in main
    stress_returncode = stress_mount_process_pool(mount_state, args.jobs)
  File "/home/brett/Curii/Notes/arv-mount-stress-test.py", line 97, in stress_mount_process_pool
    for result in pool.imap_unordered(walk_dir, mount_state.mount_path.iterdir()):
  File "/opt/arvados/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
FileNotFoundError: [Errno 2] No such file or directory: '/run/user/1000/arv-mount-stress-1mgrui64/User Name/my WGS Processing Workflow/Container log for container pirca-dz642-i59m06gtm8ddbi0'

$ journalctl --no-hostname -S 10:55 | grep -Ev ' (gnome-shell|rtkit-daemon)\['
Mar 22 10:56:26 python[164142]: starting mount service arv-mount-stress-1mgrui64.service
Mar 22 10:56:26 systemd[2074]: Started arv-mount-stress-1mgrui64.service - /opt/arvados/pyclients/bin/arv-mount --foreground --read-only --shared --directory-cache=2097152 -- /run/user/1000/arv-mount-stress-1mgrui64.
Mar 22 10:56:26 python[164142]: waiting for mount
Mar 22 10:56:30 python[164142]: starting stress test #1
Mar 22 11:00:23 systemd[1]: run-user-1000-arv\x2dmount\x2dstress\x2d1mgrui64.mount: Deactivated successfully.
Mar 22 11:00:23 arv-mount[164494]: fusermount: entry for /run/user/1000/arv-mount-stress-1mgrui64 not found in /etc/mtab
Mar 22 11:00:23 systemd[2074]: arv-mount-stress-1mgrui64.service: Consumed 3min 42.708s CPU time.
Actions #24

Updated by Brett Smith about 1 month ago

Peter Amstutz wrote in #note-22:

21541-arv-mount-keyerror-rebase @ 6dfef004d33b147cbe80bbb5ecc6922ac25f156d

So my main concern is that all the cache handling code requires keeping a lot of complex data structures in sync. Are there ways that we can help future readers understand the structure of the data up front so they can follow what's going on here?

My biggest concern is with the entries to self._inode_remove_queue, which is "a tuple starting with an arbitrary string followed by an arbitrary number of arguments with arbitrary types," so the producers scattered throughout the code and lookup table in _inode_op all have to be kept in sync. Could each entry be made into a cheap dataclass? Maybe with a method to do the operation in _inode_op? Then we could rely on method dispatch instead of an explicit, separate lookup table.

If that's too much, the next best thing would be to have these operations be a collections.namedtuple, where the first item is an enum.Enum that spells out the possible operations.
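Something along these lines, purely to illustrate the dataclass-with-dispatch idea (the names are made up):

import dataclasses

@dataclasses.dataclass
class InvalidateInode:
    inode: int

    def inode_op(self, inodes, locked_ops):
        # would perform the kernel invalidation for self.inode here
        return True

@dataclasses.dataclass
class EvictCandidates:
    entries: list

    def inode_op(self, inodes, locked_ops):
        # would walk self.entries and evict the eligible ones here
        return True

# The consumer loop then just calls entry.inode_op(self, locked_ops),
# relying on method dispatch instead of a string -> handler lookup table.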

Could self._by_uuid be a collections.defaultdict(list)? This would let you remove a little data structure manipulation code, but again, I think the bigger win is it gives future readers more information up-front about what this is.
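For example (hypothetical names, just to show the shape of it):

import collections

class UuidIndex:
    def __init__(self):
        # uuid -> list of entries; missing keys start as empty lists
        self._by_uuid = collections.defaultdict(list)

    def add(self, uuid, entry):
        self._by_uuid[uuid].append(entry)     # no setdefault()/key check needed

    def remove(self, uuid, entry):
        entries = self._by_uuid[uuid]
        entries.remove(entry)
        if not entries:
            del self._by_uuid[uuid]           # keep the dict from accumulating empties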

In _inode_remove, I was confused at first why the entry-get-and-handle code seemed to be duplicated, and it took me a few passes until I spotted the difference. I think this would be clearer written something like:

while True:
    blocking_get = True
    while True:
        try:
            entry = self._inode_remove_queue.get(blocking_get)
        except queue.Empty:
            break
        blocking_get = False
        if entry is None:
            return
        _logger.debug("_inode_remove %s", entry)
        if self._inode_op(entry, locked_ops):
            self._inode_remove_queue.task_done()

Smaller code cleanups:

self._cache_entries = collections.OrderedDict()

Built-in dictionaries are ordered since Python 3.7, so this is basically a no-op and can just be written {}.

for ent in listvalues(self._cache_entries):

listvalues is a Python 2 compatibility shim that we're gonna need to remove at some point (#21356). Prefer self._cache_entries.values().

diff --git a/sdk/python/arvados/safeapi.py b/sdk/python/arvados/safeapi.py
index 56b92e8f08..7acabd7b58 100644
--- a/sdk/python/arvados/safeapi.py
+++ b/sdk/python/arvados/safeapi.py
@@ -61,21 +61,21 @@ class ThreadSafeApiCache(object):
         self.keep = keep.KeepClient(api_client=self, **keep_params)

     def localapi(self) -> 'googleapiclient.discovery.Resource':
-        try:
+        if 'api' in self.local.__dict__:
             client = self.local.api
-        except AttributeError:
+        else:
             client = api.api_client(**self._api_kwargs)
             client._http._request_id = lambda: self.request_id or util.new_request_id()
             self.local.api = client

I suspect this is a remnant from when we were trying to address this bug at the ThreadSafeApiCache level, and this change isn't really doing much. The old version is preferable because the new version bypasses a lot of the normal Python attribute lookup rules like calling __getattr__/__getattribute__, etc.

If this is actually a load-bearing change, it needs comments explaining why it needs to be this way. If it's not, prefer reverting it.

In the tests, in general, I don't understand why all the unmount timeouts increased to 60. I feel like we have a constant low-level background edit war about stuff like this, where we increase test timeouts when tests seem slow in development, and then we decrease them again when we get mad about tests that can hang for a long time in CI. 10 seconds already feels like a fairly long time. What's the specific rationale for increasing all the way to 60?

If 60 needs to stay, "20s" needs to be "60s" below for consistency:

diff --git a/services/fuse/tests/mount_test_base.py b/services/fuse/tests/mount_test_base.py
index 8a3522e0cb..e0479d3668 100644
--- a/services/fuse/tests/mount_test_base.py
+++ b/services/fuse/tests/mount_test_base.py
@@ -95,24 +95,24 @@ class MountTestBase(unittest.TestCase):
                 self.operations.events.close(timeout=10)
             subprocess.call(["fusermount", "-u", "-z", self.mounttmp])
             t0 = time.time()
-            self.llfuse_thread.join(timeout=10)
+            self.llfuse_thread.join(timeout=60)
             if self.llfuse_thread.is_alive():
                 logger.warning("MountTestBase.tearDown():" 
-                               " llfuse thread still alive 10s after umount" 
+                               " llfuse thread still alive 20s after umount" 
                                " -- exiting with SIGKILL")
                 os.kill(os.getpid(), signal.SIGKILL)
             waited = time.time() - t0
Actions #25

Updated by Peter Amstutz about 1 month ago

Notes on invalidation notifications that get sent back to the kernel. I've not found this properly documented anywhere, so this is gleaned from the kernel source itself. I am going to put this in a code comment but I wanted to write it down here first.

inode invalidation on a file - This clears the attributes, permissions, and page cache of contents. Pretty straightforward.

inode invalidation on a directory - This clears the attributes, permissions, and (I believe) the readdir cache of contents.

directory entry invalidation - This clears the cached lookup() result for a specific filename on a particular directory inode (the "dentry" cache), and also invalidates the parent directory.

One thing that's counter-intuitive (but which we take advantage of with the "magic" directory and "magic" files) is that readdir results and lookup results are basically independent.

The kernel maintains a reference count on files, as a result of lookup() or other operations such as open(). It tells the FUSE driver when the reference count is decremented by sending forget(). The inode is expected to exist at least until the reference count goes to 0.

It seems that dentries hold references, so it is necessary to invalidate them in order to get the kernel to forget an inode.

The kernel can also have negative dentries that cache lookups that returned "file not found". I suppose that means if a "not found" error has been returned and then later new files appear in the directory, we need to send an invalidation for the new files.

This is important because we want to be able to release things from memory and then bring them back later. When they come back they'll have different inode numbers, so the kernel needs to have a clear picture of what is valid in the cache, or users can get back stale results.
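For reference, the two notification calls being described, as exposed by python-llfuse (the inode numbers and name in this sketch are placeholders):

import llfuse

def nudge_kernel(file_inode, dir_inode, parent_inode, name):
    # Clear cached attributes and page cache for a regular file.
    llfuse.invalidate_inode(file_inode)
    # Clear only the cached attributes (leave data/readdir cache alone).
    llfuse.invalidate_inode(dir_inode, attr_only=True)
    # Drop the cached lookup() result ("dentry") for name under
    # parent_inode; this also invalidates the parent directory.
    llfuse.invalidate_entry(parent_inode, name)   # name is bytes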

Actions #26

Updated by Peter Amstutz about 1 month ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
Actions #27

Updated by Peter Amstutz about 1 month ago

Brett Smith wrote in #note-24:

So my main concern is that all the cache handling code requires keeping a lot of complex data structures in sync. Are there ways that we can help future readers understand the structure of the data up front so they can follow what's going on here?

My biggest concern is with the entries to self._inode_remove_queue, which is "a tuple starting with an arbitrary string followed by an arbitrary number of arguments with arbitrary types," so the producers scattered throughout the code and lookup table in _inode_op all have to be kept in sync. Could each entry be made into a cheap dataclass? Maybe with a method to do the operation in _inode_op? Then we could rely on method dispatch instead of an explicit, separate lookup table.

Done.

If that's too much, the next best thing would be to have these operations be a collections.namedtuple, where the first item is an enum.Enum that spells out the possible operations.

Could self._by_uuid be a collections.defaultdict(list)? This would let you remove a little data structure manipulation code, but again, I think the bigger win is it gives future readers more information up-front about what this is.

Sure. I also moved this over to "Inodes" instead of "InodesCache".

In _inode_remove, I was confused at first why the entry-get-and-handle code seemed to be duplicated, and it took me a few passes until I spotted the difference. I think this would be clearer written something like:

Done.

Built-in dictonaries are ordered since Python 3.7, this is basically a noop and can just be written {}.

https://docs.python.org/3.11/library/collections.html#ordereddict-objects

  • The OrderedDict algorithm can handle frequent reordering operations better than dict. As shown in the recipes below, this makes it suitable for implementing various kinds of LRU caches.
  • OrderedDict has a move_to_end() method to efficiently reposition an element to an endpoint.
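For example, the move_to_end()/popitem() pair is what makes the LRU-style "touch and evict oldest" pattern cheap (a sketch, not the real cache code):

import collections

cache = collections.OrderedDict()

def touch(key, value):
    cache[key] = value
    cache.move_to_end(key)                 # most recently used moves to the end

def evict_oldest():
    if cache:
        return cache.popitem(last=False)   # least recently used comes off the front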

listvalues is a Python 2 compatibility shim that we're gonna need to remove at some point (#21356). Prefer self._cache_entries.values().

Done

I suspect this is a remnant from when we were trying to address this bug at the ThreadSafeApiCache level, and this change isn't really doing much. The old version is preferable because the new version bypasses a lot of the normal Python attribute lookup rules like calling __getattr__/__getattribute__, etc.

If this is actually a load-bearing change, it needs comments explaining why it needs to be this way. If it's not, prefer reverting it.

I reverted it.

In the tests, in general, I don't understand why all the unmount timeouts increased to 60. I feel like we have a constant low-level background edit war about stuff like this, where we increase test timeouts when tests seem slow in development, and then we decrease them again when we get mad about tests that can hang for a long time in CI. 10 seconds already feels like a fairly long time. What's the specific rationale for increasing all the way to 60?

I found that, in testing, with the addition of the thread join() in llfuse (added by the arvados-llfuse fork so our tests would pass reliably), it would sometimes take many seconds to complete. A longer timeout is basically harmless, since the llfuse main loop returns as soon as it is safe to do so, but 2s, 10s, or even 20s wasn't long enough to prevent it from incorrectly timing out and killing the tests.

21541-arv-mount-keyerror-rebase @ 3e197765c7cfefe556391661ee9a14abeab2dbb6

developer-run-tests: #4103

  • Add slots to major Directory classes
  • Disconnect FuseArvadosFile from ArvadosFile to reduce cyclic references.
  • Clean up _remove_inode loop and use dataclasses for the inode operations.
  • Now calls del_entry on collection_record_file and project_object_file.

It looks like collection_record_file was holding a reference to the Collection object (and was remaining in the inodes table) even when CollectionDirectory was cleared. I believe this is the memory leak I have been looking for.

  • Remove the "dead" flag and set parent_inode to None instead. This clarifies the behavior that directory entries keep their (numeric) inodes until they are detached from the directory, which may have contributed to infrequent "file not found" errors.
  • Adjust cache behavior to only hold objects that are cache-eligible and have non-zero cache_size. This avoids filling the cache with entries that are just going to be skipped over.

Overall: Memory usage is mostly stable but does tend to creep up over time. My best guess is that this is forced because we need to keep inodes in RAM as long as the kernel maintains a reference to them, so with multiple processes accessing different filesystem locations, this is simply RAM required for the working set.

I'm also cautiously optimistic that issues I observed with performance slowing down with long-lived processes are improved (e.g. fixing memory leaks means no more unbounded growth of cache_entries, which means no more time wasted iterating over huge lists).

For load testing I have been using the most recent "arv-mount-stress-test.py" and by running "du" on the "shared" directory of pirca.

Actions #28

Updated by Peter Amstutz about 1 month ago

$ time du --total -h /run/user/1000/arv-mount-stress-8g4mqmha
1.5P    /run/user/1000/arv-mount-stress-8g4mqmha
1.5P    total

real    372m26.864s
user    0m6.532s
sys    0m34.343s
Actions #29

Updated by Peter Amstutz about 1 month ago

2024-03-27 13:02:17,490 arv-mount-stress[2945433]: DEBUG: starting mount service arv-mount-stress-8g4mqmha.service
2024-03-27 13:02:17,492 arv-mount-stress[2945433]: DEBUG: waiting for mount
2024-03-27 13:02:21,131 arv-mount-stress[2945433]: DEBUG: starting stress test #1
2024-03-28 00:31:53,697 arv-mount-stress[2945433]: DEBUG: starting stress test #2
Actions #31

Updated by Brett Smith about 1 month ago

Peter Amstutz wrote in #note-30:

21541-arv-mount-keyerror-rebase @ 58fe2158d605b653cd2de61da22bed092dce58b1

Basically this looks good to me. I've been running the stress test on my own system, and it's gone four hours without any issues, which is way longer than any previous test. I'll leave it running for a while longer but I don't think I need to completely reproduce your results for us to move forward.

I have some requests for small changes, most of these are nits. Happy to look at the branch again if you'd like for any reason but I don't know if it's necessary.

My biggest concern is the way we change global GC configuration from Operations.__init__. I think class instantiation changing global state this way is surprising, and we've created maintenance headaches for ourselves in the Python SDK doing similar things. I think it would be better to do this in a place where we know the global state "belongs to us," e.g., from Mount.run or even bin/arv-mount.

  • Add slots to major Directory classes

Can you please update #21505 to note what remains to be done, or close it, whatever's appropriate?

+        # This is a rough guess of the amount of overhead involved for
+        # a collection; the calculation is each file averages 128
+        # bytes in the manifest, but consume 1024 bytes when blown up
+        # into Python data structures.
+        return self._manifest_size * 8

The first time I read this, it wasn't clear to me how the numbers in the comment related to the expression. I'm guessing the connection is 1024 / 128 == 8, but that's still a guess and I'm a little fuzzy on how it all strings together. Is there a clearer way this could be worded?
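For what it's worth, here is the arithmetic as I read it (my guess at the intent, with made-up numbers):

bytes_per_file_in_manifest = 128
bytes_per_file_in_memory = 1024
overhead_factor = bytes_per_file_in_memory // bytes_per_file_in_manifest   # == 8

manifest_size = 128 * 1000                        # e.g. a manifest describing ~1000 files
estimated_in_memory_size = manifest_size * overhead_factor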

Built-in dictonaries are ordered since Python 3.7, this is basically a noop and can just be written {}.

https://docs.python.org/3.11/library/collections.html#ordereddict-objects

Sure, this sounds good to me. It would be nice to add a one-sentence comment to highlight for future readers that we care about these features.

Our coding standards say to follow PEP 8, which says,

For sequences, (strings, lists, tuples), use the fact that empty sequences are false:

# Correct:
if not seq:
if seq:

# Wrong:
if len(seq):
if not len(seq):

This came up a few times in the last major commit:

+        values = collections.deque(self._cache_entries.values())
+
+        while len(values) > 0:

Prefer while values:.

+            with llfuse.lock:
+                while len(locked_ops) > 0:
+                    if locked_ops.popleft().inode_op(self, None):
+                        self._inode_remove_queue.task_done()

Prefer while locked_ops:. In Directory.clear:

+        if len(self._entries) == 0:
+            return

Prefer if not self._entries:.

Thanks.

Actions #33

Updated by Peter Amstutz 26 days ago

Brett Smith wrote in #note-31:

Peter Amstutz wrote in #note-30:

21541-arv-mount-keyerror-rebase @ 58fe2158d605b653cd2de61da22bed092dce58b1

My biggest concern is the way we change global GC configuration from Operations.__init__. I think class instantiation changing global state this way is surprising, and we've created maintenance headaches for ourselves in the Python SDK doing similar things. I think it would be better to do this in a place where we know the global state "belongs to us," e.g., from Mount.run or even bin/arv-mount.

I decided to yank this.

  • Add slots to major Directory classes

Can you please update #21505 to note what remains to be done, or close it, whatever's appropriate?

Done.

The first time I read this, it wasn't clear to me how the numbers in the comment related to the expression. I'm guessing the connection is 1024 / 128 == 8, but that's still a guess and I'm a little fuzzy on how it all strings together. Is there a clearer way this could be worded?

I adjusted the wording to try to be more explicit.

Sure, this sounds good to me. It would be nice to add a one-sentence comment to highlight for future readers that we care about these features.

Done

Our coding standards say to follow PEP 8, which says,

Done.

Actions #34

Updated by Peter Amstutz 26 days ago

  • Status changed from In Progress to Resolved