Bug #20909
closed
PySDK tests.test_keep_client.KeepDiskCacheTestCase.test_disk_cache_cap fails on Debian 12 with a "real" $TMPDIR filesystem
Added by Brett Smith over 1 year ago.
Updated about 2 months ago.
Release relationship:
Auto
Description
This test fails consistently on my Debian 12 system running Python 3.11 (from the Debian package) or Python 3.8 (built from source):
======================================================================
FAIL: test_disk_cache_cap (tests.test_keep_client.KeepDiskCacheTestCase.test_disk_cache_cap)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/brett/Curii/arvados/sdk/python/.eggs/mock-3.0.5-py3.11.egg/mock/mock.py", line 1330, in patched
    return func(*args, **keywargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brett/Curii/arvados/sdk/python/tests/test_keep_client.py", line 1700, in test_disk_cache_cap
    self.assertFalse(os.path.exists(os.path.join(self.disk_cache_dir, self.locator[0:3], self.locator+".keepcacheblock")))
AssertionError: True is not false
This might be specific to my system and a non-issue but looking at the test I'm skeptical. My first guess is that something is changing the ordering of things somewhere such that KeepBlockCache removes the more recent block, not the first one.
- Subject changed from Failing PySDK test on Debian 12/Python 3.11 to Failing PySDK test on Debian 12
This test also fails even if you build your own Python 3.8 and run the tests with it.
- Description updated (diff)
- Target version changed from To be scheduled to Future
- Subject changed from Failing PySDK test on Debian 12 to PySDK tests.test_keep_client.KeepDiskCacheTestCase fails on Debian 12
- Subject changed from PySDK tests.test_keep_client.KeepDiskCacheTestCase fails on Debian 12 to PySDK tests.test_keep_client.KeepDiskCacheTestCase.test_disk_cache_cap fails on Debian 12
The entire sdk/python test suite passes for me on debian 12, stock python 3.11.2. Also tried running this test 32x, and there were no failures.
On my new dev box (i.e., not the same hardware as #note-6), my new debian 12 VM with stock python 3.11.2 fails this test on 29 of 30 attempts.
- Target version changed from Future to Development 2025-02-12
- Target version changed from Development 2025-02-12 to Development 2025-02-26
- Target version changed from Development 2025-02-26 to Development 2025-03-19
- Target version changed from Development 2025-03-19 to Development 2025-02-26
I cannot currently reproduce this in my Debian 12 VM. I created a completely fresh run-tests tempdir with both Python 3.8 and Python 3.11 and ran 10 test sdk/python for both. Everything passed.
- Assigned To set to Brett Smith
It seems plausible that the work on #22420 fixed this.
I would say, see if Tom can reproduce, if he can't either let's call it good.
Still failing 30/30 times for me at 7301a282a5. Starting with a fresh run-tests tempdir doesn't help, still fails 30/30 times. Same VM as #note-7, same stock python 3.11.
$ dpkg-query --show python3.11
python3.11 3.11.2-6+deb12u5
I made a new VM and using the same stock Python as Tom I still can't reproduce the failure after multiple attempts.
At this point I'm gonna just start digging into the code, but for posterity I've attached all the versions of Debian packages installed on the system as well as PyPI packages installed in the test VENV3DIR.
I am putting together a matrix, but the short update is that I have figured out that the difference is down to the underlying filesystem of $TMPDIR. It mostly passes if $TMPDIR is tmpfs, and mostly fails if $TMPDIR is btrfs or ext4.
btrfs TMPDIR - Failed 50/50 times but I have seen it pass occasionally
ext4 TMPDIR - Failed 50/50 times, I have never seen it pass
tmpfs TMPDIR - Failed 0/50 times
Current hypothesis is that the test is expecting a particular order of operations from the kernel and it gets that on tmpfs but doesn't on other filesystems (maybe with more recent kernels?).
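One way to check this hypothesis on a given machine is to probe whether files created back to back under $TMPDIR get distinct mtimes. The following is a minimal sketch, not code from the Arvados tree; count_distinct_mtimes is a hypothetical helper name, and the result will vary by filesystem, kernel, and timing.

```python
import os
import tempfile


def count_distinct_mtimes(directory, n=2):
    """Create n files back to back under `directory` and return how many
    distinct st_mtime_ns values they ended up with.

    Hypothetical helper for illustration only. A result smaller than n
    means at least two files share an mtime on this filesystem.
    """
    stamps = []
    with tempfile.TemporaryDirectory(dir=directory) as workdir:
        for i in range(n):
            path = os.path.join(workdir, f"probe{i}")
            with open(path, "wb") as f:
                f.write(b"x")
            stamps.append(os.stat(path).st_mtime_ns)
    return len(set(stamps))


if __name__ == "__main__":
    # tempfile.gettempdir() honors $TMPDIR when it is set.
    tmpdir = tempfile.gettempdir()
    print(tmpdir, count_distinct_mtimes(tmpdir))
```

Running this repeatedly with $TMPDIR pointed at tmpfs, ext4, and btrfs in turn should show whether identical mtimes line up with the failing cases in the matrix above.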
- Status changed from New to In Progress
You can get the test to pass if you insert time.sleep(1) in between the two file creations at the top of the test method. Current theory is that on typical deployments of real disk filesystems, mtimes are coarsened a little bit to reduce disk wear. When this happens, the code under test will see them as equal, and may choose to delete the opposite of the intended cache file.
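To illustrate the theory, here is a minimal sketch of mtime-based eviction and of a way to make a test independent of filesystem timestamp granularity by setting mtimes explicitly with os.utime instead of sleeping. This is not the KeepBlockCache implementation and not necessarily what the fix branch does; evict_oldest and make_block are hypothetical names used only for this example.

```python
import os
import tempfile


def make_block(cache_dir, name, mtime_ns):
    """Write a small file and force its atime/mtime to mtime_ns.

    Hypothetical test helper: explicit timestamps remove any dependence
    on how the filesystem stamps files created back to back.
    """
    path = os.path.join(cache_dir, name)
    with open(path, "wb") as f:
        f.write(b"data")
    os.utime(path, ns=(mtime_ns, mtime_ns))
    return path


def evict_oldest(cache_dir):
    """Delete the entry with the smallest mtime and return its name.

    If two entries have equal mtimes, min() keeps the first one it sees,
    and os.listdir() order is arbitrary, so either file may be removed.
    That is the failure mode described above.
    """
    names = os.listdir(cache_dir)
    oldest = min(
        names,
        key=lambda n: os.stat(os.path.join(cache_dir, n)).st_mtime_ns,
    )
    os.remove(os.path.join(cache_dir, oldest))
    return oldest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        # Whole-second timestamps, far enough apart for any filesystem.
        make_block(cache_dir, "first", 1_000 * 10**9)
        make_block(cache_dir, "second", 2_000 * 10**9)
        print(evict_oldest(cache_dir))
```

With explicit, well-separated mtimes the eviction choice is deterministic on any filesystem, whereas time.sleep(1) only works as long as the timestamp granularity is finer than the sleep.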
20909-keep-disk-cache-test-mtime @ 16b7169b5a96d369c3da1520d3ff5f19aca230cf - developer-run-tests: #4665 
- All agreed upon points are implemented / addressed.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- Code is tested and passing, both automated and manual, what manual testing was done is described
- See above. On my system this specific test now passes 50/50 runs when backed by an ext4 $TMPDIR where it previously failed 50/50.
- Documentation has been updated.
- Behaves appropriately at the intended scale (describe intended scale).
- Considered backwards and forwards compatibility issues between client and server.
- Follows our coding standards and GUI style guidelines.
- Subject changed from PySDK tests.test_keep_client.KeepDiskCacheTestCase.test_disk_cache_cap fails on Debian 12 to PySDK tests.test_keep_client.KeepDiskCacheTestCase.test_disk_cache_cap fails on Debian 12 with a "real" $TMPDIR filesystem
LGTM. On my VM just now (ext4):
- main passed 1/50
- 20909-keep-disk-cache-test-mtime passed 50/50
- Status changed from In Progress to Resolved