Bug #22653
open
Lots of "cannot mmap empty file" FUSE errors on Jenkins
Description
We have started seeing this more often since we did #22489 and #22579. It's unclear whether it's a side effect of the setup itself, of increased node reuse, or of something else. Here's a recent example from run-tests-remainder: #5030:
FAIL: singularity_test.go:186: singularitySuite.TestImageCache_New

building singularity image [singularity build /tmp/crunch-run-singularity-1282974358/image.sif docker-archive:///tmp/crunch-run-singularity-1282974358/image.tar]
INFO: Starting build...
Getting image source signatures
Copying blob sha256:67f770da229bf16d0c280f232629b0c1f1243a884df09f6b940a1c7288535a6d
Copying config sha256:a11e762410a6fb4e925d1ea535fecc177d983bdf0dba3261d244fb3c7ee18865
Writing manifest to image destination
Storing signatures
2025/03/10 17:13:18 info unpack layer: sha256:378e3b9fb50c743e1daa7a79dc2cf7c18aa0ac8137a1ca0d51a3b909c80e7d48
INFO: Creating SIF file...
INFO: Build complete: /tmp/crunch-run-singularity-1282974358/image.sif
building singularity image [singularity build /tmp/check-505614902/43/by_uuid/zzzzz-4zz18-4al6gql7va4r8fz/image.sif docker-archive:///tmp/crunch-run-singularity-2250306490/image.tar]
INFO: Starting build...
Getting image source signatures
Copying blob sha256:67f770da229bf16d0c280f232629b0c1f1243a884df09f6b940a1c7288535a6d
Copying config sha256:a11e762410a6fb4e925d1ea535fecc177d983bdf0dba3261d244fb3c7ee18865
Writing manifest to image destination
Storing signatures
2025/03/10 17:13:19 info unpack layer: sha256:378e3b9fb50c743e1daa7a79dc2cf7c18aa0ac8137a1ca0d51a3b909c80e7d48
INFO: Creating SIF file...
FATAL: While performing build: while creating SIF: while creating container: open /tmp/check-505614902/43/by_uuid/zzzzz-4zz18-4al6gql7va4r8fz/image.sif: no such file or directory
singularity_test.go:193:
    c.Check(err, IsNil)
... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc002079e60), Stderr:[]uint8(nil)} ("exit status 255")

singularity_test.go:197:
    s.checkCacheCollectionExists(c, setup)
singularity_test.go:180:
    if !c.Check(cl.Items, HasLen, 1) {
        ...
    }
... obtained []arvados.Collection = []arvados.Collection{}
... n int = 1

Traceback (most recent call last):
  File "/home/jenkins/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/_internal/diskcache.py", line 177, in get_from_disk
    content = mmap.mmap(filehandle.fileno(), 0, access=mmap.ACCESS_READ)
ValueError: cannot mmap an empty file
Traceback (most recent call last):
  File "/home/jenkins/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/_internal/diskcache.py", line 177, in get_from_disk
    content = mmap.mmap(filehandle.fileno(), 0, access=mmap.ACCESS_READ)
ValueError: cannot mmap an empty file
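The `ValueError` in the traceback above is what Python's `mmap.mmap()` raises when asked to map a zero-length file, so the failing cache file was presumably empty when it was read. The following is a minimal sketch that reproduces the error in isolation and shows one possible guard. `map_cache_file` and `reproduce_empty_mmap_error` are hypothetical helpers written for illustration; they are not part of the Arvados SDK, and the sketch does not explain why the file ended up empty.

```python
import mmap
import os
import tempfile


def map_cache_file(path):
    """Map path read-only, or return None if the file is empty.

    Hypothetical helper for illustration; not part of the Arvados SDK.
    """
    with open(path, "rb") as filehandle:
        size = os.fstat(filehandle.fileno()).st_size
        if size == 0:
            # mmap.mmap() with length 0 raises ValueError on an empty file,
            # so surface the empty file to the caller instead.
            return None
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(filehandle.fileno(), 0, access=mmap.ACCESS_READ)


def reproduce_empty_mmap_error(path):
    """Return the ValueError raised by mapping path, or None if mapping works."""
    try:
        with open(path, "rb") as filehandle:
            mmap.mmap(filehandle.fileno(), 0, access=mmap.ACCESS_READ).close()
    except ValueError as err:
        return err
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        empty_path = os.path.join(tmpdir, "empty-block")
        open(empty_path, "wb").close()  # zero-length file, e.g. after a failed write
        print(repr(reproduce_empty_mmap_error(empty_path)))
        print(map_cache_file(empty_path))
```

A guard like this only changes how the symptom is reported. Whatever left the file empty in the first place would still need to be found.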
At first this happened alongside out-of-space errors, and we increased the disk size for Jenkins worker nodes to deal with that (see #22579#note-7). But this specific error is still occurring, now alongside ENOMEM errors. run-tests-doc-pysdk-api-fuse: #757:
__________________ ERROR at setup of FuseMountTest_0.runTest ___________________

cls = <class 'tests.test_mount.FuseMountTest_0'>

    @classmethod
    def setUpClass(cls):
        if cls.disk_cache:
            cls._disk_cache_dir = tempfile.mkdtemp(prefix='MountTest-')
        else:
            cls._disk_cache_dir = None
>       cls._keep_block_cache = arvados.keep.KeepBlockCache(
            disk_cache=cls.disk_cache,
            disk_cache_dir=cls._disk_cache_dir,
        )

tests/mount_test_base.py:38:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/jenkins/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/keep.py:189: in __init__
    self._cache = diskcache.DiskCacheSlot.init_cache(self._disk_cache_dir, self._max_slots)
/home/jenkins/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/_internal/diskcache.py:211: in init_cache
    ds.set(b'a')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <arvados._internal.diskcache.DiskCacheSlot object at 0x7f2ba89009c0>
value = b'a'

    def set(self, value):
        tmpfile = None
        try:
            if value is None:
                self.content = None
                self.ready.set()
                return False

            if len(value) == 0:
                # Can't mmap a 0 length file
                self.content = b''
                self.ready.set()
                return True

            if self.content is not None:
                # Has been set already
                self.ready.set()
                return False

            blockdir = os.path.join(self.cachedir, self.locator[0:3])
            os.makedirs(blockdir, mode=0o700, exist_ok=True)

            final = os.path.join(blockdir, self.locator) + cacheblock_suffix

            self.filehandle = tempfile.NamedTemporaryFile(dir=blockdir, delete=False, prefix="tmp", suffix=cacheblock_suffix)
            tmpfile = self.filehandle.name
            os.chmod(tmpfile, stat.S_IRUSR | stat.S_IWUSR)

            # aquire a shared lock, this tells other processes that
            # we're using this block and to please not delete it.
            fcntl.flock(self.filehandle, fcntl.LOCK_SH)

            self.filehandle.write(value)
            self.filehandle.flush()
            os.rename(tmpfile, final)
            tmpfile = None

>           self.content = mmap.mmap(self.filehandle.fileno(), 0, access=mmap.ACCESS_READ)
E           OSError: [Errno 12] Cannot allocate memory

/home/jenkins/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/_internal/diskcache.py:83: OSError
Also:
__________________________ FuseWriteFileTest.runTest ___________________________

self = <tests.test_mount.FuseWriteFileTest testMethod=runTest>

    def runTest(self):
        collection = arvados.collection.Collection(api_client=self.api)
        collection.save_new()

>       m = self.make_mount(fuse.CollectionDirectory)

tests/test_mount.py:468:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/mount_test_base.py:86: in make_mount
    self.operations = fuse.Operations(
arvados_fuse/__init__.py:624: in __init__
    self.inodes = Inodes(inode_cache, encoding=encoding, fsns=fsns,
arvados_fuse/__init__.py:317: in __init__
    self._inode_remove_thread.start()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Thread(Thread-79, initial daemon)>

    def start(self):
        """Start the thread's activity.

        It must be called at most once per thread object. It arranges for the
        object's run() method to be invoked in a separate thread of control.

        This method will raise a RuntimeError if called more than once on the
        same thread object.

        """
        if not self._initialized:
            raise RuntimeError("thread.__init__() not called")

        if self._started.is_set():
            raise RuntimeError("threads can only be started once")

        with _active_limbo_lock:
            _limbo[self] = self
        try:
>           _start_new_thread(self._bootstrap, ())
E           RuntimeError: can't start new thread

/usr/lib/python3.9/threading.py:874: RuntimeError
This seems to affect basically only arv-mount. It feels like some error happens earlier and gets ignored or suppressed, and then the run ends up failing this way. Looking for the "original" error might be productive.
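One way to look for an earlier cause is to record process resource usage at test setup and compare it across runs. ENOMEM from `mmap` and "can't start new thread" are both commonly seen when a process or user runs into a memory or thread limit, but that is only an assumption here; nothing in this ticket confirms it. The sketch below uses hypothetical helper names (`read_proc_status_field`, `resource_snapshot`) and reads Linux-specific fields from `/proc/self/status`.

```python
import resource
import threading


def read_proc_status_field(field, status_path="/proc/self/status"):
    """Return the value of one field from /proc/self/status, or None."""
    try:
        with open(status_path) as status:
            for line in status:
                name, _, value = line.partition(":")
                if name == field:
                    return value.strip()
    except OSError:
        pass  # not Linux, or /proc is unavailable
    return None


def resource_snapshot():
    """Collect counters and limits that could explain ENOMEM or thread-start failures.

    Hypothetical diagnostic helper; not part of the Arvados test suite.
    """
    return {
        # Threads known to the Python threading module.
        "python_threads": threading.active_count(),
        # Threads and virtual memory size as reported by the kernel (Linux only).
        "os_threads": read_proc_status_field("Threads"),
        "vm_size": read_proc_status_field("VmSize"),
        # (soft, hard) limits on processes/threads per user and on address space.
        "rlimit_nproc": resource.getrlimit(resource.RLIMIT_NPROC),
        "rlimit_as": resource.getrlimit(resource.RLIMIT_AS),
    }


if __name__ == "__main__":
    for key, value in resource_snapshot().items():
        print(f"{key}: {value}")
```

Logging a snapshot like this from a test fixture, before and after each mount test, would show whether thread counts or address space grow steadily on reused nodes.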
Updated by Brett Smith 22 days ago
For now as a mitigation I have configured Jenkins test nodes to be oneshot nodes; i.e., they will not be reused.
With that change, I have reverted the disk size increase from #22579#note-7 and set it back to 40, since the main (only?) reason to increase the disk was to accommodate more reuse.