Bug #22653

Lots of "cannot mmap empty file" FUSE errors on Jenkins

Added by Brett Smith 22 days ago. Updated 22 days ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

We have been seeing this more often ever since #22489 and #22579. It's unclear whether it's a side effect of the new setup itself, of increased node reuse, or something else. Here's a recent example from run-tests-remainder #5030:

FAIL: singularity_test.go:186: singularitySuite.TestImageCache_New

building singularity image
[singularity build /tmp/crunch-run-singularity-1282974358/image.sif docker-archive:///tmp/crunch-run-singularity-1282974358/image.tar]
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:67f770da229bf16d0c280f232629b0c1f1243a884df09f6b940a1c7288535a6d
Copying config sha256:a11e762410a6fb4e925d1ea535fecc177d983bdf0dba3261d244fb3c7ee18865
Writing manifest to image destination
Storing signatures
2025/03/10 17:13:18  info unpack layer: sha256:378e3b9fb50c743e1daa7a79dc2cf7c18aa0ac8137a1ca0d51a3b909c80e7d48
INFO:    Creating SIF file...
INFO:    Build complete: /tmp/crunch-run-singularity-1282974358/image.sif

building singularity image
[singularity build /tmp/check-505614902/43/by_uuid/zzzzz-4zz18-4al6gql7va4r8fz/image.sif docker-archive:///tmp/crunch-run-singularity-2250306490/image.tar]
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:67f770da229bf16d0c280f232629b0c1f1243a884df09f6b940a1c7288535a6d
Copying config sha256:a11e762410a6fb4e925d1ea535fecc177d983bdf0dba3261d244fb3c7ee18865
Writing manifest to image destination
Storing signatures
2025/03/10 17:13:19  info unpack layer: sha256:378e3b9fb50c743e1daa7a79dc2cf7c18aa0ac8137a1ca0d51a3b909c80e7d48
INFO:    Creating SIF file...
FATAL:   While performing build: while creating SIF: while creating container: open /tmp/check-505614902/43/by_uuid/zzzzz-4zz18-4al6gql7va4r8fz/image.sif: no such file or directory

singularity_test.go:193:
    c.Check(err, IsNil)
... value *exec.ExitError = &exec.ExitError{ProcessState:(*os.ProcessState)(0xc002079e60), Stderr:[]uint8(nil)} ("exit status 255")

singularity_test.go:197:
    s.checkCacheCollectionExists(c, setup)
singularity_test.go:180:
    if !c.Check(cl.Items, HasLen, 1) {
        ...
    }
... obtained []arvados.Collection = []arvados.Collection{}
... n int = 1

Traceback (most recent call last):
  File "/home/jenkins/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/_internal/diskcache.py", line 177, in get_from_disk
    content = mmap.mmap(filehandle.fileno(), 0, access=mmap.ACCESS_READ)
ValueError: cannot mmap an empty file
Traceback (most recent call last):
  File "/home/jenkins/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/_internal/diskcache.py", line 177, in get_from_disk
    content = mmap.mmap(filehandle.fileno(), 0, access=mmap.ACCESS_READ)
ValueError: cannot mmap an empty file
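
For reference, mmap() with length 0 means "map the whole file", and CPython rejects that for a zero-length file. The set() path quoted in the second traceback below explicitly guards against empty values, so this error in get_from_disk presumably means a zero-length cache file ended up on disk (created but never written, or truncated). A minimal sketch reproducing the error:

import mmap
import tempfile

# mmap with length=0 maps the entire file, which is rejected
# when the file is empty.
with tempfile.TemporaryFile() as f:
    try:
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except ValueError as e:
        print(e)  # cannot mmap an empty file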

At first this happened alongside out-of-space errors, and we increased the disk size for Jenkins worker nodes to deal with that (see #22579#note-7). But this specific error is still occurring, now alongside ENOMEM errors. From run-tests-doc-pysdk-api-fuse #757:

__________________ ERROR at setup of FuseMountTest_0.runTest ___________________

cls = <class 'tests.test_mount.FuseMountTest_0'>

    @classmethod
    def setUpClass(cls):
        if cls.disk_cache:
            cls._disk_cache_dir = tempfile.mkdtemp(prefix='MountTest-')
        else:
            cls._disk_cache_dir = None
>       cls._keep_block_cache = arvados.keep.KeepBlockCache(
            disk_cache=cls.disk_cache,
            disk_cache_dir=cls._disk_cache_dir,
        )

tests/mount_test_base.py:38: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/jenkins/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/keep.py:189: in __init__
    self._cache = diskcache.DiskCacheSlot.init_cache(self._disk_cache_dir, self._max_slots)
/home/jenkins/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/_internal/diskcache.py:211: in init_cache
    ds.set(b'a')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <arvados._internal.diskcache.DiskCacheSlot object at 0x7f2ba89009c0>
value = b'a'

    def set(self, value):
        tmpfile = None
        try:
            if value is None:
                self.content = None
                self.ready.set()
                return False

            if len(value) == 0:
                # Can't mmap a 0 length file
                self.content = b''
                self.ready.set()
                return True

            if self.content is not None:
                # Has been set already
                self.ready.set()
                return False

            blockdir = os.path.join(self.cachedir, self.locator[0:3])
            os.makedirs(blockdir, mode=0o700, exist_ok=True)

            final = os.path.join(blockdir, self.locator) + cacheblock_suffix

            self.filehandle = tempfile.NamedTemporaryFile(dir=blockdir, delete=False, prefix="tmp", suffix=cacheblock_suffix)
            tmpfile = self.filehandle.name
            os.chmod(tmpfile, stat.S_IRUSR | stat.S_IWUSR)

            # acquire a shared lock; this tells other processes that
            # we're using this block and to please not delete it.
            fcntl.flock(self.filehandle, fcntl.LOCK_SH)

            self.filehandle.write(value)
            self.filehandle.flush()
            os.rename(tmpfile, final)
            tmpfile = None

>           self.content = mmap.mmap(self.filehandle.fileno(), 0, access=mmap.ACCESS_READ)
E           OSError: [Errno 12] Cannot allocate memory

/home/jenkins/tmp/VENV3DIR/lib/python3.9/site-packages/arvados/_internal/diskcache.py:83: OSError
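
Note that on Linux, ENOMEM from mmap(2) doesn't necessarily mean the machine is out of RAM: it's also returned when the process's mapping count would exceed vm.max_map_count, or when an address-space rlimit is hit. A sketch of a diagnostic we could run on a failing node (report_mmap_pressure is a hypothetical helper, not part of the SDK):

import resource

def report_mmap_pressure():
    # Compare the current number of mappings to the per-process cap.
    with open('/proc/self/maps') as f:
        n_maps = sum(1 for _ in f)
    with open('/proc/sys/vm/max_map_count') as f:
        max_maps = int(f.read())
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    print(f"mappings: {n_maps} of max {max_maps}")
    print(f"RLIMIT_AS: soft={soft} hard={hard}")

report_mmap_pressure()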

Also:

__________________________ FuseWriteFileTest.runTest ___________________________

self = <tests.test_mount.FuseWriteFileTest testMethod=runTest>

    def runTest(self):
        collection = arvados.collection.Collection(api_client=self.api)
        collection.save_new()

>       m = self.make_mount(fuse.CollectionDirectory)

tests/test_mount.py:468: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/mount_test_base.py:86: in make_mount
    self.operations = fuse.Operations(
arvados_fuse/__init__.py:624: in __init__
    self.inodes = Inodes(inode_cache, encoding=encoding, fsns=fsns,
arvados_fuse/__init__.py:317: in __init__
    self._inode_remove_thread.start()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Thread(Thread-79, initial daemon)>

    def start(self):
        """Start the thread's activity.

        It must be called at most once per thread object. It arranges for the
        object's run() method to be invoked in a separate thread of control.

        This method will raise a RuntimeError if called more than once on the
        same thread object.

        """ 
        if not self._initialized:
            raise RuntimeError("thread.__init__() not called")

        if self._started.is_set():
            raise RuntimeError("threads can only be started once")

        with _active_limbo_lock:
            _limbo[self] = self
        try:
>           _start_new_thread(self._bootstrap, ())
E           RuntimeError: can't start new thread

/usr/lib/python3.9/threading.py:874: RuntimeError
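
"can't start new thread" means pthread_create() failed, typically because the user's task limit (RLIMIT_NPROC, i.e. ulimit -u, which on Linux counts threads) or a cgroup pids limit was reached, so it's consistent with the same kind of resource exhaustion. A quick check (again just a sketch) we could run in the test environment:

import resource
import threading

# On Linux, RLIMIT_NPROC counts tasks (threads) across all of the
# user's processes, so exhausting it makes thread creation fail.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft} hard={hard}")
print(f"threads in this process: {threading.active_count()}")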

This seems to affect essentially only arv-mount. It feels like some earlier error is being ignored or suppressed, and the tests then end up failing this way. Tracking down the "original" error might be productive.

#1

Updated by Brett Smith 22 days ago

For now, as a mitigation, I have configured the Jenkins test nodes to be one-shot nodes, i.e., they will not be reused.

With that change, I have reverted the disk size increase from #22579#note-7 and set it back to 40, since the main (only?) reason to increase the disk was to accommodate more reuse.
