Bug #19872

closed

Too many open files error

Added by Peter Amstutz about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: FUSE
Target version:
Start date: 12/13/2022
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

A test case that runs "zcat" on a bunch of gzipped fastq files gets the following error after a while:

2022-12-12 17:12:52 arvados.api[8871] DEBUG: [req-49kou2g0vgw59vt7dbt1] Retrying API request in 4 s after socket error
Traceback (most recent call last):
  File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/arvados_python_client-2.5.0.dev20221202182828-py3.8.egg/arvados/api.py", line 88, in _intercept_http_request
  File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/httplib2-0.20.1-py3.8.egg/httplib2/__init__.py", line 1711, in request
  File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/httplib2-0.20.1-py3.8.egg/httplib2/__init__.py", line 1427, in _request
  File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/httplib2-0.20.1-py3.8.egg/httplib2/__init__.py", line 1349, in _conn_request
  File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/httplib2-0.20.1-py3.8.egg/httplib2/__init__.py", line 1125, in connect
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/socket.py", line 918, in getaddrinfo
OSError: [Errno 24] Too many open files

The most likely explanation is that Python isn't garbage collecting the mmap'd keep cache blocks as expected -- need to investigate.


Subtasks 1 (0 open, 1 closed)

Task #19879: review 19872-mnt-cache-limits | Resolved | Peter Amstutz | 12/13/2022

Actions #1

Updated by Peter Amstutz about 2 months ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz about 2 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz about 2 months ago

  • Release set to 47
Actions #4

Updated by Peter Amstutz about 2 months ago

19872-mnt-cache-limits @ 9f7c39451c16003c6c6e0fb8de5a990781cb300f

  • Reduce max slots to 3/8 max fds instead of 1/2 because mmap() uses a
    second file descriptor, and we keep the original file descriptor open
    for flock()
  • Rework how cache slots are allocated to try evicting things before
    allocating a new cache slot, so the cache should be somewhat better
    behaved about staying within its configured limits.
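
A minimal sketch of the two changes above, not the actual code on the branch. The names max_slots_for_fd_limit and SlotCache are hypothetical, and the cache is modeled as a plain LRU dictionary rather than mmap'd block files. It only illustrates sizing the slot count as 3/8 of the fd limit and evicting before allocating a new slot.

```python
import collections


def max_slots_for_fd_limit(nofile_limit):
    """Cap cache slots at 3/8 of the fd limit (each slot is assumed to hold 2 fds)."""
    return max(1, nofile_limit * 3 // 8)


class SlotCache:
    """Hypothetical LRU cache that evicts old entries before adding a new slot."""

    def __init__(self, max_slots):
        self.max_slots = max_slots
        self._slots = collections.OrderedDict()

    def get(self, key):
        """Return the cached value (marking it most recently used), or None."""
        if key not in self._slots:
            return None
        self._slots.move_to_end(key)
        return self._slots[key]

    def reserve(self, key, value):
        """Store a value, evicting least recently used slots first if at the limit."""
        if key in self._slots:
            self._slots.move_to_end(key)
            self._slots[key] = value
            return
        while len(self._slots) >= self.max_slots:
            self._slots.popitem(last=False)  # drop the least recently used slot
        self._slots[key] = value

    def __len__(self):
        return len(self._slots)


cache = SlotCache(max_slots_for_fd_limit(1024))
```

With the typical limit of 1024 this gives 384 slots, which at two descriptors per slot uses up to 768 descriptors.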

developer-run-tests: #3419

Actions #6

Updated by Tom Clegg about 2 months ago

  • Description updated (diff)
Actions #7

Updated by Peter Amstutz about 2 months ago

As a side effect, I think I confirmed that Python is garbage collecting things as expected, so the only real problem was that I was not aware that every mmap() allocates an additional file descriptor. As a result, the upper limit on the cache is now effectively 24 GiB instead of 32 GiB.
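
One way to observe the extra descriptor is to count the entries in /proc/self/fd before and after creating a mapping. This is a rough sketch that assumes Linux and CPython's default mmap behavior (the mmap object keeps its own duplicate of the descriptor). The helper names are made up for illustration.

```python
import mmap
import os
import tempfile


def count_open_fds():
    """Count this process's open descriptors (Linux only, via /proc)."""
    return len(os.listdir("/proc/self/fd"))


def fds_used_by_mmap():
    """Return (extra fds while mapped, extra fds after closing the map)."""
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * mmap.PAGESIZE)
        f.flush()
        before = count_open_fds()
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        while_mapped = count_open_fds()
        m.close()
        after_close = count_open_fds()
    return while_mapped - before, after_close - before


extra_while_mapped, extra_after_close = fds_used_by_mmap()
print(extra_while_mapped, extra_after_close)
```

If the original file object is also kept open (for flock(), as on this branch), each cache slot then accounts for two descriptors.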

This can potentially be increased by increasing RLIMIT_NOFILE. Maybe we want to call setrlimit and adjust that up to 2048 or 4096?
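
For reference, raising the soft limit from Python could look roughly like the sketch below. It uses the standard resource module. The function name and the default target of 4096 are placeholders, not what arv-mount does. An unprivileged process can only raise its soft limit as far as the hard limit, so the target is clamped.

```python
import logging
import resource


def raise_nofile_limit(target=4096):
    """Try to raise the soft RLIMIT_NOFILE to target; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough
    # The soft limit cannot exceed the hard limit without extra privileges.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft <= soft:
        logging.warning("RLIMIT_NOFILE stays at %d (hard limit %d)", soft, hard)
        return soft
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError) as err:
        logging.warning("could not raise RLIMIT_NOFILE from %d: %s", soft, err)
        return soft
    return new_soft


effective_limit = raise_nofile_limit(4096)
```

The return value is the limit the caller should size the cache against, whether or not the adjustment succeeded.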

Actions #8

Updated by Peter Amstutz about 2 months ago

Oh yeah, the user who reported this issue tested it again with the packages above and reported it was fixed for them.

Actions #9

Updated by Tom Clegg about 1 month ago

Everyone seems to agree the default/typical NOFILE limit of 1024 is too low. Consuming 3/4 of them seems like a bit much. Having a client library adjust NOFILE seems a little bit weird but at least arv-mount could raise NOFILE limit to 10240 if it's lower than that (and log a warning if it can't be raised?), and sdk/python could limit _max_slots to NOFILE/8. That would leave us with max 128-block / 8 GiB cache for most callers that don't adjust their NOFILE==1024, which doesn't seem so bad, and probably max 80 GiB for arv-mount, which seems like plenty.
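
The arithmetic behind these numbers, as a small sketch. It assumes 64 MiB per cache block, which is what "128-block / 8 GiB" implies. The function name is made up for illustration.

```python
BLOCK_SIZE_MIB = 64  # assumption: one cache slot holds one 64 MiB block


def cache_limit(nofile, numerator=1, denominator=8):
    """Return (max slots, max cache size in GiB) for a given fd limit and slot fraction."""
    slots = nofile * numerator // denominator
    return slots, slots * BLOCK_SIZE_MIB / 1024


print(cache_limit(1024, 1, 2))   # original: slots = 1/2 of NOFILE
print(cache_limit(1024, 3, 8))   # first fix: slots = 3/8 of NOFILE
print(cache_limit(1024, 1, 8))   # proposed module default: NOFILE/8
print(cache_limit(10240, 1, 8))  # arv-mount with NOFILE raised to 10240
```

Since each slot holds two descriptors, NOFILE/8 slots consume up to a quarter of the available descriptors.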

Ideally we would be able to control fd usage by closing the files without deleting them -- especially in the case where any process with NOFILE=1024 immediately deletes a lot of cache blocks that other processes with higher limits could still be using -- but it looks like that would involve more refactoring than it's worth at this point.

So, still room to improve, but even the current version is worth merging.

Actions #10

Updated by Peter Amstutz about 1 month ago

19872-mnt-cache-limits @ 4a832a93cd0baf253575936a79f83bcc4f666a82

module default is NOFILE/8 (so it consumes up to 1/4 of available file descriptors)

arv-mount adjusts rlimit to 10240

developer-run-tests: #3424

Actions #11

Updated by Tom Clegg about 1 month ago

LGTM, thanks!

Actions #12

Updated by Peter Amstutz about 1 month ago

  • Status changed from In Progress to Resolved