Bug #21640

open

Controller uses >64K file descriptors, causing cluster outage

Added by Brett Smith 28 days ago. Updated about 4 hours ago.

Status: New
Priority: Normal
Assigned To: -
Category: API
Story points: -

Description

User's Arvados controller started failing with this error on all incoming requests:

Mar 27 17:31:08 arvados-controller[3502095]: 2024/03/27 17:31:08 http: Accept error: accept tcp 127.0.0.1:8003: accept4: too many open files; retrying in 1s

The user deployed with our provided arvados-controller.service definition, which sets LimitNOFILE=65536, and controller apparently exceeded this limit. The failures seem to correlate with when Keep balance starts working.

The user has had the cluster for over a year and only encountered this issue when running a workflow with a particularly large multithreaded job. The error has happened twice with this job in the mix (not necessarily in the same container; the workflow is also being debugged).

I note this LimitNOFILE documentation:

Don't use. Be careful when raising the soft limit above 1024, since select(2) cannot function with file descriptors above 1023 on Linux. Nowadays, the hard limit defaults to 524288, a very high value compared to historical defaults. Typically applications should increase their soft limit to the hard limit on their own, if they are OK with working with file descriptors above 1023, i.e. do not use select(2). Note that file descriptors are nowadays accounted like any other form of memory, thus there should not be any need to lower the hard limit. Use MemoryMax= to control overall service memory use, including file descriptor memory.

  • Does controller use select? Is it possible we're exceeding 1024 open connections, and then hitting that documented limit?
  • Is it possible the controller is leaking file descriptors? I can go get other logs if someone can give me at least a general sense of what I should be looking for.
  • If we're not using select and we think everything is well behaved, I think it would be a nice change to have controller raise its own soft limit, as the documentation suggests.
#3

Updated by Brett Smith 28 days ago

  • Description updated (diff)
#4

Updated by Peter Amstutz about 4 hours ago

  • Target version set to Development 2024-05-22 sprint