Bug #21640

Controller uses >64K file descriptors, causing cluster outage

Added by Brett Smith 8 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Story points: -
Release:
Release relationship: Auto

Description

User's Arvados controller started failing with this error on all incoming requests:

Mar 27 17:31:08 arvados-controller[3502095]: 2024/03/27 17:31:08 http: Accept error: accept tcp 127.0.0.1:8003: accept4: too many open files; retrying in 1s

The user deployed with our provided arvados-controller.service definition, which sets LimitNOFILE=65536, and the controller apparently exceeded that limit. The failures seem to correlate with when keep-balance starts its work.
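A quick way to confirm what the running process actually got (these are standard systemd/procfs diagnostics; the unit and binary names below are assumed to match the log line above):

    systemctl show arvados-controller.service -p LimitNOFILE
    grep 'open files' /proc/$(pidof arvados-controller)/limits
    ls /proc/$(pidof arvados-controller)/fd | wc -l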

The user has had the cluster for over a year and only encountered this issue while running a workflow with a particularly large multithreaded job. The error has happened twice with this job in the mix (not necessarily the same container; the workflow itself is also being debugged).

I note this from the systemd documentation for LimitNOFILE:

Don't use. Be careful when raising the soft limit above 1024, since select(2) cannot function with file descriptors above 1023 on Linux. Nowadays, the hard limit defaults to 524288, a very high value compared to historical defaults. Typically applications should increase their soft limit to the hard limit on their own, if they are OK with working with file descriptors above 1023, i.e. do not use select(2). Note that file descriptors are nowadays accounted like any other form of memory, thus there should not be any need to lower the hard limit. Use MemoryMax= to control overall service memory use, including file descriptor memory.

  • Does controller use select? Is it possible we're exceeding 1024 open connections, and then hitting that documented limit?
  • Is it possible the controller is leaking file descriptors? I can go get other logs if someone can give me at least a general sense of what I should be looking for.
  • If we're not using select and we think everything is well behaved, I think it would be a nice change to have controller raise its own soft limit as suggested (a sketch of that follows below).
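For reference, a minimal sketch of what raising the soft limit could look like at controller startup, assuming Linux and the standard library syscall package (the function name and placement here are hypothetical; newer Go releases reportedly raise the soft RLIMIT_NOFILE to the hard limit automatically when the os package is imported, which is worth confirming for the toolchain we build with):

    package main

    import (
        "log"
        "syscall"
    )

    // raiseNOFILE raises this process's soft RLIMIT_NOFILE to its hard limit,
    // so an inherited low soft limit does not cap the number of open connections.
    func raiseNOFILE() error {
        var rlim syscall.Rlimit
        if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
            return err
        }
        if rlim.Cur < rlim.Max {
            rlim.Cur = rlim.Max
            return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rlim)
        }
        return nil
    }

    func main() {
        if err := raiseNOFILE(); err != nil {
            log.Printf("raising RLIMIT_NOFILE: %v", err)
        }
        // ... controller startup would continue here ...
    }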

Things to investigate

  • Did we actually have 64K file descriptors open, or was the effective limit actually lower? I could believe hitting a 1K or 8K limit, but 64K seems like a lot.
  • Does the Go runtime mess with 'nofile' on its own?
  • Do we have a file descriptor leak somewhere?
  • Do we have metrics for the number of open file descriptors and the current limit? (See the sketch after this list.)
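On the metrics question: if controller already registers the standard Prometheus Go process collector, it should expose process_open_fds and process_max_fds on Linux, which is worth checking. Either way, the numbers are cheap to read in-process; a minimal Linux-only sketch, assuming procfs is available:

    package main

    import (
        "fmt"
        "os"
        "syscall"
    )

    func main() {
        // Count open descriptors by listing this process's fd directory (Linux-only).
        fds, err := os.ReadDir("/proc/self/fd")
        if err != nil {
            panic(err)
        }
        // Read the current soft and hard RLIMIT_NOFILE values.
        var rlim syscall.Rlimit
        if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
            panic(err)
        }
        fmt.Printf("open fds: %d  soft limit: %d  hard limit: %d\n", len(fds), rlim.Cur, rlim.Max)
    }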

Subtasks 1 (0 open, 1 closed)

Task #21758: Review 21640-max-nofile (Resolved, Tom Clegg, 05/24/2024)