Bug #21640 (closed)

Controller uses >64K file descriptors, causing cluster outage

Added by Brett Smith 8 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: API
Story points: -
Release: 70
Release relationship: Auto

Description

User's Arvados controller started failing with this error on all incoming requests:

Mar 27 17:31:08 arvados-controller[3502095]: 2024/03/27 17:31:08 http: Accept error: accept tcp 127.0.0.1:8003: accept4: too many open files; retrying in 1s

User deployed with our provided arvados-controller.service definition, which sets LimitNOFILE=65536. Controller apparently exceeded this limit. This seems to correlate with when keep-balance starts working.

User has had the cluster for over a year and only encountered this issue when running a workflow with a particularly large multithreaded job. The error has happened twice with this job in the mix (not necessarily the same container; the workflow is also being debugged).

I note this LimitNOFILE documentation:

Don't use. Be careful when raising the soft limit above 1024, since select(2) cannot function with file descriptors above 1023 on Linux. Nowadays, the hard limit defaults to 524288, a very high value compared to historical defaults. Typically applications should increase their soft limit to the hard limit on their own, if they are OK with working with file descriptors above 1023, i.e. do not use select(2). Note that file descriptors are nowadays accounted like any other form of memory, thus there should not be any need to lower the hard limit. Use MemoryMax= to control overall service memory use, including file descriptor memory.

  • Does controller use select? Is it possible we're exceeding 1024 open connections, and then hitting that documented limit?
  • Is it possible the controller is leaking file descriptors? I can go get other logs if someone can give me at least a general sense of what I should be looking for.
  • If we're not using select and we think everything is well-behaved, I think it would be a nice change to have controller raise its own soft limit as suggested (a sketch of what that looks like follows this list).
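
For reference, here is a minimal standalone sketch (not controller code) of what "raise your own soft limit to the hard limit" looks like in Go, using only the standard library:

// Minimal sketch: raise this process's RLIMIT_NOFILE soft limit to the
// hard limit, as the systemd documentation suggests applications do.
// (Go 1.19+ already does this automatically for programs importing "os".)
package main

import (
    "fmt"
    "log"
    "syscall"
)

func main() {
    var lim syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
        log.Fatalf("getrlimit: %v", err)
    }
    fmt.Printf("before: soft=%d hard=%d\n", lim.Cur, lim.Max)

    if lim.Cur < lim.Max {
        lim.Cur = lim.Max // raise soft limit all the way to the hard limit
        if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
            log.Fatalf("setrlimit: %v", err)
        }
    }

    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
        log.Fatalf("getrlimit: %v", err)
    }
    fmt.Printf("after:  soft=%d hard=%d\n", lim.Cur, lim.Max)
}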

Things to investigate

  • Did we actually have 64k file descriptors open, or was the limit actually lower? I could believe hitting a 1k or 8k limit, but 64k seems like a lot.
  • Does the Go runtime mess with 'nofile'?
  • Do we have a file descriptor leak somewhere?
  • Do we have metrics like the number of open files and the current limit? (A sketch of how such metrics are typically exposed follows after this list.)
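
On the metrics question, here is a generic sketch (not Arvados code) of how a Go service can expose per-process file descriptor metrics with the standard Prometheus Go client; its process collector publishes process_open_fds and process_max_fds. The listen address below is arbitrary.

// Hypothetical standalone sketch: expose per-process file descriptor
// metrics over HTTP using the Prometheus Go client. The process
// collector publishes process_open_fds and process_max_fds.
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/collectors"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    reg := prometheus.NewRegistry()
    // Reads /proc/self to report open fds, max fds, memory, CPU, etc.
    reg.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))

    http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
    // Arbitrary example address; any metrics port would do.
    log.Fatal(http.ListenAndServe("127.0.0.1:9898", nil))
}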

Subtasks 1 (0 open, 1 closed)

Task #21758: Review 21640-max-nofile (Resolved, Tom Clegg, 05/24/2024)
#3

Updated by Brett Smith 8 months ago

  • Description updated (diff)
#4

Updated by Peter Amstutz 7 months ago

  • Target version set to Development 2024-05-22 sprint
#5

Updated by Peter Amstutz 7 months ago

  • Description updated (diff)
#6

Updated by Peter Amstutz 7 months ago

  • Assigned To set to Tom Clegg
#7

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-05-22 sprint to Development 2024-06-05 sprint
#8

Updated by Tom Clegg 6 months ago

Typical system default is LimitNOFILE=1024:524288 (soft:hard).

1024 was too low, so in 73018eed5cd93c9db7467760f5f1484fa2750554 (Feb 2020) we added LimitNOFILE=65536 to our systemd service files.

1024 was too low for everybody else too, so starting with Go 1.19 (Aug 2022) any Go program that imports the os package automatically raises its own soft limit to the hard limit. (See the Go release notes, the Go commit, and the systemd advice: "Typically applications should increase their soft limit to the hard limit on their own.")

So now our LimitNOFILE=65536 solution is effectively reducing the limit from 524288 to 65536.

The default hard limit is 524288 on debian:11, debian:12, ubuntu:20.04, ubuntu:22.04.

So, I think the most obvious improvement is to remove LimitNOFILE from our service files, and enjoy the default limit of 524288.

21640-max-nofile @ e672f484160faae900fb7f7e281d06952fd35d28 -- developer-run-tests: #4245

#9

Updated by Tom Clegg 6 months ago

The Prometheus node exporter provides a per-host node_filefd_allocated metric. Our own dev/test/playground clusters report no more than 30K open files on any host (let alone any single service) at any point in the last year.

That one big blip looks a little suspicious, but I haven't figured out what caused it. Arvados doesn't seem to have been especially busy at that time.
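
For reference, a rough sketch (assuming the stock node exporter filefd collector) of where that host-level node_filefd_allocated number comes from: the first field of /proc/sys/fs/file-nr, with the kernel-wide maximum in the third field.

// Rough sketch: node_filefd_allocated is the first field of
// /proc/sys/fs/file-nr (allocated file handles, host-wide); the third
// field is the kernel maximum (fs.file-max).
package main

import (
    "fmt"
    "log"
    "os"
    "strings"
)

func main() {
    data, err := os.ReadFile("/proc/sys/fs/file-nr")
    if err != nil {
        log.Fatal(err)
    }
    fields := strings.Fields(string(data)) // "allocated unused maximum"
    if len(fields) != 3 {
        log.Fatalf("unexpected format: %q", data)
    }
    fmt.Printf("allocated=%s maximum=%s\n", fields[0], fields[2])
}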

#10

Updated by Brett Smith 6 months ago

Maybe a lesson for future work here is: if you want to increase a soft limit, make sure to write it as soft:hard unless you specifically want to lower the hard limit. It sounds like this issue probably wouldn't have even come up if we had originally written LimitNOFILE=65536:524288 in our service definitions.

#11

Updated by Tom Clegg 6 months ago

...or if systemd had offered a way to write "raise soft limit, leave hard limit alone".

#12

Updated by Brett Smith 6 months ago

21640-max-nofile @ e672f484160faae900fb7f7e281d06952fd35d28

For posterity, yes I'm reviewing this without the checklist, because I confirm it is a single commit that only removes configuration from systemd service units. There's basically no point in running tests because nothing is actually going to test this, there's no style guide to follow, etc.

Also just for posterity, I think it's worth calling out:

diff --git a/services/crunch-dispatch-local/crunch-dispatch-local.service b/services/crunch-dispatch-local/crunch-dispatch-local.service
index b4fc10f83e..f40359de2e 100644
--- a/services/crunch-dispatch-local/crunch-dispatch-local.service
+++ b/services/crunch-dispatch-local/crunch-dispatch-local.service
@@ -11,11 +11,8 @@ StartLimitIntervalSec=0
 Restart=always
 RestartSec=1
-LimitNOFILE=1000000

 [Install]

This change actually does lower the limit for crunch-dispatch-local, because now we're going from 1 million to 512 kibi-file-handles (524288). I'm okay with that: the LimitNOFILE=1000000 line was included in the original commit of the file, so there's no history to explain it, and I think if 512Ki weren't good enough for us we would see that in other services like controller before we saw it in crunch-dispatch-local.

As long as we're even vaguely on the same page about that, this looks good to me.

#13

Updated by Tom Clegg 6 months ago

21640-max-nofile @ e672f484160faae900fb7f7e281d06952fd35d28 -- developer-run-tests: #4245

(Turns out I did run the tests but missed posting the link.)

#14

Updated by Tom Clegg 6 months ago

Brett Smith wrote in #note-12:

This change actually does lower the limit for crunch-dispatch-local

Agree. Given the absence of any commentary in git, I'm assuming 1000000 was chosen the same way 65536 was chosen ("a number much bigger than 1024") and that it doesn't indicate any suspicion that the default hard limit of 524288 was too low.

#15

Updated by Tom Clegg 6 months ago

  • Status changed from In Progress to Resolved
#16

Updated by Peter Amstutz 3 months ago

  • Release set to 70