Bug #21640
Closed: Controller uses >64K file descriptors, causing cluster outage
Description
User's Arvados controller started failing with this error on all incoming requests:
Mar 27 17:31:08 arvados-controller[3502095]: 2024/03/27 17:31:08 http: Accept error: accept tcp 127.0.0.1:8003: accept4: too many open files; retrying in 1s
User deployed with our provided arvados-controller.service definition, which sets LimitNOFILE=65536. Controller apparently exceeded this limit. This seems to correlate with when keep-balance starts working.
User has had the cluster for over a year and only encountered this issue when running a workflow with a particularly large multithreaded job. The error has happened twice with this job in the mix (not necessarily the same container; the workflow itself is also being debugged).
I note this LimitNOFILE documentation:
Don't use. Be careful when raising the soft limit above 1024, since select(2) cannot function with file descriptors above 1023 on Linux. Nowadays, the hard limit defaults to 524288, a very high value compared to historical defaults. Typically applications should increase their soft limit to the hard limit on their own, if they are OK with working with file descriptors above 1023, i.e. do not use select(2). Note that file descriptors are nowadays accounted like any other form of memory, thus there should not be any need to lower the hard limit. Use MemoryMax= to control overall service memory use, including file descriptor memory.
- Does controller use select? Is it possible we're exceeding 1024 open connections, and then hitting that documented limit?
- Is it possible the controller is leaking file descriptors? I can go get other logs if someone can give me at least a general sense of what I should be looking for.
- If we're not using select and we think everything is well-behaved, I think it would be a nice change to have controller raise its own soft limit as suggested (a sketch of what that could look like follows below).
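For reference, a minimal Go sketch of what "raise its own soft limit to the hard limit" would look like if controller did it explicitly; this is just the standard Getrlimit/Setrlimit pattern, not existing Arvados code:

package main

import (
	"fmt"
	"syscall"
)

// raiseNOFILE raises this process's soft RLIMIT_NOFILE to the hard limit,
// which is what the systemd documentation suggests applications do themselves.
func raiseNOFILE() error {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return err
	}
	if lim.Cur >= lim.Max {
		return nil // soft limit already at the hard limit
	}
	lim.Cur = lim.Max
	return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim)
}

func main() {
	if err := raiseNOFILE(); err != nil {
		fmt.Println("could not raise RLIMIT_NOFILE:", err)
	}
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err == nil {
		fmt.Printf("RLIMIT_NOFILE soft=%d hard=%d\n", lim.Cur, lim.Max)
	}
}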
Things to investigate
- Did we actually have 64K file descriptors open, or was the limit actually lower? I could believe hitting a 1K or 8K limit, but 64K seems like a lot.
- Does the Go runtime mess with 'nofile'?
- Do we have a file descriptor leak somewhere?
- Do we have metrics like number of open files and current limit? (A sketch of one way to collect those follows below.)
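On that last point, a rough Go sketch of how such values could be collected on Linux, by counting the entries in /proc/self/fd and reading the rlimit. The metric names in the comments are hypothetical (though, if memory serves, the stock Prometheus Go client's process collector exposes similar process_open_fds / process_max_fds gauges):

package main

import (
	"fmt"
	"os"
	"syscall"
)

// openFDCount counts the file descriptors currently open in this process by
// listing /proc/self/fd (Linux-specific; the listing itself briefly uses one
// extra descriptor).
func openFDCount() (int, error) {
	ents, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(ents), nil
}

func main() {
	n, err := openFDCount()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// These values could be exported as gauges (e.g. open_fds, max_fds).
	fmt.Printf("open_fds=%d soft_limit=%d hard_limit=%d\n", n, lim.Cur, lim.Max)
}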
Updated by Peter Amstutz 8 months ago
- Target version set to Development 2024-05-22 sprint
Updated by Peter Amstutz 7 months ago
- Target version changed from Development 2024-05-22 sprint to Development 2024-06-05 sprint
Updated by Tom Clegg 7 months ago
Typical system default is LimitNOFILE=1024:524288 (soft:hard).
1024 was too low, so in 73018eed5cd93c9db7467760f5f1484fa2750554 (Feb 2020) we added LimitNOFILE=65536 to our systemd service files.
1024 was too low for everybody else too, so starting with Go 1.19 (Aug 2022) any Go program that imports the os package automatically raises its own soft limit to the hard limit. (See the Go release notes, the Go commit, and the systemd advice quoted above: "Typically applications should increase their soft limit to the hard limit on their own".)
So now our LimitNOFILE=65536 setting effectively reduces the limit from 524288 to 65536.
The default hard limit is 524288 on debian:11, debian:12, ubuntu:20.04, ubuntu:22.04.
So, I think the most obvious improvement is to remove LimitNOFILE from our service files, and enjoy the default limit of 524288.
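For anyone who wants to double-check the effective limit on a deployed host, here is a small throwaway Go sketch (not part of this branch; the checklimit name is made up) that prints the "Max open files" line from /proc/<pid>/limits, which shows the limits as the process itself sees them, including any soft-limit raise the Go runtime did at startup:

package main

import (
	"fmt"
	"os"
	"strings"
)

// Print the "Max open files" soft/hard limits of a running process, as read
// from /proc/<pid>/limits.
func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: checklimit <pid>")
		os.Exit(2)
	}
	data, err := os.ReadFile("/proc/" + os.Args[1] + "/limits")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "Max open files") {
			fmt.Println(line)
		}
	}
}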
21640-max-nofile @ e672f484160faae900fb7f7e281d06952fd35d28 -- developer-run-tests: #4245
Updated by Tom Clegg 7 months ago
- Status changed from New to In Progress
- File node_filefd_allocated_2023_2024.png added
- File node_filefd_allocated_2023-11-15.png added
The Prometheus node exporter provides a per-host node_filefd_allocated metric. Our own dev/test/playground clusters report no more than 30K open files on any host (let alone any single service) at any point in the last year.
That one big blip (visible in the attached node_filefd_allocated graphs) looks a little suspicious, but I haven't figured out what caused it. Arvados doesn't seem to have been especially busy at that time.
Updated by Brett Smith 7 months ago
Maybe a lesson for future work here is: if you want to increase a soft limit, make sure to write it as soft:hard unless you specifically want to lower the hard limit. It sounds like this issue probably wouldn't have even come up if we had originally written LimitNOFILE=65536:524288 in our service definitions.
Updated by Brett Smith 7 months ago
21640-max-nofile @ e672f484160faae900fb7f7e281d06952fd35d28
For posterity, yes I'm reviewing this without the checklist, because I confirm it is a single commit that only removes configuration from systemd service units. There's basically no point in running tests because nothing is actually going to test this, there's no style guide to follow, etc.
Also just for posterity, I think it's worth calling out:
diff --git a/services/crunch-dispatch-local/crunch-dispatch-local.service b/services/crunch-dispatch-local/crunch-dispatch-local.service
index b4fc10f83e..f40359de2e 100644
--- a/services/crunch-dispatch-local/crunch-dispatch-local.service
+++ b/services/crunch-dispatch-local/crunch-dispatch-local.service
@@ -11,11 +11,8 @@ StartLimitIntervalSec=0
Restart=always
RestartSec=1
-LimitNOFILE=1000000
[Install]
This change actually does lower the limit for crunch-dispatch-local, because now we're going from 1 million to 512Ki (524288) file handles. I'm okay with that: the 1000000 value was included in the original commit of the file, so there's no history to explain it, and I think if 512Ki wasn't good enough for us we would see that in other services like controller before we saw it in crunch-dispatch-local.
As long as we're at least vaguely on the same page about that, looks good to me.
Updated by Tom Clegg 7 months ago
21640-max-nofile @ e672f484160faae900fb7f7e281d06952fd35d28 -- developer-run-tests: #4245
(Turns out I did run the tests but missed posting the link.)
Updated by Tom Clegg 7 months ago
Brett Smith wrote in #note-12:
This change actually does lower the limit for crunch-dispatch-local
Agreed. Given the absence of commentary in git, I'm assuming 1000000 was chosen the same way 65536 was ("a number much bigger than 1024") and doesn't indicate any suspicion that the default hard limit of 524288 was too low.
Updated by Tom Clegg 7 months ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|42900afaf6a23009c4f8e7559e0559c198cf8ddf.