keepstore no longer functioning with repeated errors: http: Accept error: accept tcp [::]:25107: accept4: too many open files; retrying in 1s
Every entry in the keepstore logs on one of my keepstores looks something like this:
2017-12-20_19:08:26.79910 2017/12/20 19:08:26 http: Accept error: accept tcp [::]:25107: accept4: too many open files; retrying in 1s
The keepstore is not accepting connections.
After a restart, the keepstore started functioning again.
ii keepstore 0.1.20171102145827.cc6f86f-1 amd64 Keep storage daemon, accessible to clients on the LAN
#1 Updated by Ward Vandewege 12 months ago
- Status changed from New to In Progress
- Assigned To set to Ward Vandewege
- Target version set to 2017-12-20 Sprint
We see this sometimes on heavily loaded keepstores.
On Debian(ish) systems, the default limit for file descriptors is 1024.
If you have a recent kernel and OS, you can use the `prlimit` tool to raise the file descriptor limit of a running process.
Alternatively, `ulimit -n` can raise it at process start time.
Does this help?
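For reference, here is a minimal sketch of what raising the limit amounts to, written in Python purely for illustration (keepstore itself is not Python, and the function name `raise_nofile_soft_limit` is made up for this example). It raises the soft RLIMIT_NOFILE of the current process, which is roughly what `ulimit -n` does for a shell and its children. An unprivileged process can raise only the soft limit, and only as far as the hard limit.

```python
import resource


def raise_nofile_soft_limit(target):
    """Raise the soft RLIMIT_NOFILE of this process toward `target`.

    The request is capped at the hard limit, because an unprivileged
    process cannot exceed it. Returns the (soft, hard) pair in effect
    after the call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Cap the request at the hard limit unless the hard limit is unlimited.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)

    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft, hard

    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_nofile_soft_limit(8192))
```

The new limit applies to the calling process and to anything it starts afterwards, so in practice this kind of call (or `ulimit -n`) belongs in whatever launches the daemon.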
#2 Updated by Joshua Randall 12 months ago
I suspect so. My fix was also to add `ulimit -n 1048576` to the runit script that starts keepstore, and the problem hasn't come back so far.
I reported it as a bug because (a) this is not documented in the install docs and (b) it seems like there is a problem with the way keepstore handles open files if it can get into a state where it fails to accept every single incoming connection for many days. If it only occasionally hit "too many open files" under heavy load and then recovered later (presumably because it closed some sockets), then (b) would not be an issue, but it would still be good to have a doc fix for (a). As it is, it seems like something must be wrong with the way keepstore closes files for it to be possible to get into a state where it is perpetually out of file descriptors.
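One way to tell a leak from plain load is to watch the number of open descriptors over time: a count that only ever climbs toward the limit suggests descriptors are not being closed, while a count that rises and falls with traffic suggests the limit is simply too low. Below is a rough Python sketch of such a check, assuming a Linux system with /proc mounted; `count_open_fds` is a hypothetical helper name, not part of keepstore or Arvados.

```python
import os


def count_open_fds(pid="self"):
    """Count the file descriptors open in a process, via Linux /proc.

    `pid` can be a numeric process ID or "self" for the current process.
    When inspecting the current process, the listing itself briefly uses
    one descriptor, so compare differences between calls rather than
    reading the absolute number too literally.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    before = count_open_fds()
    read_end, write_end = os.pipe()   # opens two descriptors
    during = count_open_fds()
    os.close(read_end)
    os.close(write_end)
    after = count_open_fds()
    print(before, during, after)      # the count goes up by two, then back
```

Pointing this at the keepstore process ID (with sufficient privileges) at intervals would show which of the two patterns applies.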
#6 Updated by Ward Vandewege 11 months ago
We suspect there is a bug here in an error path. We'll investigate.
In any case, perhaps we should increase the default value in the systemd config file at /lib/systemd/system/keepstore.service.
Perhaps 8192 is a better default:
# Copyright (C) The Arvados Authors. All rights reserved.
#
# SPDX-License-Identifier: AGPL-3.0

[Unit]
Description=Arvados Keep Storage Daemon
Documentation=https://doc.arvados.org/
After=network.target
AssertPathExists=/etc/arvados/keepstore/keepstore.yml
# systemd<230
StartLimitInterval=0
# systemd>=230
StartLimitIntervalSec=0

[Service]
Type=notify
LimitNOFILE=8192
ExecStart=/usr/bin/keepstore
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
Overriding per system is also possible: put a file at /etc/systemd/system/keepstore.service.d/override.conf with, for instance, these contents: