Bug #12859


keepstore no longer functioning with repeated errors: http: Accept error: accept tcp [::]:25107: accept4: too many open files; retrying in 1s

Added by Joshua Randall almost 5 years ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Keep
Target version:
Start date: 11/01/2021
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

Every entry in the keepstore logs on one of my keepstores looks something like this:

2017-12-20_19:08:26.79910 2017/12/20 19:08:26 http: Accept error: accept tcp [::]:25107: accept4: too many open files; retrying in 1s

The keepstore is not accepting connections.

After a restart, the keepstore started functioning again.

Version:

ii  keepstore                        0.1.20171102145827.cc6f86f-1      amd64        Keep storage daemon, accessible to clients on the LAN


Subtasks 1 (0 open, 1 closed)

Task #18317: Review 12859-keepstore-fd-leak (Resolved, Tom Clegg, 11/01/2021)

Actions #1

Updated by Ward Vandewege almost 5 years ago

  • Status changed from New to In Progress
  • Assigned To set to Ward Vandewege
  • Target version set to 2017-12-20 Sprint

Hey Josh,

We see this sometimes on heavily loaded keepstores.

On Debian(ish) systems, the default limit for file descriptors is 1024.

If you have a recent kernel and OS, you can use the `prlimit` tool to modify the file descriptor limit upwards for a running process.

Alternatively, ulimit can do it too at process start time.
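A quick sketch of both approaches (the 8192 value and the keepstore PID lookup are illustrative; pick a limit suited to your load):

```shell
# Show the current soft limit on open files for this shell.
ulimit -Sn

# Lowering the limit is always allowed; raising it above the hard limit
# needs root. Demonstrated in a subshell so this shell is unaffected:
bash -c 'ulimit -n 1024; ulimit -n'

# For an already-running keepstore, prlimit (util-linux) changes the
# limit in place without a restart:
#   prlimit --pid "$(pidof keepstore)" --nofile=8192:8192
```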

Does this help?

Thanks,
Ward.

Actions #2

Updated by Joshua Randall almost 5 years ago

I suspect so; my fix was indeed to add `ulimit -n 1048576` to the runit script that starts keepstore, and the problem hasn't come back so far.

I reported it as a bug because (a) this is not documented in the install docs, and (b) there seems to be a problem with the way keepstore handles open files if it can get into a state where it fails to accept every single incoming connection for many days. If it only occasionally hit a "too many open files" issue under heavy load and later started working again (presumably because it closed some sockets), then (b) would not be an issue, but a doc fix for (a) would still be good. As it is, it seems something must be wrong with how keepstore closes files for it to be possible to get into a state where it is perpetually out of file descriptors.

Actions #3

Updated by Tom Morris almost 5 years ago

  • Assigned To deleted (Ward Vandewege)
  • Target version changed from 2017-12-20 Sprint to 2018-01-17 Sprint
Actions #4

Updated by Tom Morris almost 5 years ago

  • Target version changed from 2018-01-17 Sprint to Arvados Future Sprints
Actions #6

Updated by Ward Vandewege almost 5 years ago

We suspect there is a bug here in an error path. We'll investigate.

In any case, perhaps we should increase the default value in the systemd config file at /lib/systemd/system/keepstore.service.

Perhaps 8192 is a better default:

# Copyright (C) The Arvados Authors. All rights reserved.
#
# SPDX-License-Identifier: AGPL-3.0

[Unit]
Description=Arvados Keep Storage Daemon
Documentation=https://doc.arvados.org/
After=network.target
AssertPathExists=/etc/arvados/keepstore/keepstore.yml
# systemd<230
StartLimitInterval=0
# systemd>=230
StartLimitIntervalSec=0

[Service]
Type=notify
LimitNOFILE=8192
ExecStart=/usr/bin/keepstore
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target

Overriding per system is also possible: put a file at /etc/systemd/system/keepstore.service.d/override.conf with, for instance, these contents:

[Service]
LimitNOFILE=8192
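As a hedged usage sketch (the unit name and PID lookup assume the stock packaging above): after installing the override, reload systemd, restart the service, and confirm the limit the process actually received:

```shell
# Pick up the override and restart the daemon:
#   sudo systemctl daemon-reload
#   sudo systemctl restart keepstore
#   systemctl show keepstore -p LimitNOFILE

# The effective limit of any running process is visible in /proc;
# shown here for the current shell, substitute keepstore's PID:
grep 'Max open files' "/proc/$$/limits"
```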
Actions #7

Updated by Peter Amstutz over 1 year ago

  • Target version deleted (Arvados Future Sprints)
Actions #9

Updated by Tom Clegg about 1 year ago

  • Target version set to 2021-11-10 sprint
  • Assigned To set to Tom Clegg
Actions #10

Updated by Ward Vandewege about 1 year ago

Tom Clegg wrote:

12859-keepstore-fd-leak @ 9ff58642c2c1eaae8d6dd7c01e8fb85a030c2dfc -- developer-run-tests: #2762

Only one question, on line 295 in `services/keepstore/unix_volume.go`, is there a reason to keep that explicit call to tmpfile.Close() ? We now have a new `defer tmpfile.Close()` higher up, so...

LGTM, otherwise!

Actions #11

Updated by Tom Clegg about 1 year ago

Ward Vandewege wrote:

is there a reason to keep that explicit call to tmpfile.Close() ? We now have a new `defer tmpfile.Close()` higher up, so...

Only to detect errors (unflushed write?). The deferred Close just covers the return-early-because-we-hit-some-other-error-first case, where we don't care whether Close fails.
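The pattern under discussion can be sketched like this (a standalone illustration, not the actual unix_volume.go code; names are hypothetical):

```go
package main

import (
	"fmt"
	"os"
)

// writeTemp shows why the explicit Close stays: the deferred Close only
// guards early-return paths, where a Close error is uninteresting, while
// the checked Close at the end reports e.g. a write that fails at flush.
func writeTemp(data []byte) error {
	tmpfile, err := os.CreateTemp("", "example-")
	if err != nil {
		return err
	}
	defer os.Remove(tmpfile.Name())
	// Guards every early return below; on the success path the second
	// Close is a harmless no-op whose error we ignore.
	defer tmpfile.Close()

	if _, err := tmpfile.Write(data); err != nil {
		return err
	}
	// Checked explicitly so an unflushed write surfaces to the caller.
	return tmpfile.Close()
}

func main() {
	fmt.Println(writeTemp([]byte("hello")))
}
```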

Actions #12

Updated by Tom Clegg about 1 year ago

As noted in the commit message, 9ff58642c fixes a file descriptor leak when a filesystem-backed volume has Serialize: true and the client hangs up while waiting for another operation to finish on the same volume. When clients are piling up faster than keepstore can serve requests, many of the clients would time out at that stage, so this seems like a plausible explanation for the reported problem.

Actions #13

Updated by Tom Clegg about 1 year ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|13dea78ff8ad72bd8a406843bb702aef0d0f569a.

Actions #14

Updated by Peter Amstutz about 1 year ago

  • Release set to 45