Bug #20637


Large number of collections ties up all connections?

Added by Peter Amstutz over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: FUSE
Story points: -
Release relationship: Auto

Description

The user container is getting FUSE errors; the error message in arv-mount.txt is "Failed to connect to 172.17.0.1 port 36323: Connection refused".

In addition, crunch-run.txt is reporting "error updating log collection: error recording logs: Could not write sufficient replicas ... dial tcp 172.17.0.1:36323 conne" (presumably "connection refused", but the message is truncated).

This is with a local compute-node keepstore. The keepstore service must have been working initially, because it was able to load the Docker image and write the initial log collection snapshot. Subsequently it has not been able to update the log collection, failing with the error above.

This suggests the keepstore service crashed. startLocalKeepstore uses the health check to determine when the service has started, but does not set up an ongoing watchdog to ensure the service remains available.
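
For illustration, a minimal sketch of such a watchdog in Python, assuming keepstore's /_health/ping endpoint; crunch-run itself is written in Go, so this only shows the shape of the fix, and the on_failure callback and polling interval are made up:

    # Minimal watchdog sketch. crunch-run is Go; this just shows the idea.
    # `on_failure` is a hypothetical callback (e.g. fail the container).
    import threading
    import time
    import urllib.request

    def watch_keepstore(base_url, on_failure, interval=10):
        """Poll keepstore until it stops answering, then report the failure once."""
        def poll():
            while True:
                time.sleep(interval)
                try:
                    # The real /_health/ping endpoint also requires a
                    # management token; omitted here for brevity.
                    urllib.request.urlopen(base_url + "/_health/ping",
                                           timeout=5).close()
                except OSError as exc:  # connection refused, timeout, HTTP error
                    on_failure("keepstore health check failed: %s" % exc)
                    return
        threading.Thread(target=poll, daemon=True,
                         name="keepstore-watchdog").start()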

Is it possible some kind of connectivity issue could cause keepstore to quit?

Also, keepstore didn't log anything, which is mysterious. I seem to recall an issue a few months ago where the default logging level was too quiet?

Update:

Ran the workflow again and looked into it.

There is a very large number of collections, several thousand.

arv-mount is reporting "can't start new thread"

And then later on, it starts reporting "Connection refused"

I think this is what is happening:

  1. Each collection gets its own instance of BlockManager
  2. Each BlockManager has its own pool of "put" threads and "get" threads (prefetch)
  3. If the maximum number of threads is ~4096 and there are more than 2000 collections, it will eventually run out of threads (see the sketch after this list)
  4. Meanwhile, all those threads are making connections to keepstore
    1. They should be using the same keep client but ???
    2. If all the "user agents" are tied up, it'll allocate a new one
  5. If all those threads and all those connections have HTTP keepalive, they eventually use up the ~4096 connections that keepstore can have by default, resulting in "connection refused" errors.
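
To make step 3 concrete, here is a minimal illustration (no Arvados involved) of that failure mode: every object gets its own thread pool, the worker threads never exit, and thread creation eventually fails. The numbers are arbitrary stand-ins, and where it actually fails depends on the process's ulimits:

    # One thread pool per object, workers that never exit: eventually
    # pthread_create fails and Python raises "can't start new thread",
    # matching the error arv-mount logged.
    from concurrent.futures import ThreadPoolExecutor

    pools = []
    try:
        for _ in range(100000):  # stand-in for thousands of collections
            pool = ThreadPoolExecutor(max_workers=2)  # stand-in for put/get threads
            pool.submit(lambda: None)  # force a worker thread to actually start
            pools.append(pool)
    except RuntimeError as exc:
        print(f"gave up after {len(pools)} pools: {exc}")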

Solution:

The KeepClient needs to be shared (it should be already, but double-check).

The "get" and "put" thread pools should be shared (new behavior, maybe the thread pool moves to the Keep client).

Need to identify whether there are any resource leaks related to lingering connections; e.g., when FUSE evicts a collection from the cache, it should make sure the collection's block manager is shut down.
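
On the eviction side, a sketch with hypothetical names; it assumes the block manager exposes a stop_threads()-style shutdown hook that the FUSE cache can call when it drops a collection:

    # Hypothetical eviction path: shut the block manager down explicitly so
    # its threads and keepalive connections are released immediately,
    # rather than whenever garbage collection gets around to it.
    class CollectionCache:
        def __init__(self):
            self._entries = {}  # uuid -> collection object

        def evict(self, uuid):
            collection = self._entries.pop(uuid, None)
            if collection is not None:
                # assumed shutdown hook on the collection's block manager
                collection.block_manager.stop_threads()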


Files

makecollections.py (824 Bytes), Peter Amstutz, 06/16/2023 05:44 PM

Subtasks 1 (0 open, 1 closed)

Task #20657: Review 20637-prefetch-threads (Resolved, Peter Amstutz, 06/16/2023)
