Bug #20637 (closed)

Large number of collections ties up all connections?

Added by Peter Amstutz 11 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: FUSE
Story points: -
Release relationship: Auto

Description

A user container is getting FUSE errors; the error message in arv-mount.txt is "Failed to connect to 172.17.0.1 port 36323: Connection refused".

In addition, crunch-run.txt is reporting "error updating log collection: error recording logs: Could not write sufficient replicas ... dial tcp 172.17.0.1:36323 conne" (presumably "connection refused", but the message is truncated).

This is with a local compute node keepstore. The keepstore service must have been working initially, because it was able to load the Docker image and write the initial log collection snapshot. Subsequently it has been unable to update the log collection, failing with the error above.

This suggests the keepstore service crashed. startLocalKeepstore uses the health check to determine when the service has started, but does not set up an ongoing watchdog to ensure the service continues to be available.
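A minimal sketch of the kind of ongoing watchdog described above, in Python for illustration only (the real startLocalKeepstore lives in crunch-run, which is Go; the health URL, interval, and callback here are assumed names, not actual code):

    import time
    import threading
    import urllib.request

    def keepstore_watchdog(health_url, on_failure, interval=10):
        # Poll the keepstore health endpoint in the background and report
        # once it stops answering. health_url, on_failure, and interval
        # are hypothetical parameters for this sketch.
        def poll():
            while True:
                try:
                    with urllib.request.urlopen(health_url, timeout=5) as resp:
                        if resp.status != 200:
                            on_failure("health check returned %d" % resp.status)
                            return
                except OSError as err:
                    on_failure("health check failed: %s" % err)
                    return
                time.sleep(interval)
        thread = threading.Thread(target=poll, daemon=True)
        thread.start()
        return thread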

Is it possible some kind of connectivity issue could cause keepstore to quit?

Also, keepstore didn't log anything, which is mysterious; I seem to recall an issue a few months ago with the default logging level being too quiet.

Update:

Ran the workflow again and looked into it.

There is a very large number of collections, several thousand.

arv-mount is reporting "can't start new thread"

And then later on, it starts reporting "Connection refused"

I think this is what is happening:

  1. Each collection gets its own instance of BlockManager.
  2. Each BlockManager has its own pool of "put" threads and "get" (prefetch) threads.
  3. If the maximum number of threads is ~4096 and there are > 2000 collections, the process eventually runs out of threads (see the sketch after this list).
  4. Meanwhile, all those threads are making connections to keepstore.
    1. They should be using the same keep client, but ???
    2. If all the "user agents" are tied up, it'll allocate a new one.
  5. If all those threads and all those connections use HTTP keepalive, they eventually exhaust the ~4096 connections that keepstore can accept by default, resulting in "connection refused" errors.
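A toy illustration (not Arvados code) of steps 1-3: give each of a few thousand objects its own worker threads, and the process hits its thread limit with the same "can't start new thread" error arv-mount reported.

    import queue
    import threading

    class ToyBlockManager:
        # Stand-in for a per-collection BlockManager with its own threads.
        def __init__(self, num_workers=2):
            self._queue = queue.Queue()
            for _ in range(num_workers):
                t = threading.Thread(target=self._worker, daemon=True)
                t.start()  # with thousands of instances, this eventually
                           # raises RuntimeError: can't start new thread

        def _worker(self):
            while True:
                self._queue.get()  # idle forever; threads are never reclaimed

    managers = [ToyBlockManager() for _ in range(5000)]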

Solution:

The keep client needs to be shared (it should be already, but double-check).

The "get" and "put" thread pools should be shared (new behavior, maybe the thread pool moves to the Keep client).

We need to identify whether there are any resource leaks related to lingering connections; e.g., when FUSE evicts a collection from the cache, it should make sure the block manager is shut down.
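A sketch of the shared-pool direction (hypothetical names, not the actual Arvados Python SDK API): one lazily created, process-wide prefetch pool that every BlockManager submits to, so thread and connection counts stay bounded no matter how many collections are open.

    import threading
    from concurrent.futures import ThreadPoolExecutor

    _prefetch_pool = None
    _prefetch_lock = threading.Lock()

    def shared_prefetch_pool(max_workers=4):
        # One pool for the whole process instead of one per collection.
        global _prefetch_pool
        with _prefetch_lock:
            if _prefetch_pool is None:
                _prefetch_pool = ThreadPoolExecutor(max_workers=max_workers)
            return _prefetch_pool

    class BlockManagerSketch:
        def __init__(self, keep_client):
            self.keep_client = keep_client  # shared keep client instance

        def prefetch(self, locator):
            # Every collection funnels prefetches through the same bounded
            # pool, capping threads and keepstore connections process-wide.
            shared_prefetch_pool().submit(self.keep_client.get, locator)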


Files

makecollections.py (824 Bytes), Peter Amstutz, 06/16/2023 05:44 PM

Subtasks 1 (0 open, 1 closed)

Task #20657: Review 20637-prefetch-threads (Resolved) Peter Amstutz, 06/16/2023
#1

Updated by Peter Amstutz 11 months ago

  • Status changed from New to In Progress
#2

Updated by Peter Amstutz 11 months ago

  • Status changed from In Progress to New
  • Category set to Crunch
#3

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
#4

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
#5

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
#6

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
  • Subject changed from Watchdog for compute node keepstore to Large number of collections ties up all connections?
#7

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
#8

Updated by Peter Amstutz 11 months ago

  • Category changed from Crunch to FUSE
#9

Updated by Peter Amstutz 11 months ago

  • Assigned To set to Peter Amstutz
#10

Updated by Peter Amstutz 11 months ago

Script used to create a bunch of collections that contain a file with multiple blocks.
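(The attached script itself isn't reproduced here; below is a rough sketch of what such a script might look like with the Arvados Python SDK. The collection count and file size are assumptions; files larger than the 64 MiB default block size span multiple blocks.)

    import os
    import arvados.collection

    NUM_COLLECTIONS = 2500             # comfortably past the ~2000 mark above
    BLOCK = 64 * 1024 * 1024           # default Keep block size
    BLOCKS_PER_FILE = 3                # so each file spans multiple blocks

    for i in range(NUM_COLLECTIONS):
        coll = arvados.collection.Collection()
        with coll.open('data.bin', 'wb') as f:
            for _ in range(BLOCKS_PER_FILE):
                f.write(os.urandom(BLOCK))  # random data avoids Keep dedup
        coll.save_new(name='makecollections test %d' % i)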

#11

Updated by Peter Amstutz 11 months ago

  • Status changed from New to In Progress
#12

Updated by Peter Amstutz 11 months ago

20637-prefetch-threads @ 4e4c935d6fddb68997a50a382bff01c223dd00df

This avoids the problem of every Collection with its own BlockManager
creating its own prefetch thread pool, which becomes a resource leak
when reading files from 1000s of separate Collection objects.

The 'put' thread pool remains with the BlockManager, but the put
threads are now stopped in 'BlockManager.commit_all'. That method
always flushes pending blocks anyway, and is called before the
collection record is written to the API server -- so we can assume
we've just finished a batch of writes to that collection and might
not need the put thread pool any more; if we do, we can just make
a new one.

developer-run-tests: #3708

developer-run-tests-apps-workbench-integration: #4009
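A sketch of the stop-and-recreate pattern the branch describes (hypothetical names, not the actual arvfile.py code): the put pool is drained and torn down in commit_all, and lazily rebuilt if more writes arrive afterwards.

    from concurrent.futures import ThreadPoolExecutor

    class PutPoolSketch:
        def __init__(self, keep_client, max_workers=2):
            self.keep_client = keep_client
            self._max_workers = max_workers
            self._pool = None

        def put(self, block):
            if self._pool is None:
                # Lazily rebuild the pool after commit_all() tore it down.
                self._pool = ThreadPoolExecutor(max_workers=self._max_workers)
            return self._pool.submit(self.keep_client.put, block)

        def commit_all(self):
            if self._pool is not None:
                # Wait for every pending put, then release the threads: the
                # collection record is about to be written, so this batch of
                # writes is done.
                self._pool.shutdown(wait=True)
                self._pool = None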

#13

Updated by Tom Clegg 10 months ago

This LGTM, thanks.

#14

Updated by Peter Amstutz 10 months ago

  • Status changed from In Progress to Resolved
#15

Updated by Peter Amstutz 8 months ago

  • Release set to 66
