Bug #20637
Updated by Peter Amstutz about 1 year ago
A user container is getting FUSE errors. The error message in arv-mount.txt is "Failed to connect to 172.17.0.1 port 36323: Connection refused".
In addition, crunch-run.txt is reporting "error updating log collection: error recording logs: Could not write sufficient replicas ... dial tcp 172.17.0.1:36323 conne" (presumably "connection refused", but the message is truncated).
This is with a local keepstore running on the compute node. The keepstore service must have been working initially, because it was able to load the Docker image and write the initial log collection snapshot. Since then it has been unable to update the log collection, failing with the error above.
This suggests the keepstore service crashed. startLocalKeepstore uses the health check to determine when the service has started, but does not set up an ongoing watchdog to ensure the service continues to be available.
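To illustrate what an ongoing watchdog could look like, here is a minimal sketch. It is written in Python only for illustration and is not the crunch-run code; `start_watchdog`, `check`, and `on_down` are hypothetical names, and `check` stands in for whatever health check startLocalKeepstore already performs at startup.

```python
import threading


def start_watchdog(check, on_down, interval=1.0, max_failures=3):
    """Poll check() every `interval` seconds in a background thread.

    After `max_failures` consecutive failed checks, call on_down() once
    and stop polling. Returns an Event; set it to stop the watchdog.
    All names here are hypothetical, not part of any Arvados API.
    """
    stop = threading.Event()

    def loop():
        failures = 0
        # Event.wait() returns False on timeout, True once stop is set.
        while not stop.wait(interval):
            try:
                ok = bool(check())
            except Exception:
                # A check that raises counts as a failed check.
                ok = False
            failures = 0 if ok else failures + 1
            if failures >= max_failures:
                on_down()
                return

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

With something like this in place, a keepstore that dies mid-run would be reported (or restarted) by `on_down` instead of only showing up later as "connection refused" in the client logs.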
Is it possible some kind of connectivity issue could cause keepstore to quit?
Also, keepstore didn't log anything, which is mysterious. I seem to recall an issue a few months ago with the logging level being too quiet by default.
Update:
Ran the workflow again and looked into it.
There is a very large number of collections, several thousand.
arv-mount is reporting "can't start new thread"
And then later on, it starts reporting "Connection refused"
I think this is what is happening:
# Each collection gets its own instance of BlockManager
# Each BlockManager has its own pool of "put" threads and "get" threads (prefetch)
# If the maximum number of threads is ~4096 and there are more than 2000 collections, the process will eventually run out of threads
# Meanwhile, all those threads are making connections to keepstore
## They _should_ all be using the same Keep client, but this has not been confirmed
## If all the "user agents" are tied up, the client allocates a new one
# If all those threads and connections use HTTP keepalive, they eventually use up the ~4096 connections that keepstore allows by default, resulting in "connection refused" errors.
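The arithmetic behind this hypothesis can be sketched as follows. The per-collection thread counts are illustrative assumptions, not values read from the SDK, and `max_collections` is a hypothetical helper; the ~4096 limit is the approximate figure from the list above.

```python
def max_collections(thread_limit: int, put_threads: int, get_threads: int) -> int:
    """Return how many collections fit under a thread limit when every
    collection owns its own put and get thread pools."""
    per_collection = put_threads + get_threads
    if per_collection <= 0:
        raise ValueError("each collection must use at least one thread")
    return thread_limit // per_collection


# Assumed pool sizes, for illustration only.
print(max_collections(4096, put_threads=1, get_threads=1))  # 2048
print(max_collections(4096, put_threads=2, get_threads=2))  # 1024
```

Under these assumptions, a limit of ~4096 threads is exhausted somewhere between roughly 1000 and 2000 collections, which is consistent with a mount of several thousand collections hitting "can't start new thread".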
Solution:
The Keep client needs to be shared (it should be already, but double check).
The "get" and "put" thread pools should be shared (new behavior, maybe the thread pool moves to the Keep client).
Need to identify whether there are any resource leaks related to lingering connections.
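A minimal sketch of the shared thread pool idea, assuming the pools live in one process-wide object that every per-collection block manager borrows. `SharedPools` and `ToyBlockManager` are hypothetical stand-ins, not the real Arvados classes, and the submitted tasks are placeholders for actual Keep put and get requests.

```python
from concurrent.futures import ThreadPoolExecutor


class SharedPools:
    """Hypothetical holder for one put pool and one get pool per process."""

    def __init__(self, put_threads=2, get_threads=2):
        self.put = ThreadPoolExecutor(max_workers=put_threads,
                                      thread_name_prefix="keep-put")
        self.get = ThreadPoolExecutor(max_workers=get_threads,
                                      thread_name_prefix="keep-get")

    def shutdown(self):
        self.put.shutdown(wait=True)
        self.get.shutdown(wait=True)


class ToyBlockManager:
    """Stand-in for a per-collection block manager. It owns no threads;
    it only submits work to the shared pools."""

    def __init__(self, pools):
        self.pools = pools

    def put_block(self, data):
        # Placeholder task: a real implementation would write to Keep.
        return self.pools.put.submit(len, data)

    def prefetch(self, locator):
        # Placeholder task: a real implementation would fetch the block.
        return self.pools.get.submit(str.upper, locator)


pools = SharedPools()
managers = [ToyBlockManager(pools) for _ in range(3000)]
futures = [m.put_block(b"abc") for m in managers]
assert all(f.result() == 3 for f in futures)
pools.shutdown()
```

In this arrangement the thread count is bounded by the pool sizes regardless of how many collections are mounted, which is the property the proposed change is after. Whether the pools belong on the Keep client or elsewhere is left open here, as in the notes above.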