Bug #10584
closed
[FUSE] high memory consumption (possible leak) in long-running arv-mount
Added by Joshua Randall about 8 years ago.
Updated 9 months ago.
Description
We have a (little-used) arv-mount that has been running since 6th September.
It was started with the command line:
`/usr/bin/python2.7 /usr/bin/arv-mount /tmp/keep_jr17`
Since no `--file-cache` or `--directory-cache` options were given, those should have been at their defaults of 256 MiB and 128 MiB. If I start a new arv-mount, also with defaults, and then read some large data through it and exercise some large directories (such as running a find in `by_tag`), I can get memory usage up to 514 MB, which seems reasonable.
However, the arv-mount that has been running for the past 77 days is now taking up 15 GB of RAM!
I suspect this issue might be related to the increasing memory usage I observed and reported in #10535, when the Python SDK test suite got stuck in a tight PollClient loop forever (where "forever" is until it ran the system out of memory).
- Subject changed from "high memory consumption (possible leak) in long-running arv-mount" to "[FUSE] high memory consumption (possible leak) in long-running arv-mount"
- Target version set to Arvados Future Sprints
- Target version changed from Arvados Future Sprints to 2017-07-05 sprint
- Assigned To set to Peter Amstutz
Some theories:
- This might be related to https://dev.arvados.org/issues/11158: arv-mount tries to enumerate the user's entire home directory and uses up all memory trying to store the full contents.
- Cache management releases unused Collection objects. However, those Collection objects may have prefetch threads; if the threads aren't stopped, they keep the objects alive and leak (see the sketch after this list).
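Why an unstopped worker thread leaks the object it belongs to, as a minimal sketch (a hypothetical stand-in, not the real arvados Collection class):

```python
import threading
import time

class Collection(object):
    """Hypothetical stand-in for an SDK Collection with a prefetch worker."""

    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._prefetch_loop)
        self._thread.daemon = True
        self._thread.start()

    def _prefetch_loop(self):
        # The thread's target is a bound method, so the running thread holds
        # a reference to `self`: the Collection cannot be garbage collected
        # while this loop runs.
        while not self._stop.is_set():
            time.sleep(1)

    def stop_threads(self):
        self._stop.set()
        self._thread.join()

cache = {"inode": Collection()}
# Evicting from the cache without stop_threads() drops our reference, but
# the still-running prefetch thread pins the object in memory -- a leak.
del cache["inode"]
```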
- Target version changed from 2017-07-05 sprint to 2017-07-19 sprint
`10584-fuse-stop-threads`
Ensure get/put threads are stopped before releasing the reference to the Collection object. It is unclear whether this is the source of the problem, but it seems like a good idea regardless.
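Roughly the shape of the fix, reusing the `Collection` sketch above (illustrative, not the actual diff):

```python
class CollectionDirectory(object):
    """Stand-in for the arv-mount directory vnode that caches a Collection."""

    def __init__(self, collection):
        self.collection = collection

    def clear(self):
        # On cache eviction, stop get/put threads *before* dropping the
        # reference, so nothing keeps the Collection alive afterwards.
        if self.collection is not None:
            self.collection.stop_threads()
            self.collection = None

d = CollectionDirectory(Collection())
d.clear()  # the Collection is now unreferenced and can be garbage collected
```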
The thread-stopping code was added on a `CollectionDirectoryBase` subclass; is it possible for this problem to happen with `TmpCollectionDirectory` objects too? Maybe it would be better to do the thread stopping on `CollectionDirectoryBase`?
Lucas Di Pentima wrote:

> The thread-stopping code was added on a `CollectionDirectoryBase` subclass; is it possible for this problem to happen with `TmpCollectionDirectory` objects too? Maybe it would be better to do the thread stopping on `CollectionDirectoryBase`?
`CollectionDirectoryBase` objects are used to hold `Subcollection` objects, which don't have a `stop_threads()` method.

`TmpCollectionDirectory` objects are not candidates for cache eviction (`persisted()` is False), and their `finalize()` method already calls `stop_threads()`.

The difference between `clear()` and `finalize()` is that `clear()` is called when we want to evict an inode's cached contents, whereas `finalize()` is called when the inode will be deleted entirely.
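A condensed sketch of that distinction (illustrative names, not the actual vnode implementation):

```python
class DirectorySketch(object):
    """Illustrates the eviction-vs-deletion contract described above."""

    def __init__(self, collection):
        self.collection = collection

    def clear(self):
        # Evict cached contents to reclaim memory; the inode itself remains
        # and can repopulate from the API server on the next access.
        if self.collection is not None:
            self.collection.stop_threads()
            self.collection = None

    def finalize(self):
        # The inode is being deleted entirely; release everything it owns.
        self.clear()
```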
Ok, so this looks good to me. Thanks!
- Target version changed from 2017-07-19 sprint to 2017-08-02 sprint
- Look at user interaction history with Keep
- Track metrics
- Instrumentation to report memory usage / ownership (a minimal sketch follows this list)
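For the instrumentation item, a minimal standard-library sketch of periodic memory reporting (illustrative only, not something arv-mount ships):

```python
import resource
import sys
import threading

def report_memory(interval=60):
    """Log peak RSS periodically so growth over a long run is visible."""
    # ru_maxrss is in kilobytes on Linux (bytes on macOS).
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    sys.stderr.write("arv-mount peak RSS: %d KiB\n" % peak)
    timer = threading.Timer(interval, report_memory, args=(interval,))
    timer.daemon = True  # don't keep the process alive just for reporting
    timer.start()

report_memory()
```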
- Target version changed from 2017-08-02 sprint to 2017-08-16 sprint
- Target version changed from 2017-08-16 sprint to 2017-08-30 Sprint
- Assigned To deleted (Peter Amstutz)
- Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint
- Target version changed from 2017-09-13 Sprint to Arvados Future Sprints
Might be worth running the "retry PUT" test many times in a row. At least once I've seen the test suite get stuck there using lots of memory.
- Target version deleted (Arvados Future Sprints)
- Target version set to Future
- Target version deleted (Future)
- Status changed from New to Closed
Closing as out of date: recent improvements to arv-mount have reduced its memory usage over time.