Bug #10584

[FUSE] high memory consumption (possible leak) in long-running arv-mount

Added by Joshua Randall 8 months ago. Updated 5 days ago.

Status: New
Start date: 11/22/2016
Priority: Normal
Due date:
Assignee: Peter Amstutz
% Done: 100%
Category: -
Target version: 2017-08-02 sprint
Story points: -
Remaining (hours): 0.00 hour
Velocity based estimate: -

Description

We have a (little-used) arv-mount that has been running since 6th September.

It was started with the command line:
`/usr/bin/python2.7 /usr/bin/arv-mount /tmp/keep_jr17`

Since no `--file-cache` or `--directory-cache` options were given, those should be at their defaults of 256 MiB and 128 MiB respectively. If I start a new arv-mount, also with defaults, then read some large data through it and exercise some large directories (for example, running a find in `by_tag`), I can get memory usage up to 514 MB, which seems reasonable.
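For reference, the same mount with those defaults spelled out would look something like the line below (assuming both options take a size in bytes; 268435456 = 256 MiB, 134217728 = 128 MiB):

`/usr/bin/python2.7 /usr/bin/arv-mount --file-cache 268435456 --directory-cache 134217728 /tmp/keep_jr17`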

However, the arv-mount that has been running for the past 77 days is now taking up 15GB of RAM!

I suspect this issue might be related to the increasing memory usage I observed and reported in #10535, where the Python SDK test suite got stuck in a tight PollClient loop forever ("forever" meaning until it ran the system out of memory).


Subtasks

Task #11890: Review 10584-fuse-stop-threads (Resolved; Peter Amstutz)

Associated revisions

Revision 1e31815d
Added by Peter Amstutz 17 days ago

Merge branch '10584-fuse-stop-threads' refs #10584

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Tom Morris 5 months ago

  • Subject changed from "high memory consumption (possible leak) in long-running arv-mount" to "[FUSE] high memory consumption (possible leak) in long-running arv-mount"
  • Target version set to Arvados Future Sprints

#2 Updated by Tom Morris about 1 month ago

  • Target version changed from Arvados Future Sprints to 2017-07-05 sprint

#3 Updated by Peter Amstutz about 1 month ago

  • Assignee set to Peter Amstutz

#4 Updated by Peter Amstutz 22 days ago

Some theories:

  • This might be related to https://dev.arvados.org/issues/11158: arv-mount tries to enumerate the user's entire home directory and uses up all available memory storing the full listing.
  • Cache management releases unused Collection objects. However, those Collection objects may have prefetch threads; if those threads are not stopped, the objects leak (see the sketch below).
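To illustrate the second theory, here is a minimal plain-Python sketch (not Arvados code; `Payload` merely stands in for a cached Collection with a prefetch thread): an object whose background thread is still running cannot be garbage collected, because the running thread holds a reference to it, and stopping the thread releases the object.

```python
# Minimal sketch of the suspected leak mechanism (not Arvados code).
import gc
import threading
import weakref

class Payload(object):
    """Stand-in for a cached Collection with a background thread."""
    def __init__(self):
        self._stop = threading.Event()
        # The worker target is a bound method, so the running thread
        # holds a strong reference to this object as long as it runs.
        self._worker = threading.Thread(target=self._run)
        self._worker.daemon = True
        self._worker.start()

    def _run(self):
        while not self._stop.wait(0.1):
            pass  # stands in for prefetch/get/put work

    def stop_threads(self):
        self._stop.set()
        self._worker.join()

obj = Payload()
ref = weakref.ref(obj)
del obj
gc.collect()
print(ref() is None)   # False: the live thread still pins the object

ref().stop_threads()   # stopping the thread drops that reference
gc.collect()
print(ref() is None)   # True: the object can now be collected
```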

#5 Updated by Peter Amstutz 19 days ago

  • Target version changed from 2017-07-05 sprint to 2017-07-19 sprint

#6 Updated by Peter Amstutz 19 days ago

10584-fuse-stop-threads

Ensure get/put threads are stopped before releasing the reference to the Collection object. It is unclear whether this is the source of the problem, but it seems like a good idea regardless.
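A rough sketch of the shape of that change (class and method names are the ones mentioned in this thread; this is an illustration with a stub base class, not the actual diff, which is in the branch above):

```python
class CollectionDirectoryBase(object):  # stub for illustration
    def clear(self):
        pass

class CollectionDirectory(CollectionDirectoryBase):
    """Hedged sketch of the cache-eviction path, not the real code."""
    def __init__(self, collection):
        self.collection = collection

    def clear(self):
        # clear() evicts this inode's cached contents; the inode survives.
        if self.collection is not None:
            # Stop get/put threads before dropping the reference;
            # otherwise a running thread keeps the Collection alive.
            self.collection.stop_threads()
            self.collection = None
        super(CollectionDirectory, self).clear()
```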

#7 Updated by Lucas Di Pentima 18 days ago

The thread-stopping code was added on a CollectionDirectoryBase subclass; is it possible for this problem to happen with TmpCollectionDirectory objects too? Maybe it would be better to do the thread stopping on CollectionDirectoryBase itself?

#8 Updated by Peter Amstutz 18 days ago

Lucas Di Pentima wrote:

> The thread-stopping code was added on a CollectionDirectoryBase subclass; is it possible for this problem to happen with TmpCollectionDirectory objects too? Maybe it would be better to do the thread stopping on CollectionDirectoryBase itself?

CollectionDirectoryBase objects are used to hold Subcollection objects, which don't have a stop_threads() method.

TmpCollectionDirectory objects are not candidates for cache eviction (their persisted() is False), and their finalize() method already calls stop_threads().

The difference between clear() and finalize() is that clear() is called when we want to evict an inode's cached contents, whereas finalize() is called when the inode will be deleted entirely.
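Put as code, that lifecycle contract looks roughly like this (a hypothetical base class for illustration; only the method names come from this thread):

```python
class DirectoryBase(object):  # hypothetical, for illustration only
    def persisted(self):
        # True if the contents can be re-fetched after eviction.
        # Objects returning False (e.g. TmpCollectionDirectory) are
        # never evicted from the cache, so clear() does not apply.
        return True

    def clear(self):
        # Evict this inode's cached contents. The inode itself stays
        # valid and the contents can be reloaded on demand later.
        pass

    def finalize(self):
        # The inode is being deleted entirely: release everything,
        # including any background threads (stop_threads()).
        pass
```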

#9 Updated by Lucas Di Pentima 17 days ago

Ok, so this looks good to me. Thanks!

#10 Updated by Tom Morris 5 days ago

  • Target version changed from 2017-07-19 sprint to 2017-08-02 sprint
