Bug #10629
closed[FUSE] slow enumerating files by collection uuid
Description
It takes 38 minutes (!!) to enumerate 242K files. During this time the arv-mount process is pegged at 100% CPU and no network traffic is being done. The manifest text totals just 9MB (9243914 char), so it's taking over 4 min / MB to parse this.
$ time find keep/by_id/e51c5-4zz18-l3dq8bw20uwz0qd -print | wc -l
241751
real 38m0.969s
user 0m0.224s
sys 0m0.300s
$ wc *.manifest
2497 252893 9243914 e51c5-4zz18-l3dq8bw20uwz0qd.manifest
Updated by Tom Clegg about 8 years ago
- debug trace - which / how many fuse operations per file
- make a test manifest big enough to exhibit slowness (10K files?), and try
- squishing dir hierarchy (is slowness related to dir depth?)
- all files in one dir (is slowness related to # files per dir?)
- double the # files and see how that affects timing (is it O(N) or O(N^2)?)
Updated by Tom Morris almost 8 years ago
- Assigned To set to Peter Amstutz
- Target version changed from Arvados Future Sprints to 2017-03-01 sprint
Updated by Peter Amstutz almost 8 years ago
peteramstutz@shell:~$ time find keep/by_id/83325435ac6cf1a851f4e1aadf4df0e3+8675570 -print | wc -l 241751 real 0m3.723s user 0m0.124s sys 0m0.180s
So the problem is that there is different behavior for collections accessed by UUID vs. by PDH. It seems to be doing some expensive synchronization operation which is elided for PDH (which is immutable).
Updated by Peter Amstutz almost 8 years ago
- Subject changed from arv-mount pathologically slow enumerating files to [FUSE] slow enumerating files by collection uuid
Updated by Peter Amstutz almost 8 years ago
- Status changed from New to In Progress
class Handle(object): """Connects a numeric file handle to a File or Directory object that has been opened by the client.""" def flush(self): if self.obj.writable(): return self.obj.flush()
Several problems here.
- Opendir and releasedir are only ever used to get the directory listing via readdir(). Because a directory handle isn't used to modify the directory, calling flush() is spurious.
- If the Operations() object was created with enable_write=False, calling flush() is spurious.
- The CollectionDirectory object is considered "writable" despite enable_write=False
- Finally, computing committed() (to decide whether to actually send and updated manifest to the server) checks the _committed flag on every object. When there are 240000 files, that is expensive (especially because it makes a function call and increments/decrements a recursive mutex at each node.)
Todo:
- Fix set_committed() to accept True or False and propagate the flag up or down accordingly. Change committed() to only test the local flag.
- Don't flush directory handles at all.
Updated by Peter Amstutz almost 8 years ago
10629-fuse-listing-perf ready for review.
Updated by Lucas Di Pentima almost 8 years ago
LGTM.
Local sdk/python
and services/fuse
tests ran without issues.
Tried to do some benchmarking using arvbox but wasn't able to start it, I don't want to stall this review any longer, if you have timings after the fix it would be nice to have the comparison here.
Updated by Peter Amstutz almost 8 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:9b2aa42213a7d333bbe93e040c2d152a70e9b5af.