[FUSE] slow enumerating files by collection uuid
|Assignee:||Peter Amstutz||% Done:|
|Target version:||2017-03-01 sprint|
|Story points||1.0||Remaining (hours)||0.00 hour|
|Velocity based estimate||0 days|
It takes 38 minutes (!!) to enumerate 242K files. During this time the arv-mount process is pegged at 100% CPU and there is no network traffic. The manifest text totals just 9MB (9243914 characters), so it's taking over 4 minutes per MB to parse this.
$ time find keep/by_id/e51c5-4zz18-l3dq8bw20uwz0qd -print | wc -l
$ wc *.manifest
2497 252893 9243914 e51c5-4zz18-l3dq8bw20uwz0qd.manifest
- debug trace - which / how many fuse operations per file
- make a test manifest big enough to exhibit slowness (10K files?), and try
  - squishing dir hierarchy (is slowness related to dir depth?)
  - all files in one dir (is slowness related to # files per dir?)
  - double the # files and see how that affects timing (is it O(N) or O(N^2)?)
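A rough sketch of how the test manifests for the steps above could be generated. It assumes the usual manifest text layout (one line per stream: stream name, block locators, then `position:size:filename` tokens) and uses only zero-length files so no real data has to exist in Keep. The function names, directory names, and file names are made up for illustration.

```python
import math
import posixpath

# md5 of the empty string with a size hint of 0: a locator for a zero-length
# block, so every synthetic file can be empty.
EMPTY_BLOCK = "d41d8cd98f00b204e9800998ecf8427e+0"


def make_manifest(num_files, files_per_dir, depth=1):
    """Return manifest text describing num_files empty files (hypothetical helper).

    Each stream (directory) holds up to files_per_dir files and sits `depth`
    levels below the collection root. Use files_per_dir=num_files to put
    everything in one directory, or raise depth to test deep hierarchies.
    """
    if num_files < 0 or files_per_dir < 1 or depth < 1:
        raise ValueError("need num_files >= 0, files_per_dir >= 1, depth >= 1")
    lines = []
    for start in range(0, num_files, files_per_dir):
        dir_index = start // files_per_dir
        parts = ["d%06d" % dir_index] + ["sub%d" % i for i in range(1, depth)]
        stream = posixpath.join(".", *parts)
        end = min(start + files_per_dir, num_files)
        # Every file is zero bytes long, starting at offset 0 of the stream.
        tokens = ["0:0:f%06d.txt" % n for n in range(start, end)]
        lines.append(" ".join([stream, EMPTY_BLOCK] + tokens) + "\n")
    return "".join(lines)


def scaling_exponent(n1, t1, n2, t2):
    """Estimate k in time ~ N**k from two (file count, seconds) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)
```

For the doubling experiment, an exponent near 1 from `scaling_exponent(10000, t_10k, 20000, t_20k)` would suggest O(N), and a value near 2 would suggest O(N^2).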
#4 Updated by Peter Amstutz 8 months ago
peteramstutz@shell:~$ time find keep/by_id/83325435ac6cf1a851f4e1aadf4df0e3+8675570 -print | wc -l
241751
real 0m3.723s
user 0m0.124s
sys 0m0.180s
So the problem is that collections behave differently when accessed by UUID vs. by PDH (portable data hash). Access by UUID seems to be doing some expensive synchronization operation which is elided for PDH (which is immutable).
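To compare the two access paths side by side, something like the following could be used. It is only a sketch: `count_entries` and `time_enumeration` are made-up helper names, and `os.walk` may issue a different mix of FUSE operations than `find` does, so the absolute numbers are not directly comparable with the timings above.

```python
import os
import time


def count_entries(root):
    """Count root plus everything beneath it, like `find root -print | wc -l`."""
    total = 1  # find also prints the starting point itself
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def time_enumeration(root):
    """Return (entry count, elapsed wall-clock seconds) for one full walk."""
    start = time.perf_counter()
    count = count_entries(root)
    return count, time.perf_counter() - start
```

Calling `time_enumeration()` once on the `keep/by_id/<uuid>` path and once on the `keep/by_id/<pdh>` path of the same collection should return the same count for both, with the difference showing up only in the elapsed time.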
#6 Updated by Peter Amstutz 8 months ago
- Status changed from New to In Progress
class Handle(object):
    """Connects a numeric file handle to a File or Directory object that has been opened by the client."""

    def flush(self):
        if self.obj.writable():
            return self.obj.flush()
Several problems here.
- Opendir and releasedir are only ever used to get the directory listing via readdir(). Because a directory handle isn't used to modify the directory, calling flush() is spurious.
- If the Operations() object was created with enable_write=False, calling flush() is spurious.
- The CollectionDirectory object is considered "writable" despite enable_write=False.
- Finally, computing committed() (to decide whether to actually send an updated manifest to the server) checks the _committed flag on every object. When there are 240000 files, that is expensive (especially because it makes a function call and increments/decrements a recursive mutex at each node).
- Fix set_committed() to accept True or False and propagate the flag up or down accordingly. Change committed() to only test the local flag.
- Don't flush directory handles at all.
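A minimal sketch of the flag propagation idea, using a hypothetical `Node` class rather than the real collection classes (locking is left out as well). Marking a node dirty walks up to the root, marking it committed walks down to the leaves, and `committed()` then only has to look at the local flag. `committed_recursive()` is included only to show the kind of full-tree check being replaced.

```python
class Node(object):
    """Hypothetical stand-in for a file or directory in a collection tree."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self._committed = True
        if parent is not None:
            parent.children.append(self)

    def committed_recursive(self):
        """Full-tree check: visits every descendant on each call, O(N)."""
        return self._committed and all(
            child.committed_recursive() for child in self.children)

    def set_committed(self, value=True):
        """Set the flag and propagate it so ancestors never need to search."""
        self._committed = value
        if value:
            # A committed node implies everything below it is committed.
            for child in self.children:
                child.set_committed(True)
        elif self.parent is not None:
            # A modified node makes every ancestor uncommitted.
            self.parent.set_committed(False)

    def committed(self):
        """Constant-time check: only the local flag is tested."""
        return self._committed
```

With this arrangement the cost moves to the (rare) modification path, where it is proportional to the depth of the tree, instead of being paid across all 240000 nodes every time a handle is released.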
#8 Updated by Lucas Di Pentima 8 months ago
services/fuse tests ran without issues.
I tried to do some benchmarking using arvbox but wasn't able to start it. I don't want to stall this review any longer, but if you have timings from after the fix, it would be nice to have the comparison here.