[FUSE] arv-mount is too slow for strelka
Strelka seems to be CPU bound. When running strelka without using arv-mount, it will fully utilize all 16 cores at 100%:
When we are using arv-mount, even with 4 threads, full CPU usage is not used:
#1 Updated by Peter Amstutz almost 4 years ago
I think there's two separate issues here.
The first is that there is definitely cache thrashing going on. The default arv-mount cache is 256 MiB (4 blocks) which means with even 4 readers it's easy to spill out a block that another reader is going to use again. Block prefetching (intended to optimize long sequential reads) may also causes unnecessary cache spills.
The second problem is that arv-mount is CPU bound. We haven't diagnosed precisely where the time goes, but it is likely that there is overhead from marshalling between Python and C code, lock contention due to the GIL and the llfuse lock, and the constant factor performance penalties of using a interpreted language are multiplied over many reads:
- It 16384 4k reads to read a 64 MiB block
- If it takes 2 ms to service a read in Python and 1 ms to service read in a compiled language, then it will take 4 seconds to read the block in a compiled language and 8 seconds to read the block in Python. Over 16384 reads that's an extra 4 seconds which may the same or longer than it took to fetch the block over the network in the first place!
#2 Updated by Tom Clegg almost 4 years ago
I agree, clearly we're seeing cache thrashing. In this case, it seems to me the overhead of servicing 4k reads can easily be dwarfed by the cost of cache misses.Python+fuse overhead:
- Say each 4 KiB read takes 2ms, with 4 concurrent readers.
- 64 MiB block = 16384 4 KiB reads; 16384 * 2ms per read = 32s per block read by caller
- (we have seen FUSE do better than 128 MiB/s; was that doing 256K reads * 2ms each? If it was 4K reads, we can service a read in 30us. Surely Python isn't so bad at threading that 4 threads do 1/16 as much total work as 1 thread?)
- Say each reader reads 256 KiB (64 * 4k reads) before the relevant 64 MiB block gets ejected from cache in order to serve one of the other readers.
- Say we read from Keep at a total of 256 MiB/s = 64 MiB/s * 4 readers.
- 64 MiB block = 256 cache misses; 64 MiB * 256 network transfers = 16 GiB network traffic = 256s per block read by caller
- 64 MiB block = 16384 cache misses; 64 MiB * 16384 network transfers = 1 TiB network traffic = 4096s per block read by caller
arv-mount statistics (#3137) should make it easy to see how much network traffic a job is generating per byte read by callers. That should help point us in the right direction.
Supporting partial-block reads all the way down the stack (#3734 plus clients) is the only way I can see to support this kind of workload effectively without consuming excessive RAM on the client end. With a bit of client logic and fetching each block a total of 2x, we can still do client-side data integrity checking.
Until then: use a bigger cache, or fewer concurrent readers per arv-mount process.
#3 Updated by Peter Amstutz almost 4 years ago
I suggest collecting some hard data. We should compare times for transferring a single medium sized file (5-10 GiB) to a crunch node by various methods:
- shell 'cp' from mount
- Python shutil.copy from mount
- A Python loop that copies from a CollectionReader() to an open file
- A Go program similar to arv-get that uses the new CollectionFileReader in the Go SDK
#4 Updated by Peter Amstutz almost 4 years ago
- I never disagreed that cache misses dominate over single-read overhead. However, I argue that when cache misses are not the problem, the overhead on individual reads is dominating over the time spent actually fetching the block. E.g. we might download the block is 500ms but spend 2000ms servicing reads, which means throughput is CPU limited and not I/O limited. That's unfortunate.
- The largest read size supported by FUSE is 128k (enabled with a special option, which we set), but in practice it seems lots of applications only read 4k at a time.
- It's entirely possible the code is doing something stupid that can be optimized, but that requires investing time into profiling with uncertain payoff (since there are also known bottlenecks we can't do anything about).