Bug #6762
Status: open
[FUSE][Performance] Job to md5sum 1TiB of 1MiB files in parallel is very slow
Description
I made a 'stressTest' pipeline to test the performance of the keep mounts.
| pipeline | su92l-d1hrv-sk7cyv84p16nw9d |
| time | 22h 13m |
| input collection | 69414010a3d0f286ad6eb5a578801aa1+11278592 |
| input collection size | 961 GiB |
| nodes | 16 |
| cores per node | 8 |
The input collection has about 400 directories. Each directory has about 850 files, each around 1 MiB. The files should have been uploaded in order, so files within the same directory should be in the same block or in nearby blocks (though I haven't confirmed this).
The pipeline spawns one task per directory; each task walks the files in its directory and takes their md5sums. The md5sums are written to an output file, which is then stored as the output collection.
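For concreteness, here is a minimal sketch of what each task does (the function name and paths are hypothetical, not the actual crunch script): walk one directory of the FUSE-mounted input collection and write one md5sum per file.

    import hashlib
    import os
    import sys

    def md5sum_directory(mount_dir, out_path, chunk_size=1 << 20):
        # Walk one directory of the mounted collection and md5sum each file,
        # reading in 1 MiB chunks so memory use stays flat.
        with open(out_path, "w") as out:
            for name in sorted(os.listdir(mount_dir)):
                path = os.path.join(mount_dir, name)
                if not os.path.isfile(path):
                    continue
                digest = hashlib.md5()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(chunk_size), b""):
                        digest.update(chunk)
                out.write("%s  %s\n" % (digest.hexdigest(), name))

    if __name__ == "__main__":
        # e.g. md5sum_task.py /keep/<collection>/<dir> <dir>.md5
        md5sum_directory(sys.argv[1], sys.argv[2])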
From the above you can see that the 'throughput' for reading from keep is:
((961 GiB) * (1024 MiB/GiB)) / ((22 hours) * (60 minutes/hour) * (60 seconds/minute)) / (128 tasks)

The first part works out to roughly 12.4 MiB/s aggregate for the whole cluster; dividing by the 128 concurrent tasks (16 nodes * 8 cores) gives approximately 0.1 MiB/s, or about 100 KiB/s, per task.
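The same arithmetic in a few lines of Python, for anyone who wants to check the numbers:

    # Reproduces the throughput arithmetic above.
    total_mib = 961 * 1024            # 961 GiB expressed in MiB
    seconds = 22 * 60 * 60            # 22 hours (ignoring the extra 13 minutes)
    tasks = 16 * 8                    # 16 nodes * 8 cores = 128 concurrent tasks

    aggregate = total_mib / float(seconds)   # ~12.4 MiB/s for the whole cluster
    per_task = aggregate / tasks             # ~0.097 MiB/s, i.e. ~100 KiB/s
    print("aggregate: %.1f MiB/s, per task: %.3f MiB/s" % (aggregate, per_task))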
From various discussions, two main bottlenecks have been hypothesized:
- A "hot" keep server holds all the relevant blocks and therefore receives all the requests
- The Python SDK is slow
As of this writing, it looks like the Python SDK (which the FUSE mount uses in one way or another) reads at around 10-20 MiB/s instead of the theoretical 260 MiB/s. Of the 9 current keep servers on su92l, only 1 is thought to hold this collection's blocks, which would mean the load isn't distributed.
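For reference, one rough way to reproduce that single-stream figure is to time a sequential read of a file through the FUSE mount; the helper and path below are hypothetical, not part of the pipeline:

    import os
    import time

    def read_mib_per_s(path, chunk_size=1 << 20):
        # Time a sequential read of `path` through the keep mount, return MiB/s.
        size = os.path.getsize(path)
        start = time.time()
        with open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return (size / float(1 << 20)) / (time.time() - start)

    # Hypothetical path into the mounted input collection:
    # print(read_mib_per_s("/keep/69414010a3d0f286ad6eb5a578801aa1+11278592/some_dir/some_file"))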
As an aside, even if none of the ~1 MiB files were near each other within a block, and we assumed each file request could be satisfied in 1 second (which could be done at, say, 10 MiB/s keep performance), the roughly 433 * 864 file requests spread over 128 parallel tasks would give 433 * 864 / (128 * 60) ~ 49 minutes, not the 22 hours shown.
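The same back-of-envelope check, spelled out:

    # Back-of-envelope check from the aside above.
    files = 433 * 864                 # ~directories * ~files per directory
    requests_per_minute = 128 * 60    # 128 parallel tasks, one 1-second request each
    print("%.0f minutes" % (files / float(requests_per_minute)))   # ~49 minutes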