Bug #6762
[FUSE][Performance] Job to md5sum 1TiB of 1MiB files in parallel is very slow
Status: open
Description
I made a 'stressTest' pipeline to test the performance of the keep mounts.
| pipeline | su92l-d1hrv-sk7cyv84p16nw9d |
| time | 22h 13m |
| input collection | 69414010a3d0f286ad6eb5a578801aa1+11278592 |
| input collection size | 961G |
| nodes | 16 |
| cores per node | 8 |
The input collection has about 400 directories, each containing about 850 files of roughly 1 MiB apiece. The files should have been uploaded in order, so files in the same directory should be in the same block or a nearby one (though I haven't confirmed this).
The pipeline spawns one task per directory; each task walks the files in its directory and computes their md5sums, which are written to an output file that is then stored as the output collection.
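For concreteness, here is a minimal sketch of what one task's work amounts to, assuming the input collection is visible through an arv-mount FUSE mount; the directory path and output filename are illustrative, not taken from the actual job script:

```python
# Minimal sketch of one task: walk a directory on the keep mount and
# md5sum each file.  dir_path would be something like
# /keep/<collection>/<subdir>; md5sums.txt is an illustrative name.
import hashlib
import os

def md5sum_directory(dir_path, output_path='md5sums.txt'):
    with open(output_path, 'w') as out:
        for name in sorted(os.listdir(dir_path)):
            md5 = hashlib.md5()
            with open(os.path.join(dir_path, name), 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    md5.update(chunk)
            out.write('%s  %s\n' % (md5.hexdigest(), name))
```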
From the above you can see that the 'throughput' for reading from keep is:
((961 GiB) * (1024 MiB/GiB)) / ((22 Hours) * (60 Minutes/Hour) * (60 Seconds/Minute)) / (128 Tasks)
which comes to approximately 0.1 MiB/s per task, or about 100 KiB/s per task.
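The same arithmetic spelled out, reproducing the figures quoted above:

```python
# Per-task throughput from the numbers above.
total_mib = 961 * 1024        # 961 GiB expressed in MiB
elapsed_s = 22 * 60 * 60      # ~22 hours in seconds (ignoring the extra 13m)
tasks = 128                   # 16 nodes * 8 cores

aggregate_mib_s = total_mib / float(elapsed_s)   # ~12.4 MiB/s for the whole job
per_task_mib_s = aggregate_mib_s / tasks         # ~0.097 MiB/s, i.e. ~100 KiB/s per task
print(aggregate_mib_s, per_task_mib_s)
```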
From various discussions, two main bottlenecks have been hypothesized:
- "Hot" keep server holds all relevant blocks and gets all requests
- Python SDK slow
As of this writing, it looks like the Python SDK (which the fuse mount relies on for its keep reads) gets around 10-20 MiB/s instead of the theoretical 260 MiB/s. Of the 9 current keep servers on su92l, only 1 is thought to hold this collection's blocks, which would mean the load isn't distributed.
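One way to sanity-check the 10-20 MiB/s figure is to time sequential reads through the FUSE mount from a single process; this is only a rough sketch, and the mount path is illustrative:

```python
# Rough single-process read-throughput check against the keep mount.
# dir_path would be something like /keep/<collection>/<subdir>.
import os
import time

def read_mib_per_s(dir_path, max_files=50):
    total_bytes = 0
    start = time.time()
    for name in sorted(os.listdir(dir_path))[:max_files]:
        with open(os.path.join(dir_path, name), 'rb') as f:
            while True:
                chunk = f.read(1 << 20)
                if not chunk:
                    break
                total_bytes += len(chunk)
    return (total_bytes / float(1 << 20)) / (time.time() - start)
```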
As an aside: even if the ~1 MiB files were not near each other within blocks, and we assumed each file request could be satisfied in 1s (achievable at, say, 10 MiB/s keep performance), that would give 433*864/(128*60) ≈ 49 minutes, not the 22 hours shown.
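For reference, the optimistic estimate spelled out:

```python
# Back-of-the-envelope estimate from the aside above: ~433 directories of
# ~864 files, one second per file, spread over 128 parallel tasks.
files = 433 * 864
tasks = 128
minutes = files / float(tasks) / 60   # ~48.7, i.e. roughly 49 minutes
print(minutes)
```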
Updated by Tom Clegg over 9 years ago
- perhaps the job is pathological (filenames sorted differently when reading than when writing, e.g., 1,2,3 on the way in, 1,10,100,11 on the way out? see the illustration below)
- perhaps the python SDK block cache is buggy?
- perhaps we need #3640
Keepstore disk → 128 clients should certainly sustain way more than 12.8 MiB/s on su92l.
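To illustrate the pathological-ordering hypothesis from the first bullet: filenames written in numeric order come back in lexicographic order from a naive directory listing, so sequential reads no longer follow the upload (and block) order:

```python
# Illustration of numeric write order vs. lexicographic read order.
names = [str(i) for i in range(1, 13)]
print(names)            # write order: 1, 2, 3, ..., 12
print(sorted(names))    # read order:  1, 10, 11, 12, 2, 3, ...
```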
Updated by Brett Smith over 9 years ago
This is a very noisy experiment, enough so that I wonder how much it tells us about any one component.
The job is a shell script that calls a second shell script that's basically a utility wrapper around the Python SDK. Here's a list of all the shell commands in the job script that make API requests, and what those are:
- arv-dax setup: Get the current task
- arv-dax script_parameters: Get the current job
- arv-dax task sequence: Get the current task
- arv-dax task parameters: Get the current task
- find: Get the 11MiB stress directory collection (via FUSE)
- arv-dax task finish: Create a collection; get the current task; update the current task
You could literally go from eight API calls to four just by fetching the current task once, and fetching data from its individual fields as needed.
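A hedged sketch of that consolidation, assuming the Python SDK's discovery-based client and the crunch v1 jobs/job_tasks resources; the environment variable names here are illustrative placeholders:

```python
# Sketch: fetch the task and job records once, then read fields locally
# instead of making one wrapper call (and API request) per field.
# TASK_UUID / JOB_UUID are illustrative placeholders.
import os
import arvados

api = arvados.api('v1')
task = api.job_tasks().get(uuid=os.environ['TASK_UUID']).execute()
job = api.jobs().get(uuid=os.environ['JOB_UUID']).execute()

sequence = task['sequence']                   # replaces `arv-dax task sequence`
task_params = task['parameters']              # replaces `arv-dax task parameters`
script_params = job['script_parameters']      # replaces `arv-dax script_parameters`
```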
It's also possible that the discovery document isn't being cached between these commands. If that's the case, then the number of API requests is almost doubled: each item in the list above needs to fetch the discovery document, then make its request(s).
The job has 16 * 8 == 128 tasks in flight simultaneously most of the time. We already know that API server response times degrade noticeably as request parallelism goes up; see #5901. I think it's very possible that the API server is at least as much your problem as Keep itself, if not more.
I don't deny that people are going to write job scripts like this, or that this job should run faster. But from this test alone, I think it's premature to conclude that there's a major problem with Keep performance.
Updated by Brett Smith over 9 years ago
- Subject changed from Keep performance is very slow to [Performance] Job to md5sum 1TiB of 1MiB files in parallel is very slow
Updated by Brett Smith over 9 years ago
Brett Smith wrote:
It's also possible that the discovery document isn't being cached between these commands. If that's the case, then the number of API requests is almost doubled: each item in the list above needs to fetch the discovery document, then make its request(s).
This was bad speculation. crunch-job sets $HOME to be the same as $TASK_WORK, so even a job running inside a Docker image has a writable place for the discovery document cache.
Updated by Tom Morris almost 8 years ago
- Subject changed from [Performance] Job to md5sum 1TiB of 1MiB files in parallel is very slow to [FUSE][Performance] Job to md5sum 1TiB of 1MiB files in parallel is very slow
- Category set to Performance
- Target version set to Arvados Future Sprints
Updated by Ward Vandewege over 3 years ago
- Target version deleted (Arvados Future Sprints)