Bug #6762

[FUSE][Performance] Job to md5sum 1TiB of 1MiB files in parallel is very slow

Added by Abram Connelly about 4 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Performance
Target version:
Start date:
07/23/2015
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

I made a 'stressTest' pipeline to test the performance of the Keep mounts.

pipeline: su92l-d1hrv-sk7cyv84p16nw9d
time: 22h 13m
input collection: 69414010a3d0f286ad6eb5a578801aa1+11278592
input collection size: 961 GiB
nodes: 16
cores per node: 8

The input collection has about 400 directories. Each directory has about 850 files, each around 1 MiB. The files should have been uploaded in order, so files in the same directory should be in the same block or in nearby blocks (though I haven't confirmed this).

The pipeline spawns one task per directory and each task walks the files in the directory taking their md5sum. The md5sums are written to an output file which is then stored as the output collection.

From the above you can see that the throughput for reading from Keep is:

((961 GiB) * (1024 MiB/GiB)) / ((22 Hours) * (60 Minutes/Hour) * (60 Seconds/Minute)) / (128 Tasks)

Which comes to approximately 0.1 MiB/s (about 100 KiB/s) per task.
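The arithmetic can be checked in a few lines of Python (rounding the 22h 13m runtime down to 22 hours, as in the formula above):

```python
# Back-of-envelope throughput for the job described above.
# Inputs come straight from the pipeline stats: 961 GiB read,
# ~22 hours wall time, 16 nodes * 8 cores = 128 concurrent tasks.

total_mib = 961 * 1024      # collection size in MiB
seconds = 22 * 60 * 60      # wall-clock time in seconds
tasks = 16 * 8              # concurrent tasks

aggregate = total_mib / seconds     # MiB/s across the whole job
per_task = aggregate / tasks        # MiB/s per task

print(f"aggregate: {aggregate:.1f} MiB/s")        # ~12.4 MiB/s
print(f"per task:  {per_task:.3f} MiB/s "
      f"(~{per_task * 1024:.0f} KiB/s)")          # ~0.097 MiB/s, ~99 KiB/s
```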

From various discussions there are hypothesized to be two main bottlenecks:

  • "Hot" keep server holds all relevant blocks and gets all requests
  • The Python SDK is slow

As of this writing, it looks like the Python SDK (which the FUSE mount uses in one way or another) gets around 10-20 MiB/s instead of the theoretical 260 MiB/s. Of the 9 current Keep servers on su92l, only 1 is thought to hold this collection's blocks, which would mean the load isn't distributed.

As an aside, even if the ~1 MiB files were not stored near each other block-wise, and we assumed each file request could be satisfied in 1 s (achievable at, say, 10 MiB/s Keep performance), that would give 433*864/(128*60) ≈ 49 minutes, not the 22 hours shown.
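Spelling that worst case out (the 433-directory and 864-file counts are the figures used in the estimate above):

```python
# Worst-case sketch: one request per file, each taking a full second,
# spread over 128 parallel tasks.
dirs, files_per_dir, tasks = 433, 864, 128
total_requests = dirs * files_per_dir       # ~374k file reads
minutes = total_requests / tasks / 60       # serialized 1 s requests per task
print(f"{minutes:.0f} minutes")             # ~49 minutes
```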

History

#1 Updated by Tom Clegg about 4 years ago

We should figure out how many times each block is fetched during this job. For example, if the data is being read in an inconvenient order and the Python SDK block cache isn't doing well, it's conceivable each block is being transmitted over the network (say) 20 times. If all the data is stuck on one keepstore server, that would be ~250 MiB/s on the storage node, which would be reasonable. If this is what's happening:
  • perhaps the job is pathological (filenames sorted differently when reading than when writing, e.g., 1,2,3 on the way in, 1,10,100,11 on the way out?)
  • perhaps the python SDK block cache is buggy?
  • perhaps we need #3640

Keepstore disk → 128 clients should certainly sustain way more than 12.8 MiB/s on su92l.
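A quick sanity check of the ~250 MiB/s figure, assuming every byte of the collection crosses the network about 20 times because of a poor block-cache hit rate:

```python
# If the 961 GiB collection is transmitted ~20x over the 22-hour run,
# a single "hot" keepstore server would have to sustain roughly:
total_mib = 961 * 1024
seconds = 22 * 60 * 60
refetch_factor = 20
server_rate = total_mib * refetch_factor / seconds
print(f"{server_rate:.0f} MiB/s")   # ~249 MiB/s, close to the ~250 MiB/s above
```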

#2 Updated by Brett Smith about 4 years ago

This is a very noisy experiment, enough so that I wonder how much it tells us about any one component.

The job is a shell script that calls a second shell script that's basically a utility wrapper around the Python SDK. Here's a list of all the shell commands in the job script that make API requests, and what those are:

  • arv-dax setup: Get the current task
  • arv-dax script_parameters: Get the current job
  • arv-dax task sequence: Get the current task
  • arv-dax task parameters: Get the current task
  • find: Get the 11MiB stress directory collection (via FUSE)
  • arv-dax task finish: Create a collection; get the current task; update the current task

You could go from eight API calls to four just by fetching the current task once and reading its individual fields as needed.
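The fix is plain memoization. Here's a minimal sketch; `fetch_current_task` is a hypothetical stand-in for one "get the current task" API request (names are illustrative, not the actual arv-dax internals), with a counter simulating round-trips to the API server:

```python
import functools

api_calls = 0  # simulated round-trips to the API server

def fetch_current_task():
    """Hypothetical stand-in for one 'get the current task' API request."""
    global api_calls
    api_calls += 1
    return {"sequence": 0, "parameters": {"input": "stress-dir-001"}}

@functools.lru_cache(maxsize=1)
def current_task():
    # First call hits the API; later calls reuse the cached record.
    return fetch_current_task()

# The wrapper's repeated task lookups now share one request:
seq = current_task()["sequence"]
params = current_task()["parameters"]
_ = current_task()      # e.g. 'task finish' re-reading the record
print(api_calls)        # 1 instead of 3
```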

It's also possible that the discovery document isn't being cached between these commands. If that's the case, then the number of API requests is almost doubled: each item in the list above needs to fetch the discovery document, then make its request(s).

The job has 16 * 8 == 128 tasks in flight simultaneously most of the time. We already know that API server response times degrade noticeably as request parallelism goes up; see #5901. I think it's very possible that the API server is at least as much your problem as Keep itself, if not more.

I don't deny that people are going to write job scripts like this, or that this job should run faster. But from this test alone, I think it's premature to conclude that there's a major problem with Keep performance.

#3 Updated by Brett Smith about 4 years ago

  • Subject changed from Keep performance is very slow to [Performance] Job to md5sum 1TiB of 1MiB files in parallel is very slow

#4 Updated by Brett Smith about 4 years ago

Brett Smith wrote:

It's also possible that the discovery document isn't being cached between these commands. If that's the case, then the number of API requests is almost doubled: each item in the list above needs to fetch the discovery document, then make its request(s).

This was bad speculation. crunch-job sets $HOME to the same path as $TASK_WORK, so even a job running inside a Docker image can write its discovery-document cache there with no problem.

#5 Updated by Tom Morris over 2 years ago

  • Subject changed from [Performance] Job to md5sum 1TiB of 1MiB files in parallel is very slow to [FUSE][Performance] Job to md5sum 1TiB of 1MiB files in parallel is very slow
  • Category set to Performance
  • Target version set to Arvados Future Sprints
