Feature #18961

closed

Go FileSystem / FUSE mount supports block prefetch

Added by Peter Amstutz about 2 years ago. Updated 12 days ago.

Status: Closed
Priority: Normal
Assigned To:
Category: FUSE
Story points: 2.0

Description

Use the following strategy for prefetch:

When a read happens on a file, look at the next N blocks that make up the manifest stream and issue prefetch requests for those blocks. These blocks get loaded into the cache so they are ready to go when they are needed.

By looking ahead in the stream rather than just the file, this also works for manifests containing small files which are stored as 1 block per file.
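
A minimal sketch of the look-ahead step, assuming the stream is available as an ordered list of block locators and that the reader knows the index of the block it is currently reading. The function name and signature are hypothetical and are not taken from the Arvados code. Because the list covers the whole stream, the look-ahead crosses file boundaries without any special handling.

```go
package main

// nextBlocks returns up to n block locators that follow index cur in
// the stream's ordered block list. It never looks past the end of the
// stream, so the result may be shorter than n (or empty).
//
// Hypothetical helper for illustration only.
func nextBlocks(stream []string, cur, n int) []string {
	if n <= 0 {
		return nil
	}
	start := cur + 1
	if start < 0 {
		start = 0
	}
	var out []string
	for i := start; i < len(stream) && len(out) < n; i++ {
		out = append(out, stream[i])
	}
	return out
}
```

Here n plays the role of the config knob: a site could raise or lower it to experiment with throughput.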

There should be a config knob to control how much data or blocks are prefetched so that sites can experiment with optimal throughput.

This implies the cache behavior needs to support pre-fetch, which means pre-fetched blocks should not push out actively used blocks but should be able to push out less recently used blocks. Plain LRU behavior where a block is promoted each time it is accessed may be sufficient but metrics will be helpful.
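
For reference, the "plain LRU" behavior mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the actual cache implementation: the type and method names are hypothetical, capacity is counted in blocks rather than bytes, and a prefetch is treated exactly like a read (both call Touch).

```go
package main

import "container/list"

// lruCache tracks which block locators are cached, in recency order.
// Hypothetical type for illustration only.
type lruCache struct {
	capacity int                      // maximum number of cached blocks
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // locator -> element in order
}

func newLRUCache(capacity int) *lruCache {
	return &lruCache{
		capacity: capacity,
		order:    list.New(),
		items:    map[string]*list.Element{},
	}
}

// Touch records an access to a block. An existing block is promoted to
// most-recently-used; a new block is inserted, and if that exceeds the
// capacity, the least recently used block is evicted and its locator
// is returned. Otherwise Touch returns "".
func (c *lruCache) Touch(locator string) (evicted string) {
	if el, ok := c.items[locator]; ok {
		c.order.MoveToFront(el)
		return ""
	}
	c.items[locator] = c.order.PushFront(locator)
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		evicted = oldest.Value.(string)
		delete(c.items, evicted)
	}
	return evicted
}

// Has reports whether a block is currently cached.
func (c *lruCache) Has(locator string) bool {
	_, ok := c.items[locator]
	return ok
}
```

With this policy, a block that is being read repeatedly stays near the front, so a newly prefetched block pushes out whichever block was touched least recently.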


Files

download-speed-1.8-GB-file.png (11.6 KB), Tom Clegg, 02/20/2024 02:48 PM
download-speed-1.8-GB-file-2.png (29.7 KB), Tom Clegg, 02/26/2024 02:54 PM
download-speed-1800-1MB-files.png (22.8 KB), Tom Clegg, 02/26/2024 08:03 PM
download-speed-1.8-GB-file-4.png (24.8 KB), Tom Clegg, 02/26/2024 09:26 PM
smol.png (20.1 KB), Tom Clegg, 02/27/2024 02:55 AM
smol2.png (21.2 KB), Tom Clegg, 02/27/2024 03:40 PM
smol3.png (22.1 KB), Tom Clegg, 02/27/2024 04:21 PM
smol4.png (30.5 KB), Tom Clegg, 02/27/2024 05:45 PM

Related issues

Related to Arvados Epics - Idea #17849: FUSE driver v2 (New)
Related to Arvados Epics - Idea #18342: Keep performance optimization (New, 08/01/2023, 05/30/2024)
Related to Arvados - Feature #20995: Prefetch small files when scanning a collection directory (Duplicate, Tom Clegg)
Related to Arvados - Feature #21606: configurable keep-web output buffer to reduce delay between blocks (Resolved, Tom Clegg)
Actions #1

Updated by Peter Amstutz about 2 years ago

Actions #2

Updated by Peter Amstutz about 2 years ago

  • Target version changed from 2022-05-11 sprint to 2022-05-25 sprint
Actions #3

Updated by Peter Amstutz almost 2 years ago

  • Target version deleted (2022-05-25 sprint)
Actions #4

Updated by Peter Amstutz about 1 year ago

  • Release set to 60
Actions #5

Updated by Peter Amstutz about 1 year ago

  • Target version set to Future
Actions #6

Updated by Peter Amstutz about 1 year ago

  • Release deleted (60)
  • Subject changed from Go FileSystem / FUSE mount supports block caching & prefetch to Go FileSystem / FUSE mount supports block prefetch
Actions #8

Updated by Peter Amstutz about 1 year ago

  • Story points set to 2.0
  • Target version changed from Future to To be scheduled
  • Description updated (diff)
Actions #9

Updated by Peter Amstutz 12 months ago

  • Target version changed from To be scheduled to Development 2023-05-10 sprint
Actions #10

Updated by Tom Clegg 12 months ago

  • Assigned To set to Tom Clegg
Actions #11

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2023-05-10 sprint to Development 2023-05-24 sprint
Actions #12

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2023-05-24 sprint to Development 2023-06-07
Actions #13

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2023-06-07 to Development 2023-06-21 sprint
Actions #14

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2023-06-21 sprint to To be scheduled
Actions #16

Updated by Peter Amstutz 10 months ago

  • Related to Idea #18342: Keep performance optimization added
Actions #17

Updated by Peter Amstutz 10 months ago

  • Description updated (diff)
Actions #18

Updated by Peter Amstutz 10 months ago

  • Description updated (diff)
Actions #19

Updated by Peter Amstutz 7 months ago

  • Target version changed from To be scheduled to Development 2023-10-25 sprint
Actions #20

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2023-10-25 sprint to Development 2023-11-08 sprint
Actions #21

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2023-11-08 sprint to Development 2023-11-29 sprint
Actions #22

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2023-11-29 sprint to Development 2024-01-03 sprint
Actions #23

Updated by Peter Amstutz 5 months ago

  • Target version changed from Development 2024-01-03 sprint to Development 2024-01-17 sprint
Actions #24

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2024-01-17 sprint to Development 2024-01-31 sprint
Actions #25

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Actions #26

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-01-31 sprint
Actions #27

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2024-01-31 sprint to Development 2024-02-14 sprint
Actions #28

Updated by Peter Amstutz 2 months ago

  • Target version changed from Development 2024-02-14 sprint to Development 2024-02-28 sprint
Actions #29

Updated by Peter Amstutz 2 months ago

  • Related to Feature #20995: Prefetch small files when scanning a collection directory added
Actions #30

Updated by Tom Clegg 2 months ago

Results of a few trials using a simplistic implementation with the easiest sample data (one big file, optimal block packing) and a small cache (big enough to accommodate pre-fetched blocks, but not big enough to retain data from one trial to the next):

(misleading chart removed)

"prefetch 0.5" starts pre-fetching the next block when the client has read 50% of the current block.

"no stream" trials use the current main version of keepstore.

"stream" trials use the unmerged version of keepstore from #2960, 62168c2db5.

Actions #31

Updated by Peter Amstutz 2 months ago

What's "prefetch 1" and "prefetch 2" ?

Actions #32

Updated by Peter Amstutz 2 months ago

Also, I'm a little confused about how to read this: if the lines in a box-and-whisker plot are supposed to be the minimum and maximum, how do you have lines that don't touch the box?

Actions #33

Updated by Tom Clegg about 2 months ago

improved chart

prefetch 0.5: prefetch block N+1 when reading 2nd half of block N

prefetch 1: prefetch block N+1 when reading block N

prefetch 2: prefetch block N+1 and N+2 when reading block N
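
The three settings above can be read as one rule, sketched below. The function name and parameters are hypothetical. Only the values 0.5, 1, and 2 are described in this thread; treating any other fractional value as "start prefetching once that fraction of the block remains" is an assumption made here for illustration.

```go
package main

// blocksToPrefetch returns the indexes of the blocks to prefetch while
// the client is reading block cur, having already read fractionRead
// (0.0 to 1.0) of it. total is the number of blocks in the stream.
//
//   depth 0.5: block cur+1, but only during the 2nd half of block cur
//   depth 1:   block cur+1
//   depth 2:   blocks cur+1 and cur+2
//
// Hypothetical helper for illustration only.
func blocksToPrefetch(depth float64, cur int, fractionRead float64, total int) []int {
	count := int(depth) // whole blocks to look ahead
	if depth > 0 && depth < 1 {
		// Fractional depth: prefetch one block, but wait until the
		// unread part of the current block is no larger than depth.
		if fractionRead >= 1-depth {
			count = 1
		}
	}
	var out []int
	for i := cur + 1; i < total && len(out) < count; i++ {
		out = append(out, i)
	}
	return out
}
```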

Actions #34

Updated by Tom Clegg about 2 months ago

  • Status changed from New to In Progress
Actions #35

Updated by Tom Clegg about 2 months ago

  • File download-speed-1800-1MB-files.png added

Small file download performance (using sequential curl invocations), 18961-block-prefetch @ 1dcde0921d vs. main

"prefetch 1 easy" is b1bd2898c1, with the easy prefetch implementation that only prefetches the next block in the current file / doesn't try to predict which file will be read next.

Actions #36

Updated by Tom Clegg about 2 months ago

  • File deleted (download-speed-1800-1MB-files.png)
Actions #37

Updated by Tom Clegg about 2 months ago

  • File download-speed-1800-1MB-files.png added
Actions #38

Updated by Peter Amstutz about 2 months ago

These charts make more sense than the old ones, but can you verify what the boxes/whiskers represent here? Are the whiskers the full range, and is the box the 25th and 75th percentiles? Where are the mean and median?

Is the 1800x1MB test fetching 1800 blocks, or fewer blocks with sequential packing?

What is the disk cache setting?

Actions #39

Updated by Peter Amstutz about 2 months ago

Also, is this download speed per file (average), or time to download all the files overall?

Tentatively, it looks like 1 block prefetch may be slightly slower than no prefetch, but also has less variance. However I think it also depends on how many trials you did and exactly what this is measuring. I think we need to go deep into the numbers here and make sure we understand exactly what is happening.

Actions #40

Updated by Tom Clegg about 2 months ago

  • File deleted (download-speed-1800-1MB-files.png)
Actions #42

Updated by Tom Clegg about 2 months ago

The boxes show Q1 and Q3, whiskers show min and max, mean is not shown.

Y axis is overall speed for a sequence of 1800 x 1 MB downloads (i.e., 1800 ÷ clock seconds).
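
To make the chart reading concrete, here is a sketch of how each box could be computed from a set of trials. The names are hypothetical, and the quartile method (linear interpolation between the two nearest sorted samples) is an assumption: the thread does not say which method the charting tool used, and other methods give slightly different Q1/Q3 values on small samples.

```go
package main

import "sort"

// boxStats holds the five values of a box-and-whisker plot.
// Hypothetical type for illustration only.
type boxStats struct {
	Min, Q1, Median, Q3, Max float64
}

// quantile returns the q-th quantile (0 <= q <= 1) of an already
// sorted slice, interpolating linearly between neighboring samples.
// It returns 0 for an empty slice.
func quantile(sorted []float64, q float64) float64 {
	if len(sorted) == 0 {
		return 0
	}
	pos := q * float64(len(sorted)-1)
	lo := int(pos)
	if lo >= len(sorted)-1 {
		return sorted[len(sorted)-1]
	}
	frac := pos - float64(lo)
	return sorted[lo] + frac*(sorted[lo+1]-sorted[lo])
}

// summarize computes box plot statistics without modifying samples.
func summarize(samples []float64) boxStats {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	return boxStats{
		Min:    quantile(s, 0),
		Q1:     quantile(s, 0.25),
		Median: quantile(s, 0.5),
		Q3:     quantile(s, 0.75),
		Max:    quantile(s, 1),
	}
}

// filesPerSecond is the Y-axis value for one trial: the number of
// files downloaded divided by elapsed wall-clock seconds.
func filesPerSecond(files int, clockSeconds float64) float64 {
	return float64(files) / clockSeconds
}
```

With only a handful of trials per configuration, the box and whiskers are each determined by one or two samples, which is one reason more trials were requested.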

The disk cache is about 1.2 GB, just small enough that the cache gets turned over from one trial to the next.

The manifest is well-packed (blocks are 64 MiB).

My conclusions:
  • other variables (cloud weather / AWS S3's own caching?) are significant
  • the "easy" version of prefetch might make small file performance slightly worse than no-prefetch
  • the latest / full version of prefetch looks best (possible, but unlikely, that the other unrelated variables just happened to work in its favor)

I did another set of large file trials with the latest version. I expected it would be slightly worse if anything (the new code does more work per read), but instead it looked better and one trial did exceptionally well. More than anything else it hints that we'll need more samples / better strategy to get convincing numbers. If we're going to do that, it might make more sense to do it on a more powerful VM.

Actions #43

Updated by Peter Amstutz about 2 months ago

I just wrote a bunch of comments and it ate them when I hit save...

What are the instance types of keep-web and the shell node where you are doing the downloading? We should probably have them both be something like m5n.large.

How many trials are you running? When you showed the data the other day, you had 5 data points. To increase confidence we should run like 20+ trials.

For the small files, a couple of thoughts:

  • If I'm reading this right, it is running at about 1/2 to 1/3 of the single-file transfer rate, which makes me suspect it is being dominated by connection setup and/or TCP slow start. I'd be curious what the difference is if the same sequential download was done in a single process using a single HTTP session.
  • I would like to see a test where the manifest has 1 block per file. To me, the goal of small file prefetch is to improve performance for manifests that are not packed -- so we should have numbers about how it performs in that case.

I'm also curious if less cache (600 MB instead of 1200 MB) makes any difference. Presumably it shouldn't but I think it would be a useful number.

We should also run trials where there is enough cache (2+ GB) so that we have an idea of how it performs in the best case.

Actions #44

Updated by Tom Clegg about 2 months ago

using 16x concurrent curl processes (xargs -P 16)

Actions #45

Updated by Tom Clegg about 2 months ago

For comparison: a single 969M file + warm cache = 530 to 630 MB/s.

Actions #46

Updated by Peter Amstutz about 2 months ago

Tom Clegg wrote in #note-44:

using 16x concurrent curl processes (xargs -P 16)

16 concurrent curl processes could be stepping on each other. What if the client is reading each file in sequence?

In what order is it reading files? Is it favorable for prefetch, or counterproductive?

Can we do a trial where we shuffle the access order randomly?

Also, I'd still like to see a version of this that uses a single TCP session to see if that meaningfully minimizes overhead from connection setup and TCP slow start.

Actions #47

Updated by Tom Clegg about 2 months ago

Same results as #note-44 plus new results for
  • latest version with "small file prefetch" disabled ("easyprefetch")
  • latest version with "small file prefetch" limited to the 1st segment of the next file in the stream ("prefetch-1seg")

Evidently, the initial "optimize for small files" prefetch implementation (prefetch up to 64 MiB past the current read point) performs worse in this particular "small files" scenario.

Even prefetching 1 block seems to be slightly detrimental (prefetching 2 blocks was slightly worse than 1). But perhaps it's helpful with different network/backend performance characteristics?

Actions #48

Updated by Tom Clegg about 2 months ago

Actions #49

Updated by Brett Smith about 2 months ago

I'm all for measuring things, but aren't all these performance numbers necessarily affected by disk and network performance? Which will vary across installs and applications? I'm a little wary of overoptimizing our general strategy based on the performance numbers from one specific setup.

Actions #50

Updated by Tom Clegg about 2 months ago

For these trials I used xargs -n 16 to reduce client-side overhead. This improves the overall transfer time, but it still shows the "prefetch for small files" feature (even if limited to 1, 2, or 4 blocks after the current file) giving slightly worse performance than the simpler "prefetch for large files" feature.

My suspicion is that if the download requests don't arrive in exactly the same order they were stored in the manifest then the demand on keepweb<->keepstore<->s3 gets lumpier, and therefore more likely to be affected by network/service limits.

In that case prefetch for small files might be productive only when there is more keepstore and s3/backend capacity.

Actions #51

Updated by Peter Amstutz about 2 months ago

What is the algorithm difference between "streaming+prefetch" and "streaming+easyprefetch" ?

If prefetch seems to be a loser, maybe we shouldn't do it at all? Have we found any cases where it clearly beats simple streaming?

Actions #52

Updated by Tom Clegg about 2 months ago

Peter Amstutz wrote in #note-51:

What is the algorithm difference between "streaming+prefetch" and "streaming+easyprefetch" ?

streaming+easyprefetch prefetches the next blocks in the current file until it is 64 MiB ahead.

streaming+prefetch prefetches the next blocks in the current file, then the next blocks in the lexically-next file(s) in the directory, until it is 64 MiB ahead.

streaming+prefetch-Nseg prefetches the next blocks in the current file, then up to N blocks in the lexically-next file(s) in the directory, until it is 64 MiB ahead.
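
The three variants differ only in how far they look into the following files, which can be sketched as a single function with one parameter. Everything below is a hypothetical illustration, not the code on the branch: the types, names, and the assumption that files are already sorted lexically are all made up for the sketch, and blocks shared between adjacent files are simply planned once.

```go
package main

// block is one data block referenced by a file.
type block struct {
	Locator string
	Size    int64
}

// file is one file in a directory, with its blocks in read order.
type file struct {
	Name   string
	Blocks []block
}

// unlimitedNextFileBlocks selects the "streaming+prefetch" behavior.
const unlimitedNextFileBlocks = -1

// prefetchPlan lists the locators to prefetch while the client reads
// files[curFile].Blocks[curBlock]. files must be in lexical order.
//
//   nextFileBlocks == 0:  easyprefetch (current file only)
//   nextFileBlocks == N:  prefetch-Nseg (at most N block references
//                         from the following files)
//   nextFileBlocks <  0:  prefetch (no limit on following files)
//
// In all cases planning stops once the plan is aheadLimit bytes ahead.
func prefetchPlan(files []file, curFile, curBlock, nextFileBlocks int, aheadLimit int64) []string {
	var plan []string
	var ahead int64
	seen := map[string]bool{files[curFile].Blocks[curBlock].Locator: true}
	// add appends b to the plan unless it is already planned. It
	// returns false once the plan has reached aheadLimit.
	add := func(b block) bool {
		if ahead >= aheadLimit {
			return false
		}
		if !seen[b.Locator] {
			seen[b.Locator] = true
			plan = append(plan, b.Locator)
			ahead += b.Size
		}
		return true
	}
	// Remaining blocks of the current file come first.
	for _, b := range files[curFile].Blocks[curBlock+1:] {
		if !add(b) {
			return plan
		}
	}
	// Then blocks of the lexically-next files, if allowed.
	taken := 0
	for _, f := range files[curFile+1:] {
		for _, b := range f.Blocks {
			if nextFileBlocks >= 0 && taken >= nextFileBlocks {
				return plan
			}
			if !add(b) {
				return plan
			}
			taken++
		}
	}
	return plan
}
```

In the trials above, aheadLimit corresponds to 64 MiB.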

If prefetch seems to be a loser, maybe we shouldn't do it at all? Have we found any cases where it clearly beats simple streaming?

#note-42 suggests prefetch might increase the maximum download speed for large files.

Generally, prefetch helps when backend latency is high, but it seems like backend latency is not in fact high now that keepstore itself doesn't add a store-and-forward delay.

Perhaps a better feature is a configurable-size output buffer in keep-web, so (provided the backend throughput is faster than the client throughput, which is also the only situation in which prefetch can help) the client-side buffer drains while the backend is waiting for the next block. Besides using up some more memory, I don't think this would be worse than the current behavior in any situation.

This doesn't help small files at all, but neither does prefetch, so....

Actions #53

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2024-02-28 sprint to Development 2024-03-13 sprint
Actions #54

Updated by Tom Clegg about 1 month ago

  • Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
Actions #55

Updated by Tom Clegg about 1 month ago

  • Related to Feature #21606: configurable keep-web output buffer to reduce delay between blocks added
Actions #56

Updated by Tom Clegg about 1 month ago

  • Status changed from In Progress to Closed