Feature #22320

Add Repack(opts RepackOptions) method to collectionfs, dirnode, and filehandle

Added by Tom Clegg 5 months ago. Updated 11 days ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
SDKs
Target version:
Story points:
-

Description

From Efficient block packing for small WebDAV uploads
  • filehandle method only needs to be supported when the target is a dirnode (repacking a single file could be useful, e.g., for the fuse driver, but is not needed for webdav)
  • traverse dir/filesystem, finding opportunities to merge small (<32MiB) blocks into larger (>=32MiB) blocks
  • optionally (opts.Underutilized) merge segments from underutilized blocks into [larger] fully-utilized blocks -- note this shouldn't be used for single-directory repacking, because the unreferenced portions of blocks might be referenced by files elsewhere in the collection
  • optionally (opts.CachedOnly) skip blocks that aren't in the local cache; see diskCacheProber below
  • optionally (opts.Full) generate optimal repacking based on assumption that no further files will be written (we might postpone implementing this at first, since it's not needed for webdav)
  • optionally (opts.DryRun) don't apply changes, just report what would happen (for tests and possibly a future Workbench feature that hints when explicit repack is advisable)
  • remember which segments got remapped, so the changes can be pushed later; see Sync below
  • repacking algorithm performance goal: reasonable amortized cost & reasonably well-packed collection when called after each file in a set of sequential/concurrent small file writes
    • e.g., after writing 64 100-byte files, there should be fewer than 64 blocks, but the first file's data should have been rewritten far fewer than 64 times
    • test suite should confirm decent performance in some pathological cases
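The amortized-cost goal above can be illustrated with a toy simulation (a sketch only; the merge policy and all names here are hypothetical, not the SDK's actual algorithm). A binary-counter-style rule, which merges the two newest blocks whenever they are within a factor of two in size, keeps the block count logarithmic while rewriting each file's data only a handful of times:

```go
package main

import "fmt"

// simulate models writing n small files of fileSize bytes each, with a toy
// binary-counter-style merge policy applied after every write: while the
// newest block is at least half the size of the block before it, merge the
// two (rewriting their data). It returns the final block count and the
// number of times the first file's data was rewritten.
func simulate(n, fileSize int) (numBlocks, firstFileRewrites int) {
	type block struct {
		size   int
		writes []int // indices of the writes whose data this block holds
	}
	var blocks []block
	rewrites := make([]int, n)
	for i := 0; i < n; i++ {
		// Each small write initially lands in its own new block.
		blocks = append(blocks, block{fileSize, []int{i}})
		for len(blocks) >= 2 {
			a, b := blocks[len(blocks)-2], blocks[len(blocks)-1]
			if a.size > 2*b.size {
				break // size classes too far apart; stop merging
			}
			merged := block{a.size + b.size, append(append([]int{}, a.writes...), b.writes...)}
			for _, w := range merged.writes {
				rewrites[w]++ // a merge rewrites all constituent data once
			}
			blocks = append(blocks[:len(blocks)-2], merged)
		}
	}
	return len(blocks), rewrites[0]
}

func main() {
	nb, rw := simulate(64, 100)
	fmt.Printf("blocks=%d firstFileRewrites=%d\n", nb, rw)
}
```

For 64 writes of 100 bytes this ends with a handful of blocks (adjacent blocks always differ by more than 2x in size, so the count is logarithmic in total data), and the first file's data is rewritten roughly log-many times rather than 64.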
Add diskCacheProber type that allows caller to efficiently check whether a block is in local cache
  • copy an existing DiskCache and change its KeepGateway to a gateway that fails reads/writes
  • to check whether a block is in cache, ask the DiskCache to read 0 bytes
  • avoids the cost of transferring any data or connecting to a backend
  • edge case: this will also return true for a block that is currently being read from a backend into the cache. This is arguably not really "in cache", and reading the data could still be slow or return a backend error; however, it should be OK to treat such a block as available for repacking purposes.
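The probing idea can be sketched with toy types (everything below is illustrative; the real DiskCache/KeepGateway APIs differ, and as discussed later in this ticket the merged implementation ultimately used an explicit CheckCacheOnly flag instead):

```go
package main

import (
	"errors"
	"fmt"
)

// Toy stand-ins for the cache/gateway layering; not the real SDK types.

type gateway interface {
	ReadAt(locator string, p []byte, off int) (int, error)
}

// failGateway refuses all reads, so anything that misses the cache fails
// immediately instead of being fetched from a backend.
type failGateway struct{}

func (failGateway) ReadAt(string, []byte, int) (int, error) {
	return 0, errors.New("probe: not in cache")
}

// diskCache serves present blocks locally and delegates misses to inner.
type diskCache struct {
	data  map[string][]byte
	inner gateway
}

func (c *diskCache) ReadAt(locator string, p []byte, off int) (int, error) {
	if b, ok := c.data[locator]; ok {
		return copy(p, b[off:]), nil
	}
	return c.inner.ReadAt(locator, p, off)
}

// inCache probes by asking for 0 bytes: no data is transferred, and the
// failing inner gateway turns every cache miss into an immediate error.
func inCache(c *diskCache, locator string) bool {
	_, err := c.ReadAt(locator, nil, 0)
	return err == nil
}

func main() {
	prober := &diskCache{
		data:  map[string][]byte{"abc+3": []byte("foo")},
		inner: failGateway{},
	}
	fmt.Println(inCache(prober, "abc+3")) // true
	fmt.Println(inCache(prober, "def+3")) // false
}
```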

Update (collectionFileSystem)Sync() to invoke replace_segments if the collection has been repacked
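A minimal sketch of the bookkeeping this implies, with hypothetical names (the real SDK tracks this inside the collection filesystem and pushes it via the API's replace_segments mechanism):

```go
package main

import "fmt"

// segment identifies a byte range within a stored block (illustrative; the
// SDK's internal representation differs).
type segment struct {
	Locator string
	Offset  int
	Length  int
}

// repackLog accumulates old-segment -> new-segment mappings as repacking
// proceeds, so a later Sync can push them all at once.
type repackLog struct {
	remapped map[segment]segment
}

func (r *repackLog) note(from, to segment) {
	if r.remapped == nil {
		r.remapped = make(map[segment]segment)
	}
	r.remapped[from] = to
}

// payload returns the accumulated mapping for the collection update and
// clears the log, so the same changes are only pushed once.
func (r *repackLog) payload() map[segment]segment {
	p := r.remapped
	r.remapped = nil
	return p
}

func main() {
	var log repackLog
	log.note(segment{"aaa+100", 0, 100}, segment{"bbb+3200", 0, 100})
	log.note(segment{"ccc+100", 0, 100}, segment{"bbb+3200", 100, 100})
	fmt.Println(len(log.payload())) // 2
	fmt.Println(len(log.payload())) // 0: already flushed
}
```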


Files

22320-writes-vs-content.png (12 KB) 22320-writes-vs-content.png Tom Clegg, 02/11/2025 04:27 PM
22320-bytes-vs-files-saved.png (9.56 KB) 22320-bytes-vs-files-saved.png Tom Clegg, 02/11/2025 04:27 PM
22320-blocks-vs-files-saved.png (40.4 KB) 22320-blocks-vs-files-saved.png Tom Clegg, 02/11/2025 04:27 PM
22320-100x8x1-bytes.png (12 KB) 22320-100x8x1-bytes.png Tom Clegg, 03/07/2025 03:43 PM
22320-100x8x1-blocks.png (31.4 KB) 22320-100x8x1-blocks.png Tom Clegg, 03/07/2025 03:43 PM
22320-100x8x1-time.png (11.7 KB) 22320-100x8x1-time.png Tom Clegg, 03/07/2025 03:43 PM
22320-100x10-blocks.png (13.5 KB) 22320-100x10-blocks.png Tom Clegg, 03/07/2025 03:43 PM
22320-100x10-bytes.png (13.4 KB) 22320-100x10-bytes.png Tom Clegg, 03/07/2025 03:43 PM
22320-100x10-time.png (16.6 KB) 22320-100x10-time.png Tom Clegg, 03/07/2025 03:43 PM
22320-1000x1-blocks.png (31.5 KB) 22320-1000x1-blocks.png Tom Clegg, 03/07/2025 03:43 PM
22320-1000x1-bytes.png (9.53 KB) 22320-1000x1-bytes.png Tom Clegg, 03/07/2025 03:43 PM
22320-1000x1-time.png (9.03 KB) 22320-1000x1-time.png Tom Clegg, 03/07/2025 03:43 PM
22320-sourcetree-blocks.png (40.9 KB) 22320-sourcetree-blocks.png Tom Clegg, 03/07/2025 03:44 PM
22320-sourcetree-time.png (13 KB) 22320-sourcetree-time.png Tom Clegg, 03/07/2025 03:44 PM
22320-sourcetree-bytes.png (11.8 KB) 22320-sourcetree-bytes.png Tom Clegg, 03/07/2025 03:44 PM

Subtasks 1 (0 open, 1 closed)

Task #22344: Review 22320-cached-only (Resolved, Tom Clegg, 03/17/2025)

Related issues 2 (0 open2 closed)

Related to Arvados - Idea #20996: Efficient packing of small files into blocks in keep-web (Resolved, Tom Clegg, 11/13/2024)
Precedes (7 days) Arvados - Bug #22666: Add tests that keepstore.BlockRead properly handles CheckCacheOnly option (Resolved, Tom Clegg)
Actions #1

Updated by Tom Clegg 5 months ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg 5 months ago

  • Related to Idea #20996: Efficient packing of small files into blocks in keep-web added
Actions #3

Updated by Peter Amstutz 4 months ago

  • Target version changed from Development 2024-12-04 to Development 2025-01-08
Actions #4

Updated by Peter Amstutz 3 months ago

  • Target version changed from Development 2025-01-08 to Development 2025-01-29
Actions #5

Updated by Peter Amstutz 2 months ago

  • Target version changed from Development 2025-01-29 to Development 2025-02-12
Actions #6

Updated by Peter Amstutz 2 months ago

  • Target version changed from Development 2025-02-12 to Development 2025-02-26
Actions #7

Updated by Tom Clegg 2 months ago

  • Status changed from New to In Progress
Actions #8

Updated by Peter Amstutz 2 months ago

  • Target version changed from Development 2025-02-26 to Development 2025-02-12
Actions #10

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2025-02-12 to Development 2025-02-26
Actions #11

Updated by Brett Smith about 1 month ago

  • Target version changed from Development 2025-02-26 to Development 2025-03-19
Actions #14

Updated by Tom Clegg 28 days ago

Here are some charts, generated with 2a0360bfc3, showing repacking results with storage and time cost for a few different access patterns.

Each scenario has three charts: blocks in manifest (y) after write+repack ops (x); bytes written to backend (y) after bytes written to collection (x); cumulative repack time in seconds (y) after write+repack ops (x).

Scenarios:
  • 100x 10M files (no repacking)
  • 1000x 1M files
  • 100x 8M files, but the first 1M of all 100 files is written first, then the next 1M of each file, etc.
  • the arvados source tree in lexical order
Notes:
  • Repack time includes checking whether repacking is needed, and rewriting data if so. In most cases repacking is not needed, so on average, time is mostly determined by the number of segments in the manifest.
  • None of the examples have a backend/content ratio over ~2. IOW after the client has uploaded X bytes, we have written ≤2X bytes to the backend, so we're leaving ≤X bytes to be garbage-collected.
  • Timing charts are pretty smooth, which means the repacking work is being spread out pretty evenly across a sequence of client writes.
  • I find the source tree example very encouraging: in a (pretty much) worst case, when the uploaded file sizes are variable and in no particular order, the repacking effort is well amortized and the manifest doesn't have an outrageous number of blocks at any point.
Actions #15

Updated by Tom Clegg 28 days ago

22320-repack-small-blocks @ 5ef9e603fe3b12b3edf44fbab973d6dfbb7c0a03 -- developer-run-tests: #4684

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • "Underutilized" not implemented -- I don't think we have a specific use case in mind so I think we should drop it for now
    • "CachedOnly" not implemented yet -- I'd like to do this in separate branch after reviewing/merging this branch
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • ✅ automated tests added
  • New or changed UI/UX and has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • ✅ comment in sdk/go/arvados.CollectionFileSystem interface
  • Behaves appropriately at the intended scale (describe intended scale).
    • see charts above
  • Considered backwards and forwards compatibility issues between client and server.
    • ✅ n/a
  • Follows our coding standards and GUI style guidelines.
Actions #17

Updated by Brett Smith 23 days ago

Tom Clegg wrote in #note-15:

22320-repack-small-blocks @ 5ef9e603fe3b12b3edf44fbab973d6dfbb7c0a03 -- developer-run-tests: #4684

So, my feelings about testRepackCost and its callers are written in the commit message of 9de62c160f7cc9c9c994e4ced447cd5bef88333e. I don't have any problem with automated performance checks in principle. But our existing test infrastructure doesn't feel like a good place to run them, at least as-is, because we rarely look at our test results beyond the boolean pass/fail result. And so we quickly get to the point where the cost of running these "tests" with every test run, and maintaining the code, outweighs the value we get from them.

(If I wanted to do some bad armchair psychoanalysis (I don't), I could read into the fact that you save the reports to a throwaway tmp directory and not some dedicated "results" directory. I understand we don't really have any pattern or precedent for the latter, but still, show a little conviction man.)

Here are some ideas that would make me feel like merging this was more worthwhile. I'm not saying any of these are required for merge, just as points to consider and maybe discuss:

  • Checking that seconds_repacking doesn't exceed a particular threshold. I'm okay with that threshold being conservative, accounting for varying test hardware, as long as we're on the same page that conservative ≠ pathologically high. At least then in principle the test would fail on a significant performance regression. (Maybe we could check that a given case doesn't exceed some multiple of a simple base case?)
  • A commitment to capture the reports as Jenkins build artifacts.
  • Rigging up the test runner, one way or another, so that someone running test sdk/go/arvados gets a message near the end of the output, after all routine test logging, letting them know where the report files were written. This would at least improve discoverability and make us less likely to forget about them.

Beyond that, just one typo and one maybe-typo I noticed in sdk/go/arvados/fs_collection_test.go:

// Pack sibling files' ("a" and "b") - You meant ("a" and "c") here.

locator := SignLocator(fmt.Sprintf("%x+%d", md5.Sum(buf), ...)) - I have fixed at least one test bug by changing code like this to use %032x in the format string, to make sure we always get a 32-character md5sum even with leading zeros. Maybe it's not an issue here, now, but maybe an ounce of prevention is worth a pound of cure in future development?

Thanks.

Actions #18

Updated by Brett Smith 22 days ago

Brett Smith wrote in #note-17:

  • Checking that seconds_repacking doesn't exceed a particular threshold. I'm okay with that threshold being conservative, accounting for varying test hardware, as long as we're on the same page that conservative ≠ pathologically high. At least then in principle the test would fail on a significant performance regression. (Maybe we could check that a given case doesn't exceed some multiple of a simple base case?)

Now that I've slept on it I think this is actually a bad idea. There are dedicated tools for doing performance testing and checking for regressions and stuff like that. They do things like running your code multiple times and collecting enough data to be able to tell the difference between a one-time fluke and a real code regression. If we want something like that, we should pick one of these tools and build on it. We should not try to build a half-baked version into our test suite that isn't designed for this kind of data collection and reporting.

Actions #19

Updated by Tom Clegg 22 days ago

22320-repack-small-blocks @ b6a9ce888c5c53ab5e81b8fbc5495e5266583e49 -- developer-run-tests: #4705

22320-repack-small-blocks @ a686e95fbae4a1b388f3de73a199cecfa32babd0 -- developer-run-tests: #4706

I've added a "first 500 files in source tree" test, and disabled the rest (and the TSV file generation) behind an ARVADOS_TEST_REPACK_CHARTS env var. Rather than check timing, they just check write magnification < 4 and finishing with a reasonable number of blocks. This way
  • we're only adding ~1 second to the test suite, not tens of seconds
  • we're not generating extra files in miscellaneous places until someone explicitly enables that
  • we ensure the code that generates chart data still compiles, and still passes the "partial source tree" case, so it is likely to still work in the future when we do want to enable it
  • we continue to test for two obvious/bad bugs:
    • writing N bytes of data across many files results in >4N worth of writes on the backend due to overly aggressive repacking
    • repacking isn't doing anything/much at all, so writing N small files results in [nearly] N small blocks
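The two bug checks described above amount to simple inequalities. A hypothetical sketch (the 4x threshold is from the note; the function name, signature, and block-count threshold are illustrative, not the actual test code):

```go
package main

import "fmt"

// checkRepack flags the two failure modes described above: overly aggressive
// repacking (backend writes exceeding 4x the data written) and ineffective
// repacking (nearly one block per small file).
func checkRepack(bytesWritten, backendBytes, nFiles, nBlocks int) error {
	if backendBytes > 4*bytesWritten {
		return fmt.Errorf("write magnification %d/%d exceeds 4x", backendBytes, bytesWritten)
	}
	if nBlocks > nFiles/2 {
		return fmt.Errorf("repacking ineffective: %d blocks for %d files", nBlocks, nFiles)
	}
	return nil
}

func main() {
	fmt.Println(checkRepack(1000, 1900, 500, 30)) // within both limits
	fmt.Println(checkRepack(1000, 5000, 500, 30)) // magnification too high
}
```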
Actions #20

Updated by Tom Clegg 21 days ago

Brett Smith wrote in #note-17:

locator := SignLocator(fmt.Sprintf("%x+%d", md5.Sum(buf), ...)) - I have fixed at least one test bug by changing code like this to use %032x in the format string, to make sure we always get a 32-character md5sum even with leading zeros. Maybe it's not an issue here, now, but maybe an ounce of prevention is worth a pound of cure in future development?

%x already outputs the leading zeroes -- in docs under "slice of bytes":

%x    base 16, lower-case, two characters per byte

(If you had to fix this somewhere, I'm guessing it was an int being formatted rather than a slice...?)

Actions #21

Updated by Brett Smith 21 days ago

Tom Clegg wrote in #note-19:

22320-repack-small-blocks @ a686e95fbae4a1b388f3de73a199cecfa32babd0 -- developer-run-tests: #4706

LGTM, thanks.

(If you had to fix this somewhere, I'm guessing it was an int being formatted rather than a slice...?)

Also possible it was a different language entirely, sprintf formatters are only semi-standardized and I don't remember the details that well.

Actions #22

Updated by Tom Clegg 21 days ago

22320-repack-small-blocks merged.

22320-cached-only @ a8819a88b21756b32db30dd8078ac005e757868f -- developer-run-tests: #4708

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • ✅ if RepackOptions.CachedOnly is true, only plan to repack blocks that are in local cache
    • The implementation details changed. I started to implement diskCacheProber as described above, but I wasn't comfortable making the ReadAt API weird enough for it to work ("if reading into a specific magical slice whose memory address we recognize, and data is not in cache, then return an error"). Instead I added an explicit CheckCacheOnly field to BlockReadOptions. This required adding the BlockRead method to some other interfaces/stubs that previously only needed ReadAt, but I think it's much more sane.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • ✅ added (*keepCacheSuite)TestBlockRead_CheckCacheOnly
    • manual testing n/a
  • New or changed UI/UX and has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • ✅ comment on CheckCacheOnly field
  • Behaves appropriately at the intended scale (describe intended scale).
    • ✅ this does add some code and <=1 open() call per manifest segment, so it will be slower in the no-repack case. I think #22321 will be a better place to assess performance.
  • Considered backwards and forwards compatibility issues between client and server.
    • ✅ no concerns
  • Follows our coding standards and GUI style guidelines.
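The options-field approach can be sketched with simplified types (names are modeled on the description above, not the SDK's exact API; the normal read path is elided):

```go
package main

import (
	"errors"
	"fmt"
)

// blockReadOptions is a simplified stand-in for the SDK's BlockReadOptions;
// only the fields needed for this sketch are shown.
type blockReadOptions struct {
	Locator        string
	CheckCacheOnly bool
}

type cacheGateway struct {
	cached map[string]bool // locators present in the local cache
}

var errNotCached = errors.New("block not in local cache")

// BlockRead returns errNotCached on a cache miss when CheckCacheOnly is set,
// without contacting any backend; otherwise it would follow the normal read
// path.
func (g *cacheGateway) BlockRead(opts blockReadOptions) error {
	if opts.CheckCacheOnly {
		if !g.cached[opts.Locator] {
			return errNotCached
		}
		return nil // present; no data transferred
	}
	// ... normal read path (cache, then backend) elided ...
	return nil
}

func main() {
	g := &cacheGateway{cached: map[string]bool{"aaa+100": true}}
	fmt.Println(g.BlockRead(blockReadOptions{Locator: "aaa+100", CheckCacheOnly: true}))
	fmt.Println(g.BlockRead(blockReadOptions{Locator: "bbb+100", CheckCacheOnly: true}))
}
```

Compared with the magic-slice idea, the flag makes the caller's intent explicit in the request itself, at the cost of plumbing BlockRead through interfaces that previously only needed ReadAt.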
Actions #23

Updated by Brett Smith 21 days ago

Tom Clegg wrote in #note-22:

22320-cached-only @ a8819a88b21756b32db30dd8078ac005e757868f

My only comment is I feel like keepstore.BlockRead should get a couple small tests to make sure it handles the new CheckCacheOnly option correctly. (The other users I'll grant seem to be testing/internal-only and less critical.) Other than that, lgtm.

Actions #24

Updated by Brett Smith 18 days ago

  • Precedes Bug #22666: Add tests that keepstore.BlockRead properly handles CheckCacheOnly option added
Actions #25

Updated by Brett Smith 18 days ago

Given everything going on, I went ahead and merged the branch as-is, and filed #22666 for the requested tests. It's okay if they come a little later since this functionality doesn't have any immediate users. But we should make sure it's tested before it gets users.

Actions #26

Updated by Peter Amstutz 16 days ago

  • Target version changed from Development 2025-03-19 to Development 2025-04-02
Actions #27

Updated by Tom Clegg 11 days ago

  • Status changed from In Progress to Resolved