Feature #22320

closed

Add Repack(opts RepackOptions) method to collectionfs, dirnode, and filehandle

Added by Tom Clegg 5 months ago. Updated 19 days ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: SDKs
Target version:
Story points: -

Description

From "Efficient block packing for small WebDAV uploads":
  • filehandle method only needs to be supported when target is a dirnode (repacking a file could be useful, e.g., fuse driver, but not needed for webdav)
  • traverse dir/filesystem, finding opportunities to merge small (<32MiB) blocks into larger (>=32MiB) blocks
  • optionally (opts.Underutilized) merge segments from underutilized blocks into [larger] fully-utilized blocks -- note this shouldn't be used for single-directory repacking, because the unreferenced portions of blocks might be referenced by files elsewhere in the collection
  • optionally (opts.CachedOnly) skip blocks that aren't in the local cache; see diskCacheProber below
  • optionally (opts.Full) generate optimal repacking based on assumption that no further files will be written (we might postpone implementing this at first, since it's not needed for webdav)
  • optionally (opts.DryRun) don't apply changes, just report what would happen (for tests and possibly a future Workbench feature that hints when explicit repack is advisable)
  • remember which segments got remapped, so the changes can be pushed later; see Sync below
  • repacking algorithm performance goal: reasonable amortized cost & reasonably well-packed collection when called after each file in a set of sequential/concurrent small file writes
    • e.g., after writing 64 100-byte files, there should be fewer than 64 blocks, but the first file's data should have been rewritten far fewer than 64 times
    • test suite should confirm decent performance in some pathological cases
Add diskCacheProber type that allows caller to efficiently check whether a block is in local cache
  • copy an existing DiskCache and change its KeepGateway to a gateway that fails reads/writes
  • to check whether a block is in cache, ask the DiskCache to read 0 bytes
  • avoids the cost of transferring any data or connecting to a backend
  • edge case: this will also return true for a block that is currently being read from a backend into the cache. Such a block is arguably not really "in cache", and reading its data could still be slow or return a backend error, but it should be OK to treat it as available for repacking purposes.
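The probing idea above can be sketched with toy types. Everything here (`gateway`, `mapGateway`, `toyCache`, `newProber`, `inCache`) is a hypothetical stand-in written for this example, not the real DiskCache or KeepGateway API. The point is only the technique: copy the cache, swap in a backend that always fails, and treat a successful 0-byte read as "in cache".

```go
package main

import (
	"errors"
	"fmt"
)

// gateway is a hypothetical, minimal stand-in for a block backend.
type gateway interface {
	ReadAt(locator string, dst []byte, offset int) (int, error)
}

var errNoBackend = errors.New("backend disabled for cache probing")

// failingGateway rejects every read, so a cache miss fails immediately
// instead of connecting to a backend.
type failingGateway struct{}

func (failingGateway) ReadAt(string, []byte, int) (int, error) {
	return 0, errNoBackend
}

// mapGateway is a toy backend that serves blocks from memory.
type mapGateway map[string][]byte

func (g mapGateway) ReadAt(locator string, dst []byte, offset int) (int, error) {
	buf, ok := g[locator]
	if !ok || offset > len(buf) {
		return 0, errors.New("block not found or offset out of range")
	}
	return copy(dst, buf[offset:]), nil
}

// toyCache is an in-memory stand-in for a disk cache: it serves cached
// blocks itself and passes misses through to its backend.
type toyCache struct {
	backend gateway
	data    map[string][]byte
}

func (c *toyCache) ReadAt(locator string, dst []byte, offset int) (int, error) {
	if buf, ok := c.data[locator]; ok {
		if offset > len(buf) {
			return 0, errors.New("offset beyond end of block")
		}
		return copy(dst, buf[offset:]), nil
	}
	return c.backend.ReadAt(locator, dst, offset)
}

// newProber copies the cache (sharing its cached data) and replaces the
// backend with one that fails every read.
func newProber(c *toyCache) *toyCache {
	p := *c
	p.backend = failingGateway{}
	return &p
}

// inCache asks the prober to read 0 bytes; success means the block is cached.
func inCache(p *toyCache, locator string) bool {
	_, err := p.ReadAt(locator, nil, 0)
	return err == nil
}

func main() {
	cache := &toyCache{
		backend: mapGateway{"blk2": []byte("remote data")},
		data:    map[string][]byte{"blk1": []byte("hello")},
	}
	prober := newProber(cache)
	fmt.Println(inCache(prober, "blk1"), inCache(prober, "blk2")) // prints true false
}
```

Because the prober shares state with the original cache in this sketch, it would also report true for a block that is mid-transfer if the toy cache modeled in-flight reads, matching the edge case noted above.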

Update (collectionFileSystem)Sync() to invoke replace_segments if the collection has been repacked
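One way to picture the interaction between repacking and Sync is the sketch below. The `segment` and `toyFS` types and the `replaceSegments` hook are hypothetical names invented for this example; the hook merely stands in for whatever request carries replace_segments, whose real form is not shown in this issue.

```go
package main

import "fmt"

// segment identifies a byte range within a block (illustrative only).
type segment struct {
	Locator string
	Offset  int
	Length  int
}

// toyFS remembers which segments a repack remapped so the changes can be
// pushed later.
type toyFS struct {
	remapped map[segment]segment
	// replaceSegments is a hypothetical hook standing in for the request
	// that sends replace_segments to the API server.
	replaceSegments func(map[segment]segment) error
}

// recordRemap notes that old data now lives at repl.
func (fs *toyFS) recordRemap(old, repl segment) {
	if fs.remapped == nil {
		fs.remapped = make(map[segment]segment)
	}
	fs.remapped[old] = repl
}

// Sync pushes pending remaps only if the filesystem has been repacked.
// On failure the pending remaps are kept so a later Sync can retry.
func (fs *toyFS) Sync() error {
	if len(fs.remapped) == 0 {
		return nil
	}
	if err := fs.replaceSegments(fs.remapped); err != nil {
		return err
	}
	fs.remapped = nil
	return nil
}

func main() {
	fs := &toyFS{replaceSegments: func(m map[segment]segment) error {
		fmt.Println("pushing", len(m), "remapped segment(s)")
		return nil
	}}
	fs.recordRemap(segment{"old", 0, 100}, segment{"merged", 0, 100})
	fmt.Println(fs.Sync()) // pushes once, then prints <nil>
	fmt.Println(fs.Sync()) // nothing pending, prints <nil>
}
```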


Files

22320-writes-vs-content.png (12 KB), Tom Clegg, 02/11/2025 04:27 PM
22320-bytes-vs-files-saved.png (9.56 KB), Tom Clegg, 02/11/2025 04:27 PM
22320-blocks-vs-files-saved.png (40.4 KB), Tom Clegg, 02/11/2025 04:27 PM
22320-100x8x1-bytes.png (12 KB), Tom Clegg, 03/07/2025 03:43 PM
22320-100x8x1-blocks.png (31.4 KB), Tom Clegg, 03/07/2025 03:43 PM
22320-100x8x1-time.png (11.7 KB), Tom Clegg, 03/07/2025 03:43 PM
22320-100x10-blocks.png (13.5 KB), Tom Clegg, 03/07/2025 03:43 PM
22320-100x10-bytes.png (13.4 KB), Tom Clegg, 03/07/2025 03:43 PM
22320-100x10-time.png (16.6 KB), Tom Clegg, 03/07/2025 03:43 PM
22320-1000x1-blocks.png (31.5 KB), Tom Clegg, 03/07/2025 03:43 PM
22320-1000x1-bytes.png (9.53 KB), Tom Clegg, 03/07/2025 03:43 PM
22320-1000x1-time.png (9.03 KB), Tom Clegg, 03/07/2025 03:43 PM
22320-sourcetree-blocks.png (40.9 KB), Tom Clegg, 03/07/2025 03:44 PM
22320-sourcetree-time.png (13 KB), Tom Clegg, 03/07/2025 03:44 PM
22320-sourcetree-bytes.png (11.8 KB), Tom Clegg, 03/07/2025 03:44 PM

Subtasks 1 (0 open, 1 closed)

Task #22344: Review 22320-cached-only, Resolved, Tom Clegg, 03/17/2025

Related issues 2 (0 open, 2 closed)

Related to Arvados - Idea #20996: Efficient packing of small files into blocks in keep-web, Resolved, Tom Clegg, 11/13/2024
Precedes (7 days) Arvados - Bug #22666: Add tests that keepstore.BlockRead properly handles CheckCacheOnly option, Resolved, Tom Clegg