Bug #9701

closed

[SDKs] Python SDK Collection class should pack small files into large data blocks

Added by Tom Clegg about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: SDKs
Target version:
Start date: 08/03/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 2.0

Description

Background

In Keep, small blocks incur a lot of overhead per data byte. Clients are supposed to mitigate this by writing one large block instead of lots of tiny blocks, even when storing small files. CollectionWriter performs well in this respect but Collection does not.

Test case:
  • 2000 files (filenames = 113K)
  • 38M data
  • Keep volumes are local disks (lower latency than cloud/network-backed keepstores)

code path                              arv-put runtime   manifest size   # data blocks
CollectionWriter (pre-#9463 arv-put)   2 s               137K            1
Collection (post-#9463 arv-put)        44 s              300K            2000

Proposed solution

Add an optional "flush" argument to the close() method (default True). If False is given, don't commit block data right away.
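As a rough illustration of the intended close() semantics, here is a minimal toy model. It does not use the real SDK: ToyKeep, ToyCollection, ToyFile and store_files are hypothetical names invented for this sketch, and the real Collection class manages bufferblocks in a more involved way.

```python
class ToyKeep:
    """Hypothetical stand-in for a Keep client: records each block written."""

    def __init__(self):
        self.blocks = []

    def put(self, data):
        self.blocks.append(bytes(data))


class ToyCollection:
    """Hypothetical collection holding a single uncommitted buffer."""

    def __init__(self, keep):
        self.keep = keep
        self.pending = bytearray()  # data not yet committed to Keep

    def open(self, name):
        return ToyFile(self)

    def flush(self):
        # Commit whatever is buffered as one block.
        if self.pending:
            self.keep.put(self.pending)
            self.pending = bytearray()


class ToyFile:
    def __init__(self, collection):
        self.collection = collection

    def write(self, data):
        self.collection.pending += data

    def close(self, flush=True):
        # flush=True (the default) commits block data right away.
        # flush=False leaves it buffered so later files can share the block.
        if flush:
            self.collection.flush()


def store_files(files, flush):
    """Write all files, then commit leftovers; return the number of blocks."""
    keep = ToyKeep()
    coll = ToyCollection(keep)
    for name, data in files:
        f = coll.open(name)
        f.write(data)
        f.close(flush=flush)
    coll.flush()
    return len(keep.blocks)


files = [("file%d" % i, b"x" * 10) for i in range(5)]
print(store_files(files, flush=True))   # one block per file
print(store_files(files, flush=False))  # all files packed into one block
```

The toy ignores the maximum block size and concurrent writers; it only shows why deferring the commit lets small files end up in the same block.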

When allocating a new bufferblock, first check whether there is an existing bufferblock that
  • is uncommitted, and
  • is small enough to accommodate the next file the caller is about to write (or is no larger than max_block_size÷2 if the next file's size is not known), and
  • only contains data for files that have been closed.

If such a bufferblock exists, append to it instead of allocating a new one.
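The three conditions above can be sketched as a selection function. This is only an illustration under assumptions: BufferBlock here is a hypothetical dataclass (not the SDK's internal class), open_files is an invented field counting files with data in the block that are still open, and the 64 MiB default for max_block_size is an assumed value.

```python
from dataclasses import dataclass

ASSUMED_MAX_BLOCK_SIZE = 64 * 1024 * 1024  # assumption: 64 MiB block limit


@dataclass
class BufferBlock:
    """Hypothetical model of a bufferblock for this sketch."""
    size: int = 0            # bytes currently buffered
    committed: bool = False  # True once written to Keep
    open_files: int = 0      # files with data here that are not yet closed


def find_reusable_block(blocks, next_file_size=None,
                        max_block_size=ASSUMED_MAX_BLOCK_SIZE):
    """Return the first block meeting all three conditions, or None."""
    for block in blocks:
        if block.committed:
            continue  # must be uncommitted
        if block.open_files > 0:
            continue  # must only contain data for closed files
        if next_file_size is None:
            # Size unknown: require the block to be at most half full.
            fits = block.size <= max_block_size // 2
        else:
            fits = block.size + next_file_size <= max_block_size
        if fits:
            return block
    return None  # caller allocates a fresh bufferblock


blocks = [
    BufferBlock(size=10, committed=True),
    BufferBlock(size=20, open_files=1),
    BufferBlock(size=70),
    BufferBlock(size=30),
]
print(find_reusable_block(blocks, next_file_size=25, max_block_size=100))
print(find_reusable_block(blocks, max_block_size=100))
```

With a limit of 100, the known-size call picks the 70-byte block (70 + 25 fits), while the unknown-size call skips it (more than half full) and picks the 30-byte block.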

This should achieve good block packing performance, but avoid producing pathological manifests when multiple files are being written concurrently.

Use close(flush=False) in arv-put.

Alternative solution (maybe)

Write lots of small bufferblocks like we do now, but merge them into larger blocks when it's time to commit them. This could handle "small file, large file, small file" writing patterns better, and wouldn't rely on the caller's ability to predict file sizes and communicate them to the SDK. However, it might be more difficult to implement.
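A minimal sketch of the merge-at-commit idea, assuming the small bufferblocks are kept as (file name, data) pairs until commit time and that large files are committed separately. pack_small_blocks is a hypothetical helper, not an SDK function, and the greedy in-order strategy is just one possible choice.

```python
def pack_small_blocks(pending, max_block_size):
    """Greedily merge small pending buffers into larger blocks.

    pending: list of (file_name, data) pairs, each no larger than
    max_block_size. Returns a list of (block_bytes, segments), where each
    segment is (file_name, offset_in_block, length), so a manifest could
    still locate every file's data inside the merged block.
    """
    packed = []
    current = bytearray()
    segments = []
    for name, data in pending:
        # Start a new block when this buffer would overflow the current one.
        if current and len(current) + len(data) > max_block_size:
            packed.append((bytes(current), segments))
            current, segments = bytearray(), []
        segments.append((name, len(current), len(data)))
        current += data
    if segments:
        packed.append((bytes(current), segments))
    return packed


pending = [("a", b"xx"), ("b", b"yyy"), ("c", b"zzzz")]
for block, segs in pack_small_blocks(pending, max_block_size=5):
    print(block, segs)
```

Because merging happens only at commit time, the caller never needs to announce file sizes in advance; the cost is the extra bookkeeping of segments per merged block.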

Related improvements not included here

Use rolling hashes to choose block transitions.
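For readers unfamiliar with the idea, here is a generic sketch of content-defined chunking with a polynomial rolling hash. It is not taken from the Arvados code base: chunk_boundaries and all its parameters (window, mask, base, mod) are illustrative choices, and a practical version would also enforce minimum and maximum chunk sizes.

```python
import random


def chunk_boundaries(data, window=16, mask=0x3FF,
                     base=257, mod=(1 << 61) - 1):
    """Return end offsets of chunks chosen by a rolling hash.

    A boundary is placed after position i when the hash of the last
    `window` bytes has all `mask` bits set, so boundaries depend only on
    nearby content, not on absolute offsets.
    """
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    h = 0
    boundaries = []
    for i, byte in enumerate(data):
        if i >= window:
            # Remove the byte that slides out of the window.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        if i + 1 >= window and (h & mask) == mask:
            boundaries.append(i + 1)
    # Always end the final chunk at the end of the data.
    if data and (not boundaries or boundaries[-1] != len(data)):
        boundaries.append(len(data))
    return boundaries


rng = random.Random(0)
data = bytes(rng.randrange(256) for _ in range(20000))
prefix = b"some inserted header bytes"

original = chunk_boundaries(data)
shifted = chunk_boundaries(prefix + data)
# Past the inserted prefix, boundaries move by exactly len(prefix).
realigned = {q - len(prefix) for q in shifted if q >= len(prefix) + 16}
print(realigned == set(original))
```

The point of the final comparison is that inserting bytes at the front shifts the boundaries rather than changing them, which is what lets identical content produce identical blocks.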

Add a mechanism for a caller to pass in the anticipated size of a file being written, so Collection et al. can make better decisions.


Subtasks 1 (0 open, 1 closed)

Task #10052: Review 9701-collection-pack-small-files-alt (Resolved, Lucas Di Pentima, 08/03/2016)


Related issues

Related to Arvados - Story #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplication (Duplicate)

Blocks Arvados - Story #9463: [SDKs] Change arv-put to use the Collection class under the hood (Resolved, Lucas Di Pentima, 07/11/2016)
