[SDKs] Python SDK Collection class should pack small files into large data blocks
In Keep, small blocks incur a lot of overhead per data byte. Clients are supposed to mitigate this by writing one large block instead of many tiny blocks, even when storing small files. CollectionWriter performs well in this respect, but Collection does not.
Test case:
- 2000 files (filenames = 113K)
- 38M data
- Keep volumes are local disks (lower latency than cloud/network-backed keepstores)
|code path|arv-put runtime|manifest size|# data blocks|
|CollectionWriter (pre-#9463 arv-put)|2 s|137K|1|
|Collection (post-#9463 arv-put)|44 s|300K|2000|
Add an optional "flush" argument to the close() method (default True). If False is given, don't commit block data right away.
When allocating a new bufferblock, first check whether there is a bufferblock that
- is uncommitted, and
- is small enough to accommodate the next file the caller is about to write (or max_block_size÷2 if the next file's size is not known), and
- only contains data for files that have been closed.
If such a bufferblock exists, append to it instead of allocating a new one. This should achieve good block packing performance while avoiding pathological manifests when multiple files are being written concurrently.
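The selection rule above can be sketched roughly as follows. This is only an illustration, not the SDK's code: the BufferBlock class, its fields, and find_reusable_block() are hypothetical names, and the 64 MiB value for max_block_size is an assumption about the configured Keep block size.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed block size limit for this sketch (64 MiB).
MAX_BLOCK_SIZE = 64 * 1024 * 1024


@dataclass
class BufferBlock:
    """Hypothetical stand-in for the SDK's bufferblock, for illustration only."""
    size: int = 0            # bytes currently buffered
    committed: bool = False  # True once the data has been (or is being) sent to Keep
    open_files: int = 0      # files with data in this block that are still open


def find_reusable_block(blocks: List[BufferBlock],
                        next_file_size: Optional[int] = None,
                        max_block_size: int = MAX_BLOCK_SIZE) -> Optional[BufferBlock]:
    """Return the first block that meets all three conditions, or None."""
    # If the caller cannot tell us the next file's size, reserve half a block.
    needed = next_file_size if next_file_size is not None else max_block_size // 2
    for block in blocks:
        if block.committed:
            continue  # must be uncommitted
        if block.open_files > 0:
            continue  # must only contain data for closed files
        if block.size + needed <= max_block_size:
            return block  # has room for the next file
    return None  # caller allocates a new bufferblock
```

With this rule, a writer that closes each small file before opening the next keeps appending to the same block, while concurrent writers never share a block because the block still has an open file.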
Use close(flush=False) in arv-put.
Alternative solution (maybe)
Write lots of small bufferblocks like we do now, but merge them into larger blocks when it's time to commit them. This could handle "small file, large file, small file" writing patterns better, and wouldn't rely on the caller's ability to predict file sizes and communicate them to the SDK. However, it might be more difficult to implement.
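One possible shape for the merge step is sketched below, under assumptions: plan_merges() and small_threshold are hypothetical names, and the policy shown (large blocks are committed alone, small blocks are packed together in the order they were written) is just one way to handle the "small file, large file, small file" pattern.

```python
from typing import List, Optional


def plan_merges(pending_sizes: List[int],
                max_block_size: int = 64 * 1024 * 1024,
                small_threshold: Optional[int] = None) -> List[List[int]]:
    """Group pending bufferblocks (given by size) into blocks to commit.

    Returns a list of groups; each group is a list of indexes into
    pending_sizes whose data would be written as one Keep block.
    """
    if small_threshold is None:
        # Assumption: anything over half a block is not worth merging.
        small_threshold = max_block_size // 2
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(pending_sizes):
        if size > small_threshold:
            groups.append([index])  # large block: commit on its own
            continue
        if current and current_size + size > max_block_size:
            groups.append(current)  # merged block is full, start another
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

For example, plan_merges([10, 90, 5], max_block_size=100) returns [[1], [0, 2]]: the large middle block is committed alone and the two small ones share a block, without the caller having to predict any file sizes.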
Related improvements not included here
Use rolling hashes to choose block transitions.
Add mechanism for a caller to pass in the anticipated size of a file being written, so Collection et al. can make better decisions.
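For reference, the rolling-hash idea mentioned above might look something like the sketch below. It uses a generic Rabin-Karp style polynomial hash over a sliding window and is not taken from any SDK code; chunk_boundaries() and all of its parameters are hypothetical, and the tiny default sizes are chosen only to make the behavior easy to see (real Keep blocks would be far larger).

```python
from typing import List


def chunk_boundaries(data: bytes, window: int = 48, mask: int = (1 << 12) - 1,
                     min_size: int = 256, max_size: int = 8192) -> List[int]:
    """Return end offsets of chunks chosen by a rolling hash of the content."""
    base = 257
    mod = (1 << 61) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    boundaries: List[int] = []
    h = 0
    start = 0  # offset where the current chunk begins
    for i, byte in enumerate(data):
        if i - start >= window:
            # Window is full: drop the oldest byte before adding the new one.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        length = i - start + 1
        # Cut where the hash matches the mask (once past min_size),
        # or unconditionally once the chunk reaches max_size.
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            boundaries.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        boundaries.append(len(data))  # final partial chunk
    return boundaries
```

Because each cut depends only on the bytes in the window, identical content tends to produce identical block boundaries regardless of what was written before it, which is what would make this attractive for choosing block transitions.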