Idea #8791

closed

[SDK] CollectionWriter file packing causes sub-optimal deduplication

Added by Peter Amstutz about 8 years ago. Updated over 7 years ago.

Status:
Duplicate
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Story points:
-

Description

The Python SDK CollectionWriter class used by arv-put concatenates files when writing the file stream. As a result, if the same files are written in a different order, the resulting blocks will frequently have different hashes (because file content overlaps the block boundaries differently), which breaks de-duplication. A user encountered this problem when re-uploading a data set that required a significant fraction of the underlying storage capacity.
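A minimal sketch of the failure mode (not the actual CollectionWriter code; block size is shrunk to a few bytes for illustration, whereas real Keep blocks are up to 64 MiB, and Keep addresses blocks by MD5):

```python
import hashlib

BLOCK_SIZE = 8  # tiny block size for illustration only


def pack(files):
    """Concatenate file contents and split into fixed-size blocks,
    returning the MD5 hash of each block."""
    data = b"".join(files)
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return [hashlib.md5(b).hexdigest() for b in blocks]


a = b"AAAAAA"  # 6-byte "file"
b = b"BBBBBB"  # 6-byte "file"

# Same two files, two upload orders: the block boundaries cut through
# the content differently, so no block hash is shared between uploads.
h1 = pack([a, b])  # blocks: b"AAAAAABB", b"BBBB"
h2 = pack([b, a])  # blocks: b"BBBBBBAA", b"AAAA"
print(set(h1) & set(h2))  # empty set: zero deduplication
```

Because the hashes depend on where each file falls relative to the block grid, a byte-identical re-upload in a different order stores an entirely new set of blocks.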

CollectionWriter should ensure that large files are aligned to block boundaries. One possible behavior is to set a threshold size: any file at or above that size triggers a flush of the current block and starts a new block in the file stream.
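The proposed threshold behavior could be sketched like this (a hypothetical illustration, not SDK code; `BLOCK_SIZE` and `FLUSH_THRESHOLD` are toy values chosen so the effect is visible on a few bytes):

```python
import hashlib

BLOCK_SIZE = 8       # illustration only; real Keep blocks are up to 64 MiB
FLUSH_THRESHOLD = 4  # hypothetical cutoff: files this large start on a block boundary


def pack_aligned(files):
    """Pack files into blocks, flushing the current partial block before
    any file at or above FLUSH_THRESHOLD, so large files always start on
    a block boundary. Returns the MD5 hash of each block."""
    blocks, buf = [], b""
    for f in files:
        if len(f) >= FLUSH_THRESHOLD and buf:
            blocks.append(buf)  # flush the partial block before a large file
            buf = b""
        buf += f
        while len(buf) >= BLOCK_SIZE:
            blocks.append(buf[:BLOCK_SIZE])
            buf = buf[BLOCK_SIZE:]
    if buf:
        blocks.append(buf)
    return [hashlib.md5(b).hexdigest() for b in blocks]


a = b"AAAAAA"
b = b"BBBBBB"

# Both orders now produce the same set of blocks, so re-uploads deduplicate.
print(set(pack_aligned([a, b])) == set(pack_aligned([b, a])))  # True
```

With block-aligned large files, each file's blocks depend only on that file's own content, so reordering the upload no longer changes any block hash.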


Related issues

Related to Arvados - Bug #8769: re-upload seems to consume a lot of space (Resolved, 03/22/2016)
Related to Arvados - Feature #8992: arv-put: option to create a "stream" per-file (default) or per-collection (Closed, 04/14/2016)
Related to Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blocks (Resolved, Lucas Di Pentima, 08/03/2016)
