Idea #8791
[SDK] CollectionWriter file packing causes sub-optimal deduplication (Closed)
Description
The Python SDK CollectionWriter class used by arv-put concatenates files when writing the file stream. As a result, if the same files are written in a different order, the resulting blocks will frequently have different hashes (because file content overlaps block boundaries differently), which breaks de-duplication. A user encountered this problem when re-uploading a data set that required a significant fraction of the underlying storage capacity.
CollectionWriter should ensure that large files are aligned to block boundaries; one possible behavior is to set a threshold size over which a file always triggers a flush of the current block and the start of a new block in the file stream.
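A toy sketch of the idea (this is not the real CollectionWriter API; the pack helper is hypothetical and the block and threshold sizes are scaled down for illustration):

```python
import hashlib

def pack(files, block_size=8, threshold=4):
    """Concatenate files into blocks of up to block_size bytes, flushing the
    pending buffer before and after any file of at least threshold bytes so
    that "large" files always start and end on a block boundary."""
    blocks, buf = [], b''
    def flush():
        nonlocal buf
        if buf:
            blocks.append(buf)
            buf = b''
    for data in files:
        large = len(data) >= threshold
        if large:
            flush()                        # align the large file's start
        buf += data
        while len(buf) >= block_size:
            blocks.append(buf[:block_size])
            buf = buf[block_size:]
        if large:
            flush()                        # ...and its end
    flush()
    return {hashlib.md5(b).hexdigest() for b in blocks}

small, big = b'xy', b'ABCDEFGHIJ'          # stand-ins for a small and a large file

# Proposed behavior: every block, including the large file's, hashes the same
# in either upload order, so the blocks deduplicate.
print(pack([small, big]) & pack([big, small]))

# Current behavior (threshold effectively disabled): pure concatenation, so
# reordering shifts every block boundary and no hashes are shared.
print(pack([small, big], threshold=10**9) & pack([big, small], threshold=10**9))
```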
Updated by Peter Amstutz almost 9 years ago
- Subject changed from [SDK] arv-put file packing breaks deduplication to [SDK] CollectionWriter file packing breaks deduplication
- Description updated (diff)
Updated by Brett Smith almost 9 years ago
- Target version set to Arvados Future Sprints
Updated by Peter Amstutz almost 9 years ago
Going a bit further, we could also implement a rolling hash function to choose block boundaries during file upload, to improve deduplication even more.
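For the record, a minimal sketch of content-defined chunking with a Rabin-style rolling hash; the window, mask, and size limits below are arbitrary illustrative values, and none of this exists in the SDK today:

```python
def content_defined_chunks(data, window=48, mask=(1 << 22) - 1,
                           min_size=1 << 20, max_size=16 << 20):
    """Split data at content-defined boundaries: a cut is made wherever the
    rolling hash of the last `window` bytes has its low bits all zero
    (subject to min/max chunk sizes), so boundaries follow the content
    rather than absolute offsets and survive insertions earlier in the file.
    With this mask the expected chunk size is about 4 MiB."""
    BASE, MOD = 257, (1 << 61) - 1
    top = pow(BASE, window - 1, MOD)       # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        if i >= window:                    # slide the window: drop the oldest byte
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + data[i]) % MOD
        size = i - start + 1
        if size >= max_size or (size >= min_size and (h & mask) == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```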
Updated by Tom Clegg almost 9 years ago
- Subject changed from [SDK] CollectionWriter file packing breaks deduplication to [SDK] CollectionWriter file packing causes sub-optimal deduplication
Updated by Peter Grandi almost 9 years ago
I have written a related feature request here, with a different motivation than deduplication (and I know some of you don't think it is important...), arguing that per-collection or per-file "streams" should be an option to arv-put.
Updated by Brett Smith almost 9 years ago
Peter Grandi wrote:
I have written a related feature request here, with a different motivation than deduplication (and I know some of you don't think it is important...), arguing that per-collection or per-file "streams" should be an option to arv-put.
I understand the current story description is technical enough that it's not easy to tell, but what this ticket proposes is to have the default behavior be at least close to the per-file behavior you request in #8992. We might continue to combine "small" files into a single block (where the definition of "small" is TBD, and may be configurable, but the default threshold would probably be between 4 and 16 MiB), but we would make sure larger files get stored as an independent set of blocks, for the deduplication benefits you've talked about there and on IRC.
If that were the default behavior of arv keep put, would that be enough to obviate the need for #8992?
Updated by Brett Smith almost 9 years ago
Peter Amstutz wrote:
CollectionWriter should ensure that large files are aligned to block boundaries
Do we actually have to change the behavior of CollectionWriter? Or would it be sufficient to change the behavior of arv keep put, either by having it "flush" CollectionWriter more often, or by upgrading it to use the Collection class instead?
I'd rather just change arv keep put and not muck with CollectionWriter. If there's some reason that's not good enough from a user perspective, it'd be good to know.
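For reference, the Collection-class route looks roughly like this; a sketch from memory of the arvados.collection.Collection API (open, save_new, manifest_text), so exact method names, modes, and version support should be checked against the SDK docs, and note that this alone does not decide the block-packing policy:

```python
import arvados.collection

# Sketch of what an arv keep put built on the Collection class might do;
# 'example.dat' and the collection name are placeholders.
coll = arvados.collection.Collection()
with open('example.dat', 'rb') as src, coll.open('example.dat', 'wb') as dst:
    while True:
        chunk = src.read(64 * 1024 * 1024)   # copy in block-sized pieces
        if not chunk:
            break
        dst.write(chunk)
coll.save_new(name='uploaded with Collection class')
print(coll.manifest_text())
```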
Updated by Tom Morris about 8 years ago
- Target version deleted (Arvados Future Sprints)