Idea #8791
closed
[SDK] CollectionWriter file packing causes sub-optimal deduplication
Added by Peter Amstutz almost 9 years ago.
Updated about 8 years ago.
Description
The Python SDK CollectionWriter class used by arv-put concatenates files when writing the file stream. As a result, if the same files are written in a different order, the resulting blocks will frequently have different hashes (because file content overlaps block boundaries differently), which breaks de-duplication. A user encountered this problem when re-uploading a data set that required a significant fraction of the underlying storage capacity.
CollectionWriter should ensure that large files are aligned to block boundaries; one possible behavior is to set a threshold size over which a file always triggers flushing the current block and starting a new block in the file stream.
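To illustrate the effect, here is a toy model of the packing behavior described above (not SDK code; the flush threshold is illustrative): plain concatenation makes block contents depend on upload order, while flushing before each large file yields the same set of blocks in any order.

    import hashlib

    BLOCK_SIZE = 64 * 2**20          # Keep's 64 MiB block limit
    THRESHOLD = 8 * 2**20            # illustrative "large file" cutoff

    def pack(files, align_large=False):
        """Pack (name, data) pairs into blocks; return each block's md5."""
        blocks, buf = [], b''
        for _, data in files:
            if align_large and len(data) >= THRESHOLD and buf:
                blocks.append(buf)   # flush the partial block before a large file
                buf = b''
            buf += data
            while len(buf) >= BLOCK_SIZE:
                blocks.append(buf[:BLOCK_SIZE])
                buf = buf[BLOCK_SIZE:]
        if buf:
            blocks.append(buf)
        return [hashlib.md5(b).hexdigest() for b in blocks]

    files = [('a', b'a' * 40 * 2**20), ('b', b'b' * 40 * 2**20), ('c', b'c' * 40 * 2**20)]
    reordered = list(reversed(files))

    print(set(pack(files)) & set(pack(reordered)))               # set() -- no shared blocks
    print(set(pack(files, True)) == set(pack(reordered, True)))  # True -- full dedup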
- Subject changed from [SDK] arv-put file packing breaks deduplication to [SDK] CollectionWriter file packing breaks deduplication
- Description updated (diff)
- Target version set to Arvados Future Sprints
- Subject changed from [SDK] CollectionWriter file packing breaks deduplication to [SDK] CollectionWriter file packing causes sub-optimal deduplication
I have written a related feature request, with a motivation different from deduplication (and I know some of you don't think it is important...), arguing that per-collection or per-file "streams" should be an option to arv-put: https://dev.arvados.org/issues/8992
Peter Grandi wrote:
I have written a related feature request, with a motivation different from deduplication (and I know some of you don't think it is important...), arguing that per-collection or per-file "streams" should be an option to arv-put: https://dev.arvados.org/issues/8992
I understand the current story description is technical enough that it's not easy to tell, but what this ticket proposes is to have the default behavior be at least close to the per-file behavior you request in #8992. We might continue to combine "small" files into a single block (where the definition of "small" is TBD, and may be configurable, but the default threshold would probably be between 4 and 16 MiB), but we would make sure larger files get stored as an independent set of blocks, for the deduplication benefits you've talked about there and on IRC.
If that was the default behavior of arv keep put, would that be enough to obviate the need for #8992?
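To make the proposed default concrete, a rough sketch of how files could map onto Keep blocks under that policy (the threshold, names, and helper are hypothetical, not the real arv-put code): files below the threshold keep sharing a block, files at or above it start on a block boundary.

    BLOCK_SIZE = 64 * 2**20
    THRESHOLD = 8 * 2**20            # "small" cutoff; actual value TBD / configurable

    def plan_blocks(files):
        """files: (name, size) pairs; return blocks as lists of (name, nbytes)."""
        blocks, current, used = [], [], 0
        for name, size in files:
            if size >= THRESHOLD and current:
                blocks.append(current)      # large file starts on a block boundary
                current, used = [], 0
            remaining = size
            while remaining:
                take = min(remaining, BLOCK_SIZE - used)
                current.append((name, take))
                used += take
                remaining -= take
                if used == BLOCK_SIZE:
                    blocks.append(current)
                    current, used = [], 0
        if current:
            blocks.append(current)
        return blocks

    layout = plan_blocks([('small1', 2**20), ('small2', 2**20), ('big', 100 * 2**20)])
    for i, block in enumerate(layout):
        print(i, block)
    # 0: small1 and small2 share one block; 1-2: big occupies its own blocks,
    # so re-uploading big alongside different files reproduces the same block contents.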
Peter Amstutz wrote:
CollectionWriter should ensure that large files are aligned to block boundaries
Do we actually have to change the behavior of CollectionWriter? Or would it be sufficient to change the behavior of arv keep put, either by having it flush CollectionWriter more often, or by upgrading it to use the Collection class instead?
I'd rather just change arv keep put, and not muck with CollectionWriter. If there's some reason that's not good enough from a user perspective, it'd be good to know.
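For what it's worth, a very rough, untested sketch of that approach at the arv-put level, assuming CollectionWriter's start_new_file(), write(), flush_data() and manifest_text() behave as their names suggest; the threshold and helper function are hypothetical.

    import os
    import arvados

    THRESHOLD = 8 * 2**20            # hypothetical "large file" cutoff

    def upload_paths(paths):
        writer = arvados.CollectionWriter()
        for path in paths:
            large = os.path.getsize(path) >= THRESHOLD
            if large:
                writer.flush_data()  # start this file's data in a fresh block
            writer.start_new_file(os.path.basename(path))
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(2**20), b''):
                    writer.write(chunk)
            if large:
                writer.flush_data()  # keep following small files out of this block
        return writer.manifest_text()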
- Status changed from New to Duplicate
- Target version deleted (Arvados Future Sprints)