Story #8791

[SDK] CollectionWriter file packing causes sub-optimal deduplication

Added by Peter Amstutz about 5 years ago. Updated about 4 years ago.

Status: Duplicate
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

The Python SDK CollectionWriter class used by arv-put concatenates files when writing the file stream. As a result, if the same files are written in a different order, the resulting blocks frequently have different hashes (because file content falls across block boundaries differently), which breaks deduplication. A user encountered this problem when re-uploading a data set that required a significant fraction of the underlying storage capacity.

CollectionWriter should ensure that large files are aligned to block boundaries; one possible behavior is to set a threshold size above which a file always triggers a flush of the current block and starts a new block in the file stream.
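
A minimal sketch of the difference, not the SDK's actual code: pack_blocks, block_size and align_threshold are hypothetical names, MD5 stands in for the block hash, and a tiny block size is used so the example runs quickly. This sketch also flushes after a large file so its tail block is not shared with the next file, which is an assumption that goes beyond the behavior described above.

```python
import hashlib


def pack_blocks(files, block_size, align_threshold=None):
    """Pack a list of file contents (bytes) into blocks; return block hashes.

    align_threshold=None concatenates everything (naive packing).
    Otherwise a file of at least align_threshold bytes starts on a block
    boundary and (assumption of this sketch) also ends on one.
    """
    blocks = []
    buf = bytearray()

    def flush():
        # Emit whatever is buffered as a (possibly short) block.
        if buf:
            blocks.append(hashlib.md5(bytes(buf)).hexdigest())
            buf.clear()

    for data in files:
        large = align_threshold is not None and len(data) >= align_threshold
        if large:
            flush()  # start the large file on a fresh block
        buf.extend(data)
        while len(buf) >= block_size:
            blocks.append(hashlib.md5(bytes(buf[:block_size])).hexdigest())
            del buf[:block_size]
        if large:
            flush()  # keep the large file's tail in its own block
    flush()
    return blocks


# Deterministic, non-repeating test data: one "large" file and two small ones.
big = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(320))
small_a = b"a" * 100
small_b = b"b" * 300

order_1 = [small_a, big, small_b]
order_2 = [small_b, big, small_a]

naive_1 = pack_blocks(order_1, block_size=4096)
naive_2 = pack_blocks(order_2, block_size=4096)
aligned_1 = pack_blocks(order_1, block_size=4096, align_threshold=1024)
aligned_2 = pack_blocks(order_2, block_size=4096, align_threshold=1024)

print("naive blocks shared:  ", len(set(naive_1) & set(naive_2)))
print("aligned blocks shared:", len(set(aligned_1) & set(aligned_2)))
```

With naive packing the two upload orders shift the large file's content relative to the block boundaries, so the block hashes differ; with alignment both orders produce the same set of blocks, at the cost of a few short blocks.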


Related issues

Related to Arvados - Bug #8769: re-upload seems to consume a lot of space (Resolved, 03/22/2016)

Related to Arvados - Feature #8992: arv-put: option to create a "stream" per-file (default) or per-collection (Closed, 04/14/2016)

Related to Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blocks (Resolved, 08/03/2016)

History

#1 Updated by Peter Amstutz about 5 years ago

  • Subject changed from [SDK] arv-put file packing breaks deduplication to [SDK] CollectionWriter file packing breaks deduplication
  • Description updated

#2 Updated by Brett Smith about 5 years ago

  • Target version set to Arvados Future Sprints

#3 Updated by Peter Amstutz about 5 years ago

Going a bit further, we could also implement a rolling hash function to choose block boundaries during file upload, to improve deduplication even more:

https://en.wikipedia.org/wiki/Rolling_hash

https://crypto.stackexchange.com/questions/16082/cryptographically-secure-keyed-rolling-hash-function
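
To make the idea concrete, here is a rough sketch of content-defined chunking with a simple polynomial rolling hash. Everything in it is illustrative: the function names, window size, mask and size limits are made up for this example, the hash is not keyed or cryptographically secure (unlike what the second link discusses), and real block sizes would be far larger. A boundary is placed wherever the low bits of the hash over the last WINDOW bytes are zero, so boundaries follow the content rather than absolute offsets.

```python
import hashlib

WINDOW = 16                       # bytes covered by the rolling hash
BASE = 257
MOD = (1 << 31) - 1
POW_OUT = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def content_chunks(data, mask=0x3F, min_size=32, max_size=1024):
    """Split data where the rolling hash's low bits are zero."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Window is full: drop the oldest byte's contribution.
            h = (h - data[i - WINDOW] * POW_OUT) % MOD
        h = (h * BASE + byte) % MOD
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data, size=96):
    """Split data at fixed offsets, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def hashes(chunks):
    return {hashlib.md5(c).hexdigest() for c in chunks}


# Deterministic pseudo-random data, then the same data with a few bytes
# inserted at the front (which shifts every later offset).
data = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(625))
shifted = b"INSERTED" + data

shared_fixed = hashes(fixed_chunks(data)) & hashes(fixed_chunks(shifted))
shared_cdc = hashes(content_chunks(data)) & hashes(content_chunks(shifted))

print("fixed-size chunks shared:     ", len(shared_fixed))
print("content-defined chunks shared:", len(shared_cdc))
```

With fixed offsets the inserted bytes change every chunk, whereas the content-defined boundaries line up again shortly after the insertion point, so later chunks keep their hashes.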

#4 Updated by Tom Clegg about 5 years ago

  • Subject changed from [SDK] CollectionWriter file packing breaks deduplication to [SDK] CollectionWriter file packing causes sub-optimal deduplication

#5 Updated by Peter Grandi about 5 years ago

I have written a related feature request with a different motivation (and I know some of you don't think it is important...) than deduplication here, arguing that per-collection or per-file "streams" should be an option to arv-put:

https://dev.arvados.org/issues/8992

#6 Updated by Brett Smith about 5 years ago

Peter Grandi wrote:

I have written a related feature request with a different motivation (and I know some of you don't think it is important...) than deduplication here, arguing that per-collection or per-file "streams" should be an option to arv-put:

https://dev.arvados.org/issues/8992

I understand the current story description is technical enough that it's not easy to tell, but what this ticket proposes is to have the default behavior be at least close to the per-file behavior you request in #8992. We might continue to combine "small" files into a single block (where the definition of "small" is TBD, and may be configurable, but the default threshold would probably be between 4 and 16 MiB), but we would make sure larger files get stored as an independent set of blocks, for the deduplication benefits you've talked about there and on IRC.

If that was the default behavior of arv keep put, would that be enough to obviate the need for #8992?

#7 Updated by Brett Smith about 5 years ago

Peter Amstutz wrote:

CollectionWriter should ensure that large files are aligned to block boundaries

Do we actually have to change the behavior of CollectionWriter? Or would it be sufficient to change the behavior of arv keep put, either by having it flush CollectionWriter more often or by upgrading it to use the Collection class instead?

I'd rather just change arv keep put, and not muck with CollectionWriter. If there's some reason that's not good enough from a user perspective, it'd be good to know.

#8 Updated by Tom Morris over 4 years ago

  • Status changed from New to Duplicate

Addressed by #9701

#9 Updated by Tom Morris about 4 years ago

  • Target version deleted (Arvados Future Sprints)
