Story #8791

closed

[SDK] CollectionWriter file packing causes sub-optimal deduplication

Added by Peter Amstutz over 6 years ago. Updated over 5 years ago.

Status: Duplicate
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

The Python SDK CollectionWriter class used by arv-put concatenates files when writing the file stream. As a result, if the same files are written in a different order, the resulting blocks will frequently have different hashes (because file content falls across block boundaries differently), which breaks de-duplication. A user encountered this problem when re-uploading a data set that required a significant fraction of the underlying storage capacity.
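To illustrate the effect, here is a toy sketch (not the SDK code). It assumes fixed-size blocks cut from the concatenated stream and MD5 block hashes. The 8-byte block size and the names `block_hashes`, `big` and `small` are made up for the example; real Keep blocks are far larger.

```python
import hashlib

BLOCK_SIZE = 8  # toy value chosen for the example only


def block_hashes(files, block_size=BLOCK_SIZE):
    """Concatenate files into one stream, cut fixed-size blocks, hash each."""
    stream = b"".join(files)
    return [
        hashlib.md5(stream[i:i + block_size]).hexdigest()
        for i in range(0, len(stream), block_size)
    ]


big = bytes(range(20))  # "large" file spanning several blocks
small = b"bcd"          # small file written before or after it

small_first = block_hashes([small, big])
big_first = block_hashes([big, small])

# The 3-byte shift moves every block boundary inside the large file,
# so the two uploads have no block in common.
shared = set(small_first) & set(big_first)
print(len(shared))
```

Both orders store the same 23 bytes, but none of the block hashes match, so nothing is de-duplicated.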

CollectionWriter should ensure that large files are aligned to block boundaries; one possible behavior is to set a threshold size above which a file always triggers a flush of the current block and starts a new block in the file stream.
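A minimal sketch of that behavior, assuming a hypothetical `pack` helper rather than the real CollectionWriter API. As an extra assumption, it also flushes after a large file, so that the file's last block does not depend on whatever is written next. Block size and threshold are toy values.

```python
import hashlib


def pack(files, block_size, threshold):
    """Pack files into blocks; files larger than `threshold` get their own blocks."""
    blocks = []
    current = bytearray()

    def flush():
        # Emit the partially filled block, if any, and start a new one.
        if current:
            blocks.append(bytes(current))
            current.clear()

    for data in files:
        large = len(data) > threshold
        if large:
            flush()  # align the start of the large file to a block boundary
        pos = 0
        while pos < len(data):
            room = block_size - len(current)
            current.extend(data[pos:pos + room])
            pos += room
            if len(current) == block_size:
                flush()
        if large:
            flush()  # assumption: keep the large file's tail block independent
    flush()
    return blocks


def hashes(blocks):
    return {hashlib.md5(b).hexdigest() for b in blocks}


big = bytes(range(20))
small = b"bcd"
small_first = pack([small, big], block_size=8, threshold=16)
big_first = pack([big, small], block_size=8, threshold=16)
print(hashes(small_first) == hashes(big_first))
```

With alignment, both write orders produce the same set of blocks, at the cost of some partially filled blocks.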


Related issues

Related to Arvados - Bug #8769: re-upload seems to consume a lot of space (Resolved, 03/22/2016)

Related to Arvados - Feature #8992: arv-put: option to create a "stream" per-file (default) or per-collection (Closed, 04/14/2016)

Related to Arvados - Bug #9701: [SDKs] Python SDK Collection class should pack small files into large data blocks (Resolved, Lucas Di Pentima, 08/03/2016)

Actions #1

Updated by Peter Amstutz over 6 years ago

  • Subject changed from [SDK] arv-put file packing breaks deduplication to [SDK] CollectionWriter file packing breaks deduplication
  • Description updated (diff)
Actions #2

Updated by Brett Smith over 6 years ago

  • Target version set to Arvados Future Sprints
Actions #3

Updated by Peter Amstutz over 6 years ago

Going a bit further, we could also implement a rolling hash function to choose block boundaries during file upload, to improve deduplication even more:

https://en.wikipedia.org/wiki/Rolling_hash

https://crypto.stackexchange.com/questions/16082/cryptographically-secure-keyed-rolling-hash-function
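A rough sketch of the idea, assuming a plain polynomial rolling hash over a sliding window (not keyed and not cryptographically secure). The function name `cdc_chunks` and all parameters are hypothetical. A boundary is declared wherever the low bits of the window hash are zero, so boundaries depend only on nearby content, not on byte offsets. A real implementation would presumably also enforce minimum and maximum chunk sizes, which this sketch leaves out.

```python
import random

BASE = 257
MOD = (1 << 31) - 1  # prime modulus for the polynomial hash


def cdc_chunks(data, window=16, mask=0x3F):
    """Split `data` at positions where the rolling window hash matches `mask`."""
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        # Only cut once the window is full, so the decision is content-only.
        if i >= window - 1 and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4096))
chunks = cdc_chunks(data)
shifted = cdc_chunks(b"some new prefix" + data)

# Everything after the first boundary lines up again despite the shift.
reused = set(chunks[1:]) <= set(shifted)
print(reused)
```

Unlike fixed-size blocks, prepending data only disturbs the chunk around the insertion point; later chunks keep their content and therefore their hashes.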

Actions #4

Updated by Tom Clegg over 6 years ago

  • Subject changed from [SDK] CollectionWriter file packing breaks deduplication to [SDK] CollectionWriter file packing causes sub-optimal deduplication
Actions #5

Updated by Peter Grandi over 6 years ago

I have written a related feature request with a different motivation (and I know some of you don't think it is important...) than deduplication here, arguing that per-collection or per-file "streams" should be an option to arv-put:

https://dev.arvados.org/issues/8992

Actions #6

Updated by Brett Smith over 6 years ago

Peter Grandi wrote:

I have written a related feature request with a different motivation (and I know some of you don't think it is important...) than deduplication here, arguing that per-collection or per-file "streams" should be an option to arv-put:

https://dev.arvados.org/issues/8992

I understand the current story description is technical enough that it's not easy to tell, but what this ticket proposes is to have the default behavior be at least close to the per-file behavior you request in #8992. We might continue to combine "small" files into a single block (where the definition of "small" is TBD, and may be configurable, but the default threshold would probably be between 4 and 16 MiB), but we would make sure larger files get stored as an independent set of blocks, for the deduplication benefits you've talked about there and on IRC.

If that was the default behavior of arv keep put, would that be enough to obviate the need for #8992?

Actions #7

Updated by Brett Smith over 6 years ago

Peter Amstutz wrote:

CollectionWriter should ensure that large files are aligned to block boundaries

Do we actually have to change the behavior of CollectionWriter? Or would it be sufficient to change the behavior of arv keep put, either by having it flush CollectionWriter more often, or by upgrading it to use the Collection class instead?

I'd rather just change arv keep put, and not muck with CollectionWriter. If there's some reason that's not good enough from a user perspective, it'd be good to know.

Actions #8

Updated by Tom Morris almost 6 years ago

  • Status changed from New to Duplicate

Addressed by #9701

Actions #9

Updated by Tom Morris over 5 years ago

  • Target version deleted (Arvados Future Sprints)