Idea #8791
closed
[SDK] CollectionWriter file packing causes sub-optimal deduplication
Added by Peter Amstutz almost 9 years ago.
Updated about 8 years ago.
Description
The Python SDK CollectionWriter class used by arv-put concatenates files when writing the file stream. As a result, if the same files are written in a different order, the resulting blocks will frequently have different hashes (because file content overlaps block boundaries differently), which breaks de-duplication. A user encountered this problem when re-uploading a data set that required a significant fraction of the underlying storage capacity.
CollectionWriter should ensure that large files are aligned to block boundaries; one possible behavior is to set a threshold size over which a file always triggers flushing the current block and starting a new block in the file stream.
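To illustrate the effect, here is a toy model of the packing behavior described above (not SDK code; the flush threshold is illustrative): plain concatenation makes block contents depend on upload order, while flushing before each large file yields the same set of blocks in any order.

    import hashlib

    BLOCK_SIZE = 64 * 2**20          # Keep's 64 MiB block limit
    THRESHOLD = 8 * 2**20            # illustrative "large file" cutoff

    def pack(files, align_large=False):
        """Pack (name, data) pairs into blocks; return each block's md5."""
        blocks, buf = [], b''
        for _, data in files:
            if align_large and len(data) >= THRESHOLD and buf:
                blocks.append(buf)   # flush the partial block before a large file
                buf = b''
            buf += data
            while len(buf) >= BLOCK_SIZE:
                blocks.append(buf[:BLOCK_SIZE])
                buf = buf[BLOCK_SIZE:]
        if buf:
            blocks.append(buf)
        return [hashlib.md5(b).hexdigest() for b in blocks]

    files = [('a', b'a' * 40 * 2**20), ('b', b'b' * 40 * 2**20), ('c', b'c' * 40 * 2**20)]
    reordered = list(reversed(files))

    print(set(pack(files)) & set(pack(reordered)))               # set() -- no shared blocks
    print(set(pack(files, True)) == set(pack(reordered, True)))  # True -- full dedup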
- Subject changed from [SDK] arv-put file packing breaks deduplication to [SDK] CollectionWriter file packing breaks deduplication
- Description updated (diff)
- Target version set to Arvados Future Sprints
- Subject changed from [SDK] CollectionWriter file packing breaks deduplication to [SDK] CollectionWriter file packing causes sub-optimal deduplication
I have written a related feature request, with a motivation different from deduplication (and I know some of you don't think it is important...), arguing that per-collection or per-file "streams" should be an option to arv-put: https://dev.arvados.org/issues/8992
Peter Grandi wrote:
I have written a related feature request, with a motivation different from deduplication (and I know some of you don't think it is important...), arguing that per-collection or per-file "streams" should be an option to arv-put: https://dev.arvados.org/issues/8992
I understand the current story description is technical enough that it's not easy to tell, but what this ticket proposes is to have the default behavior be at least close to the per-file behavior you request in #8992. We might continue to combine "small" files into a single block (where the definition of "small" is TBD, and may be configurable, but the default threshold would probably be between 4 and 16 MiB), but we would make sure larger files get stored as an independent set of blocks, for the deduplication benefits you've talked about there and on IRC.
If that was the default behavior of arv keep put, would that be enough to obviate the need for #8992?
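To make the proposed default concrete, a rough sketch of how files could map onto Keep blocks under that policy (the threshold, names, and helper are hypothetical, not the real arv-put code): files below the threshold keep sharing a block, files at or above it start on a block boundary.

    BLOCK_SIZE = 64 * 2**20
    THRESHOLD = 8 * 2**20            # "small" cutoff; actual value TBD / configurable

    def plan_blocks(files):
        """files: (name, size) pairs; return blocks as lists of (name, nbytes)."""
        blocks, current, used = [], [], 0
        for name, size in files:
            if size >= THRESHOLD and current:
                blocks.append(current)      # large file starts on a block boundary
                current, used = [], 0
            remaining = size
            while remaining:
                take = min(remaining, BLOCK_SIZE - used)
                current.append((name, take))
                used += take
                remaining -= take
                if used == BLOCK_SIZE:
                    blocks.append(current)
                    current, used = [], 0
        if current:
            blocks.append(current)
        return blocks

    layout = plan_blocks([('small1', 2**20), ('small2', 2**20), ('big', 100 * 2**20)])
    for i, block in enumerate(layout):
        print(i, block)
    # 0: small1 and small2 share one block; 1-2: big occupies its own blocks,
    # so re-uploading big alongside different files reproduces the same block contents.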
Peter Amstutz wrote:
CollectionWriter should ensure that large files are aligned to block boundaries
Do we actually have to change the behavior of CollectionWriter? Or would it be sufficient to change the behavior of arv keep put, either by having it flush CollectionWriter more often, or by upgrading it to use the Collection class instead?
I'd rather just change arv keep put, and not muck with CollectionWriter. If there's some reason that's not good enough from a user perspective, it'd be good to know.
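For what it's worth, a very rough, untested sketch of that approach at the arv-put level, assuming CollectionWriter's start_new_file(), write(), flush_data() and manifest_text() behave as their names suggest; the threshold and helper function are hypothetical.

    import os
    import arvados

    THRESHOLD = 8 * 2**20            # hypothetical "large file" cutoff

    def upload_paths(paths):
        writer = arvados.CollectionWriter()
        for path in paths:
            large = os.path.getsize(path) >= THRESHOLD
            if large:
                writer.flush_data()  # start this file's data in a fresh block
            writer.start_new_file(os.path.basename(path))
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(2**20), b''):
                    writer.write(chunk)
            if large:
                writer.flush_data()  # keep following small files out of this block
        return writer.manifest_text()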
- Status changed from New to Duplicate
- Target version deleted (Arvados Future Sprints)