Idea #8791

Updated by Peter Amstutz about 8 years ago

The Python SDK CollectionWriter class used by arv-put concatenates files when writing the file stream. As a result, if the same files are written in a different order, the resulting blocks will frequently have different hashes (because file content falls across block boundaries differently), which breaks de-duplication. A user encountered this problem when re-uploading a data set that required a significant fraction of the underlying storage capacity.
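The effect can be reproduced with a toy model. The sketch below is not the CollectionWriter code: `pack_concatenated` and `block_hashes` are hypothetical helpers, the 8-byte block size is chosen only to keep the example readable, and MD5 is used as the block hash for illustration.

```python
import hashlib

BLOCK_SIZE = 8  # toy block size for illustration; real blocks are far larger


def pack_concatenated(files, block_size=BLOCK_SIZE):
    """Concatenate file contents and cut the stream into fixed-size blocks."""
    stream = b"".join(files)
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]


def block_hashes(blocks):
    """Return the set of MD5 digests identifying the blocks."""
    return {hashlib.md5(block).hexdigest() for block in blocks}


small = b"abc"             # 3 bytes
large = b"0123456789" * 2  # 20 bytes, spans several toy blocks

# Same two files, written in opposite orders.
first = block_hashes(pack_concatenated([small, large]))
second = block_hashes(pack_concatenated([large, small]))

shared = first & second
print(f"blocks shared between the two uploads: {len(shared)}")
```

In this model the small file shifts every block boundary of the large file by three bytes, so the two orderings have no blocks in common and nothing can be de-duplicated.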

CollectionWriter should ensure that large files are aligned to block boundaries. One possible behavior is to set a threshold size above which a file always triggers a flush of the current block and starts a new block in the file stream.
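A minimal sketch of that threshold behavior follows, again as a toy model rather than a patch to CollectionWriter. The names `pack_aligned` and `ALIGN_THRESHOLD`, the 8-byte block size, and the use of MD5 as the block hash are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 8       # toy block size for illustration
ALIGN_THRESHOLD = 8  # hypothetical: files larger than this start a new block


def pack_aligned(files, block_size=BLOCK_SIZE, threshold=ALIGN_THRESHOLD):
    """Pack files into blocks, flushing the current block before each large file."""
    blocks = []
    current = b""
    for data in files:
        # A large file forces a flush so it begins on a block boundary.
        if len(data) > threshold and current:
            blocks.append(current)
            current = b""
        current += data
        # Emit every full block accumulated so far.
        while len(current) >= block_size:
            blocks.append(current[:block_size])
            current = current[block_size:]
    if current:
        blocks.append(current)  # final partial block
    return blocks


def block_hashes(blocks):
    """Return the set of MD5 digests identifying the blocks."""
    return {hashlib.md5(block).hexdigest() for block in blocks}


small = b"abc"             # 3 bytes, below the threshold
large = b"0123456789" * 2  # 20 bytes, above the threshold

first = block_hashes(pack_aligned([small, large]))
second = block_hashes(pack_aligned([large, small]))

shared = first & second
print(f"blocks shared between the two uploads: {len(shared)}")
```

With alignment, the two full blocks of the large file hash identically in both orderings. The trailing partial block can still differ, because this sketch flushes only before a large file and lets later small files share its last block; whether to also flush after a large file is a separate design choice that trades de-duplication against a larger number of short blocks.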
