Feature #8992

arv-put: option to create a "stream" per-file (default) or per-collection

Added by Peter Grandi about 6 years ago. Updated over 2 years ago.

As in https://dev.arvados.org/issues/8769, I was dismayed to learn that arv-put by default computes block hashes on a per-stream (usually per-collection) basis. In the general case this means that the same file will have a different hash list depending on how it was uploaded: as the only member of a "stream", or at some position in a "stream" containing the contents of all files in a collection.

This has the surprising effect that the same file uploaded twice by arv-put, to the same or to different Keep instances, may have, and usually will have, different hash lists associated with it. Apparently arv-mount does not do that by default.
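To illustrate the effect, here is a minimal Python sketch. It is not arv-put's actual code: it assumes blocks are identified by the MD5 of their content plus their size, and it uses a tiny 8-byte block size (instead of 64MiB) so the shift is visible. The function names and sample data are hypothetical.

```python
import hashlib

BLOCK_SIZE = 8  # illustrative only; real Keep blocks are up to 64MiB


def block_locators(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and return 'md5+size' strings."""
    locators = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        locators.append(f"{hashlib.md5(block).hexdigest()}+{len(block)}")
    return locators


def per_file_locators(files):
    """Hash each file on its own: one 'stream' per file."""
    return {name: block_locators(content) for name, content in files}


def per_stream_locators(files):
    """Concatenate all files into one 'stream', then hash its blocks."""
    return block_locators(b"".join(content for _, content in files))


sample = b"0123456789abcdef"  # 16 bytes: exactly two blocks when alone

# The same content, uploaded alone and then packed after a 3-byte file.
alone = per_stream_locators([("sample", sample)])
packed = per_stream_locators([("readme", b"hi\n"), ("sample", sample)])

print("alone: ", alone)
print("packed:", packed)
```

In the packed case the 3-byte neighbour shifts every block boundary, so none of the blocks covering "sample" match the blocks produced when it is hashed alone. Hashing per file gives the same list regardless of what else is in the collection.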

Since concatenating the content of several files into one "stream" is an optimization for small files, similar to OS/MVS "partitioned datasets", or UNIX "ar" archives, it would be better if it were optional, and not the default either.

This is in part because, given the huge latencies (which include manifest download time), Keep is not that suited to storing small files; in part because most Keep storage backends have granularities well below the 64MiB size of a Keep block; in part because the packing behavior is not fully documented; and in part because reproducibility of hashes across time and Keep instances would be nice.

It would also be nice to have a direct option similar to --md5sum in arv-get that computes the list of hashes for a local file without uploading it at all, and to add a corresponding option to arv-get that prints that list in exactly the same format.
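A rough Python sketch of what such an option could compute follows. It is only an illustration under stated assumptions: the file is treated as its own "stream", each block is identified as MD5 plus size, and the block size is 64MiB. The helper name local_block_locators is hypothetical and is not an existing arv-put or arv-get option.

```python
import hashlib

KEEP_BLOCK_SIZE = 64 * 1024 * 1024  # 64MiB, the Keep block size mentioned above


def local_block_locators(path, block_size=KEEP_BLOCK_SIZE):
    """Return the 'md5+size' list for a local file without uploading it.

    Assumes the file is hashed as its own stream, i.e. block
    boundaries start at offset 0 of the file.
    """
    locators = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:  # end of file
                break
            locators.append(f"{hashlib.md5(block).hexdigest()}+{len(block)}")
    return locators
```

Printing one entry per line, for example with print("\n".join(local_block_locators("some-file"))), would give a list that can be compared across uploads and Keep instances, provided the upload also used one stream per file.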

Related issues

Related to Arvados - Story #8791: [SDK] CollectionWriter file packing causes sub-optimal deduplication (Duplicate)


#1 Updated by Peter Amstutz over 2 years ago

  • Status changed from New to Closed
