Feature #8992
closed
arv-put: option to create a "stream" per-file (default) or per-collection
Description
As in https://dev.arvados.org/issues/8769, I was dismayed to learn that arv-put by default computes block hashes on a per-stream (usually per-collection) basis. In the general case, the same file will therefore have a different hash list depending on how it was uploaded: that is, on its position in a "stream" containing the contents of all files in a collection, or on whether it is the only member of a "stream".
This has the surprising effect that the same file, uploaded twice to the same or different Keep instances by arv-put, may have, and usually will have, different hash lists associated with it. Apparently arv-mount does not do that by default.
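To illustrate the effect, here is a minimal sketch, not arv-put's actual code. It assumes that blocks are cut at fixed offsets of the byte stream and that each block is identified by its MD5 digest. The helper name block_hashes, the toy 8-byte block size (standing in for 64MiB), and the file contents are all made up for the example.

```python
import hashlib

def block_hashes(data, block_size):
    """Cut data at fixed offsets and return the MD5 hex digest of each block."""
    return [
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

BLOCK_SIZE = 8  # toy value for illustration; a real Keep block is 64MiB

file_a = b"aaaaa"           # hypothetical 5-byte file
file_b = bytes(range(12))   # hypothetical 12-byte file

# One stream per file: file_b is cut at its own offsets 0 and 8.
per_file_b = block_hashes(file_b, BLOCK_SIZE)

# One stream per collection: file_b starts at offset 5 of the stream,
# so every block boundary falls at a different place inside it.
per_collection = block_hashes(file_a + file_b, BLOCK_SIZE)

print("file_b alone:     ", per_file_b)
print("file_a + file_b:  ", per_collection)
print("shared block hashes:", set(per_file_b) & set(per_collection))
```

With these inputs none of the block hashes computed for file_b alone appear in the hash list of the combined stream, even though file_b's bytes are identical in both cases.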
Since concatenating the content of several files into one "stream" is an optimization for small files, similar to OS/MVS "partitioned datasets" or UNIX "ar" archives, it would be better if it were optional, and not the default.
This is in part because, given the huge latencies (which include manifest download time), Keep is not that suited to storing small files; in part because most Keep storage backends have granularities well below the 64MiB size of a Keep block; in part because it is not fully documented; and in part because reproducibility of hashes across time and Keep instances would be nice.
It would also be nice to have a direct option, similar to --md5sum in arv-get, that computes the list of hashes for a local file without uploading it at all, and to add a corresponding option to arv-get that prints that list in exactly the same format.
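As a rough sketch of what such an option might compute, the following reads a local file in 64MiB blocks and lists one entry per block without contacting Keep. It assumes a one-stream-per-file layout and an "md5+size" form for each entry. The function name local_block_locators and the output format are hypothetical, not an existing arv-put or arv-get feature.

```python
import hashlib

KEEP_BLOCK_SIZE = 64 * 1024 * 1024  # 64MiB, the Keep block size mentioned above

def local_block_locators(path, block_size=KEEP_BLOCK_SIZE):
    """Return one 'md5hex+size' string per block of a local file.

    Nothing is uploaded: the file is only read and hashed locally.
    """
    locators = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:  # end of file
                break
            digest = hashlib.md5(block).hexdigest()
            locators.append("%s+%d" % (digest, len(block)))
    return locators
```

Calling local_block_locators("somefile.dat") and printing one entry per line would give a list that depends only on the file's own content, so it could be compared across uploads and Keep instances.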