Idea #21942
Updated by Peter Amstutz 5 months ago
User has a collection which consists of several thousand files that are 100-200 bytes each.
Each file was sourced from a different workflow output collection.
When these files were created, small file packing was applied, so each 100-200 byte file ended up embedded in a data block that is 50-60 MiB.
As a result, iterating over this collection and reading each file is much slower than expected: behind the scenes, Arvados must fetch a 50-60 MiB block just to extract the 100-200 byte slice making up the file.
Think about ways to behave more efficiently when less than 50% of a given block is referenced by a collection.
A couple ideas:
# Support range requests (interacts poorly with caching, though)
# When constructing/saving a collection that would end up like this, do an "optimize" pass that rewrites/repacks the files into new, densely referenced blocks
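Either approach needs a way to detect under-referenced blocks first. Below is a minimal sketch (not part of the Arvados SDK) of how a collection's manifest text could be scanned to measure what fraction of each block is actually referenced by file segments. It assumes the manifest v1 text format (`. <locator>... <pos>:<size>:<filename>...` per stream line, with locators of the form `<md5>+<size>[+hints]`); the function name `block_usage` is hypothetical.

```python
# Hypothetical sketch: estimate how much of each data block a manifest
# actually references, to flag candidates for repacking.
import re

# A locator starts with a 32-hex-digit md5 followed by "+<size>".
LOCATOR_RE = re.compile(r'^[0-9a-f]{32}\+(\d+)')

def block_usage(manifest_text):
    """Return {locator: (referenced_bytes, block_size)} for a manifest.

    Overlapping file segments are counted once per reference, so the
    referenced-bytes figure is an upper bound for deduplicated data.
    """
    usage = {}
    for line in manifest_text.strip().splitlines():
        tokens = line.split()
        # tokens[0] is the stream name; then locators; then file segments.
        blocks = []   # (locator, offset_in_stream, block_size)
        offset = 0
        i = 1
        while i < len(tokens):
            m = LOCATOR_RE.match(tokens[i])
            if not m:
                break  # first non-locator token starts the file segments
            size = int(m.group(1))
            blocks.append((tokens[i], offset, size))
            usage.setdefault(tokens[i], [0, size])
            offset += size
            i += 1
        # Remaining tokens are pos:size:filename segments over the
        # concatenated blocks of this stream.
        for seg in tokens[i:]:
            pos, size, _name = seg.split(':', 2)
            pos, size = int(pos), int(size)
            # Attribute the referenced byte range to overlapping blocks.
            for loc, bstart, bsize in blocks:
                lo = max(pos, bstart)
                hi = min(pos + size, bstart + bsize)
                if hi > lo:
                    usage[loc][0] += hi - lo
    return {loc: tuple(v) for loc, v in usage.items()}
```

With this, the "optimize" pass in idea 2 could repack any block where `referenced / block_size < 0.5`, e.g. `[loc for loc, (ref, total) in block_usage(m).items() if ref < total * 0.5]`.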