Idea #21942

Updated by Peter Amstutz 5 months ago

User has a collection which consists of several thousand files that are 100-200 bytes each. 

 Each file was sourced from a different workflow output collection. 

 When these files were created, small file packing was applied, so each 100-200 byte file is embedded in a data block that is 50-60 MiB. 

 As a result, going through this collection and reading each file is much slower than expected, because behind the scenes Arvados must fetch a 50-60 MiB block to extract the 100-200 byte slice that makes up the file. 

 Think about ways to behave more efficiently when less than 50% of a given block is referenced by a collection. 

 A couple ideas: 

 # Support range requests (interacts poorly with caching, though) 
 # When constructing/saving a collection that would be like this, do an "optimize" pass to rewrite/repack files 
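
 To make the "< 50% of a block is referenced" condition concrete, here is a rough sketch of how an "optimize" pass might detect candidate blocks from a manifest. It assumes the usual manifest text layout (stream name, then block locators of the form @hash+size+hints@, then @position:size:filename@ file tokens). The function names @block_utilization@ and @sparse_blocks@ are hypothetical, not part of any Arvados SDK, and this only identifies sparse blocks; it does not repack anything.

```python
from collections import defaultdict


def block_utilization(manifest_text):
    """Return {"hash+size": fraction of that block's bytes referenced by files}.

    Hypothetical helper: parses manifest text directly, no Arvados SDK calls.
    """
    referenced = defaultdict(list)  # block key -> list of (start, end) ranges
    sizes = {}                      # block key -> block size in bytes
    for line in manifest_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        blocks = []   # (block key, offset of block within stream, block size)
        files = []    # (position within stream, length)
        offset = 0
        for tok in tokens[1:]:      # tokens[0] is the stream name
            parts = tok.split(":", 2)
            if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
                files.append((int(parts[0]), int(parts[1])))
            else:
                fields = tok.split("+")
                size = int(fields[1])
                key = fields[0] + "+" + fields[1]  # drop hints such as signatures
                blocks.append((key, offset, size))
                sizes[key] = size
                offset += size
        # Map each file segment onto the blocks it overlaps.
        for pos, length in files:
            end = pos + length
            for key, block_off, block_size in blocks:
                lo = max(pos, block_off)
                hi = min(end, block_off + block_size)
                if lo < hi:
                    referenced[key].append((lo - block_off, hi - block_off))
    result = {}
    for key, size in sizes.items():
        covered = 0
        last_end = 0
        # Merge overlapping ranges so shared bytes are counted once.
        for start, end in sorted(referenced.get(key, [])):
            start = max(start, last_end)
            if end > start:
                covered += end - start
                last_end = end
        result[key] = covered / size if size else 0.0
    return result


def sparse_blocks(manifest_text, threshold=0.5):
    """Return sorted block keys whose referenced fraction is below threshold."""
    usage = block_utilization(manifest_text)
    return sorted(key for key, frac in usage.items() if frac < threshold)
```

 For the collection described above, nearly every block would come back as sparse, since each one contributes only 100-200 bytes out of 50-60 MiB. A repack pass could use that list to decide which file segments to copy into new, densely packed blocks.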
