Feature #13062

Updated by Peter Amstutz over 1 year ago

Reduce the memory footprint of the Collection class so that arv-mount and arvados-cwl-runner use less memory and can run on smaller, cheaper nodes.

General approach: instead of parsing the manifest once and creating Python objects for every directory and file, re-parse it and create Python objects on demand.

Possible strategy:

* Initial manifest parsing creates an index that maps each directory path to one or more manifest streams (by offset or by using memoryview) which describe the contents of that directory.

* When the contents of a Collection or Subcollection are needed, look up the stream(s) associated with that directory in the index and parse them.

* Consider doing something similar at the individual file level: only load "segments" on demand. This may come at the cost of higher overhead if the client turns out to visit most of the files in a given directory anyway.

* Make it possible for a caching strategy to evict loaded collection contents / file segments.
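
The first two bullets could be sketched roughly as follows. This is only an illustration, not existing SDK code: `build_stream_index` and `parse_directory` are hypothetical names, and the sketch assumes a normalized manifest in which each line is one stream of the form `<stream name> <block locators> <position:size:filename tokens>`, with spaces in file names escaped as `\040`.

```python
from collections import defaultdict


def build_stream_index(manifest: bytes) -> dict:
    """Map each stream (directory) name to the byte ranges of its manifest lines.

    Only the first token of each line is decoded; nothing else is parsed yet.
    """
    index = defaultdict(list)
    pos = 0
    while pos < len(manifest):
        end = manifest.find(b"\n", pos)
        if end == -1:
            end = len(manifest)
        if end > pos:  # skip empty lines
            sp = manifest.find(b" ", pos, end)
            name_end = sp if sp != -1 else end
            # Stream names are kept in their escaped form (assumption).
            name = manifest[pos:name_end].decode("utf-8")
            index[name].append((pos, end))
        pos = end + 1
    return dict(index)


def parse_directory(manifest: bytes, index: dict, path: str) -> dict:
    """Parse only the streams for `path`.

    Returns {filename: [(position_in_stream, size), ...]}.
    """
    files = {}
    view = memoryview(manifest)
    for start, end in index.get(path, []):
        tokens = bytes(view[start:end]).decode("utf-8").split(" ")
        for tok in tokens[1:]:
            parts = tok.split(":", 2)
            # File tokens look like "<position>:<size>:<name>";
            # block locators are assumed to contain no colons.
            if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
                name = parts[2].replace("\\040", " ")
                files.setdefault(name, []).append((int(parts[0]), int(parts[1])))
    return files


manifest = (
    b". acbd18db4cc2f85cedef654fccc4a4d8+3 0:3:foo.txt\n"
    b"./sub 37b51d194a7513e45b56f6524f2d51f2+3 0:3:bar.txt\n"
)
index = build_stream_index(manifest)
sub_files = parse_directory(manifest, index, "./sub")
```

With this layout the only per-directory state held permanently is a list of offset pairs. The parsed file entries can be dropped and rebuilt from the manifest bytes at any time.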


* Can't evict from the cache anything that has been returned to the (Python SDK) user unless we can determine that it is no longer being held (may require a reference counting scheme).

* For FUSE, stable inode assignment like https://dev.arvados.org/issues/12664#note-2 would allow us to evict things from the FUSE cache that are still known to the kernel.
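
One possible way to handle the "still held by the user" concern without a hand-written reference counting scheme is to lean on Python's own reference tracking via `weakref`. The sketch below is an assumption about how this might look, not the SDK's design: `DirectoryContents`, `EvictingCache` and `loader` are hypothetical names. The cache keeps strong references only to the most recently used entries. Evicted entries remain reachable through a `WeakValueDictionary` for as long as a caller still holds them, so a caller never sees two different objects for the same path.

```python
import weakref
from collections import OrderedDict


class DirectoryContents:
    """Hypothetical stand-in for the parsed contents of one directory."""

    def __init__(self, path, files):
        self.path = path
        self.files = files


class EvictingCache:
    """LRU cache of parsed directories that never duplicates a live object."""

    def __init__(self, loader, max_entries=2):
        self._loader = loader              # callable: path -> DirectoryContents
        self._max = max_entries            # assumed to be >= 1
        self._strong = OrderedDict()       # recently used entries, kept alive
        self._weak = weakref.WeakValueDictionary()  # everything still alive

    def get(self, path):
        obj = self._strong.get(path)
        if obj is None:
            # Evicted earlier but still held by a caller? Reuse that object.
            obj = self._weak.get(path)
            if obj is None:
                obj = self._loader(path)   # re-parse on demand
                self._weak[path] = obj
            self._strong[path] = obj
        self._strong.move_to_end(path)
        # Drop strong references to the least recently used entries.
        while len(self._strong) > self._max:
            self._strong.popitem(last=False)
        return obj


calls = []


def loader(path):
    calls.append(path)
    return DirectoryContents(path, {})


cache = EvictingCache(loader, max_entries=1)
a = cache.get("./a")
cache.get("./b")                  # evicts "./a" from the strong cache
same_a = cache.get("./a")         # caller still holds `a`, so no reload
```

Memory for an evicted entry is only reclaimed once the last user reference goes away, so this bounds what the cache itself pins rather than total usage.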