Idea #22458
Updated by Peter Amstutz 8 days ago
For provenance, I would like to keep collection records around. However, in some cases I don't want to store the intermediate data. For example, I might have processing steps where the output is as large as or larger than the input data.

I propose allowing @replication_desired@ to be set to zero to indicate that the underlying blocks can be trashed by keep-balance without being reported as "missing" blocks. Once set to zero, @replication_desired@ cannot be increased. I call these "ghost collections". (Other names that just came to me are "dehydrated" or "freeze-dried" collections.)

Fetching a ghost collection returns an unsigned manifest. Ghost collection records should behave similarly to frozen projects: read-only, except that they can be moved between projects (it might be ok to edit metadata such as name and properties as well).

Similar to @trash_at@ / @delete_at@, it would also be nice to have a @ghost_at@ field, and a corresponding @output_ghost_ttl@ on container requests that lets you specify that a collection should be ghosted at some point in the future. This would be helpful for keeping intermediate results around for a little while, but not forever.

Clients such as Workbench, keep-web, and the Python SDK should be made aware of ghost collections, so that they return a sensible error if the user tries to read a file, instead of a scary "failed to read block" error.

If the ghost collection exists on another cluster readable by the user, it should be possible to automatically fetch the blocks via federation, or to rematerialize/rehydrate the collection by downloading all the blocks from somewhere else and rewriting the manifest with current block signatures as proof that the collection is readable again.
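
To make the proposed rules concrete, here is a minimal Python sketch of how the ghost state could be modeled and validated. It is only an illustration, not the actual API server or SDK code: @CollectionRecord@, @GhostCollectionError@, @GHOST_MUTABLE_FIELDS@ and all of the helper functions are hypothetical names, @output_ghost_ttl@ is assumed to be in seconds, and treating @name@ and @properties@ as editable follows the "might be ok" option above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

# Assumption: fields that stay editable on a ghost collection. The proposal
# only guarantees moving between projects (owner_uuid); name and properties
# are listed as "might be ok".
GHOST_MUTABLE_FIELDS = {"owner_uuid", "name", "properties"}


class GhostCollectionError(Exception):
    """Hypothetical error for operations a ghost collection does not allow."""


@dataclass
class CollectionRecord:
    """Hypothetical, simplified stand-in for a collection record."""
    uuid: str
    owner_uuid: str
    name: str
    replication_desired: int = 2
    ghost_at: Optional[datetime] = None
    properties: dict = field(default_factory=dict)

    @property
    def is_ghost(self) -> bool:
        # A collection is a ghost once replication_desired has been set to zero.
        return self.replication_desired == 0


def ghost_at_from_ttl(finished_at: datetime, output_ghost_ttl: int) -> datetime:
    """Turn a TTL (assumed to be in seconds) into an absolute ghost_at time."""
    return finished_at + timedelta(seconds=output_ghost_ttl)


def apply_ghost_at(record: CollectionRecord, now: datetime) -> None:
    """Ghost the record once its ghost_at time has passed."""
    if record.ghost_at is not None and record.ghost_at <= now:
        record.replication_desired = 0


def validate_update(record: CollectionRecord, changes: dict) -> None:
    """Raise GhostCollectionError if `changes` is not allowed on a ghost."""
    if not record.is_ghost:
        return
    if changes.get("replication_desired", 0) != 0:
        raise GhostCollectionError(
            "replication_desired cannot be increased once set to zero")
    locked = set(changes) - GHOST_MUTABLE_FIELDS - {"replication_desired"}
    if locked:
        raise GhostCollectionError(
            f"ghost collection is read-only; cannot change {sorted(locked)}")


def check_readable(record: CollectionRecord) -> None:
    """Fail early with a clear message instead of a block read error."""
    if record.is_ghost:
        raise GhostCollectionError(
            f"{record.uuid} is a ghost collection; its data blocks are no "
            "longer stored")
```

In this sketch a client would call @check_readable@ before attempting to fetch any block, and the server side would call @validate_update@ before accepting a change. Rehydration would need a separate, explicit path that rewrites the manifest with fresh signatures, which is deliberately left out here.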