Separating files from collections

Problem

Users cannot find out which collections contain a given file. This is a gap in our provenance system: a user who manipulates collections through Workbench rather than by running jobs can easily lose track of where a file came from. A user should be able to query the system for all instances of a given file, across collections and renames.

Proposed Solution

Identify files by the hash of the file contents. Computing the hash means reading every block of the file, which is somewhat expensive, but the work could be done by a background service such as DataManager.
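As a rough illustration of content hashing, the sketch below hashes a file from a stream of byte chunks, such as the file's data read block by block. The function name file_content_hash and the choice of SHA-256 are assumptions made for this example; the proposal does not specify a hash algorithm.

```python
import hashlib


def file_content_hash(chunks):
    """Return a hex digest of a file's contents.

    `chunks` is any iterable of bytes objects, for example the file's
    data read one block at a time. SHA-256 is an assumption made for
    this sketch, not a decision taken by the proposal.
    """
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()
```

Because the digest is fed incrementally, the result depends only on the bytes of the file and not on how they are split into chunks, so the same file is recognized even when it is packed into blocks differently in different collections.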

Introduce two new tables, tentatively called "files" and "collection_entries".

The "files" table contains the manifest with the file blocks and segments, keyed on the hash of the file contents.

The "collection_entries" table contains a collection uuid, a file name, and a key into the "files" table.

We can then search the collection_entries table for files that match a given file name pattern, or find every collection that contains a file with the desired content. We could also make metadata assertions about a file independently of the collection(s) in which the file is located.
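The two tables and the two kinds of search can be sketched with SQLite as follows. The column names (content_hash, file_manifest, collection_uuid, file_name) and the helper functions are hypothetical, chosen only for this example; the real schema would live in the API server database.

```python
import sqlite3


def create_schema(conn):
    """Create the two proposed tables (column names are assumptions)."""
    conn.executescript("""
        CREATE TABLE files (
            content_hash  TEXT PRIMARY KEY,   -- hash of the file contents
            file_manifest TEXT NOT NULL       -- blocks and segments of the file
        );
        CREATE TABLE collection_entries (
            collection_uuid TEXT NOT NULL,
            file_name       TEXT NOT NULL,
            content_hash    TEXT NOT NULL REFERENCES files(content_hash),
            PRIMARY KEY (collection_uuid, file_name)
        );
    """)


def collections_containing(conn, content_hash):
    """List (collection_uuid, file_name) for every instance of a file."""
    rows = conn.execute(
        "SELECT collection_uuid, file_name FROM collection_entries "
        "WHERE content_hash = ? ORDER BY collection_uuid, file_name",
        (content_hash,),
    )
    return rows.fetchall()


def entries_matching_name(conn, pattern):
    """List entries whose file name matches a shell-style pattern."""
    rows = conn.execute(
        "SELECT collection_uuid, file_name, content_hash "
        "FROM collection_entries WHERE file_name GLOB ? "
        "ORDER BY collection_uuid, file_name",
        (pattern,),
    )
    return rows.fetchall()
```

Since entries are keyed by content hash rather than by name, collections_containing finds a file even after it has been renamed in another collection.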

Extensions

If manifests are required to be normalized, we could dispense with manifest_text in the collections table altogether and recreate the manifest text on demand from the collection_entries table.

When uploading, the client could hash the local file and check whether a file with that content already exists before sending any data.

We could add a "blocks" table that links Keep blocks to records in the "files" table (and to collections through an additional join). This would enable reverse lookup from a block to the collections that contain it, which determines whether a user is allowed to know that the block already exists. It would in turn permit "should I upload this block or not?" queries, so that arv-put can skip sending blocks that are already present on the remote instance.
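A minimal sketch of such a query, again using SQLite, is shown below. All table and column names are assumptions, and the readable_collections table is a hypothetical stand-in for the real permission system; it simply lists which collections each user may read.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not part of the proposal.
SCHEMA = """
CREATE TABLE blocks (
    block_hash   TEXT NOT NULL,
    content_hash TEXT NOT NULL,      -- key into the files table
    PRIMARY KEY (block_hash, content_hash)
);
CREATE TABLE collection_entries (
    collection_uuid TEXT NOT NULL,
    file_name       TEXT NOT NULL,
    content_hash    TEXT NOT NULL,
    PRIMARY KEY (collection_uuid, file_name)
);
CREATE TABLE readable_collections (  -- stand-in for permission checks
    user_uuid       TEXT NOT NULL,
    collection_uuid TEXT NOT NULL,
    PRIMARY KEY (user_uuid, collection_uuid)
);
"""


def blocks_to_upload(conn, user_uuid, block_hashes):
    """Return the blocks the user still has to send, in input order.

    A block may be skipped only if it belongs to a file in at least one
    collection the user can read.
    """
    missing = []
    for block_hash in block_hashes:
        row = conn.execute(
            """
            SELECT 1
            FROM blocks b
            JOIN collection_entries e ON e.content_hash = b.content_hash
            JOIN readable_collections r
              ON r.collection_uuid = e.collection_uuid
            WHERE b.block_hash = ? AND r.user_uuid = ?
            LIMIT 1
            """,
            (block_hash, user_uuid),
        ).fetchone()
        if row is None:
            missing.append(block_hash)
    return missing
```

In this sketch a block that exists only in collections the user cannot read is reported as needing upload, so the answer never reveals the existence of data the user is not permitted to see.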