Separating files from collections » History » Version 1

Peter Amstutz, 01/15/2015 09:17 PM

Indexing files in collections


Users cannot find out which collections contain a given file. This is a hole in our provenance system, since a user can easily lose track of where a file came from if she manipulates collections through Workbench rather than by running jobs. A user should be able to query the system for all instances of a given file, across collections and renames.

Proposed Solution

Identify a file's contents by its segments and blocks. Construct a normalized manifest containing only that file, placed in the "." stream under a known name such as "data". Take the hash of this single-file normalized manifest and use that hash to represent the file.
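
As a rough sketch of the identification step, the following Python builds the single-file manifest and hashes it. Several things here are assumptions rather than part of the proposal: the use of MD5 as the hash, the function names single_file_manifest and file_hash, the choice to drop locator hints (anything after the block size) so that signed and unsigned locators give the same identifier, and segments expressed as (offset, length) pairs relative to the concatenated blocks.

```python
import hashlib
import re

# Keep only "<md5>+<size>" from a locator. Dropping trailing hints such as
# permission signatures is an assumption made so the identifier stays stable.
_LOCATOR = re.compile(r"^([0-9a-f]{32}\+\d+)")


def single_file_manifest(locators, segments, name="data"):
    """Return manifest text holding one file in the "." stream.

    locators: block locators covering the file, in order.
    segments: (offset, length) pairs relative to the concatenated blocks.
    """
    blocks = []
    for loc in locators:
        match = _LOCATOR.match(loc)
        if match is None:
            raise ValueError("bad locator: %r" % (loc,))
        blocks.append(match.group(1))
    tokens = ["%d:%d:%s" % (off, length, name) for off, length in segments]
    return " ".join(["."] + blocks + tokens) + "\n"


def file_hash(locators, segments):
    """Hash the single-file manifest (MD5 is an assumption of this sketch)."""
    text = single_file_manifest(locators, segments)
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

With this scheme the original file name never enters the hash, so the same blocks and segments yield the same identifier regardless of where the file lives or what it is called.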

(This does mean that the same file contents, allocated differently and so resulting in a different set of blocks, would not be considered the "same" file. An alternative approach to identifying files uniquely would be to calculate the hash of the file contents directly. This is somewhat expensive, but could be accomplished by a background indexing service such as DataManager.)

Introduce two new tables, tentatively called "files" and "collection_entries".

The "files" table contains the manifest with the file blocks and segments, keyed on the hash of that manifest.

The "collection_entries" table contains a collection uuid, a file name, and a hash that keys into the "files" table.

We can then search the collection_entries table to find files that match a given file name pattern, or to find the collections whose files have the desired content. We could also make metadata assertions about files independently of the collection(s) in which the file is located.
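
To make the two tables and the two kinds of search concrete, here is a minimal sketch using SQLite from Python. The column names (file_hash, manifest_text, collection_uuid, file_name), the index, and the helper functions are illustrative assumptions; the proposal only names the tables and describes their contents.

```python
import sqlite3


def open_index():
    """Create an in-memory database with the two proposed tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE files (
            file_hash     TEXT PRIMARY KEY,  -- hash of the single-file manifest
            manifest_text TEXT NOT NULL      -- blocks and segments of the file
        );
        CREATE TABLE collection_entries (
            collection_uuid TEXT NOT NULL,
            file_name       TEXT NOT NULL,
            file_hash       TEXT NOT NULL REFERENCES files (file_hash),
            PRIMARY KEY (collection_uuid, file_name)
        );
        CREATE INDEX collection_entries_by_hash
            ON collection_entries (file_hash);
    """)
    return db


def find_by_name(db, pattern):
    """Entries whose file name matches an SQL LIKE pattern, e.g. '%.fastq'."""
    rows = db.execute(
        "SELECT collection_uuid, file_name FROM collection_entries "
        "WHERE file_name LIKE ? ORDER BY collection_uuid, file_name",
        (pattern,))
    return rows.fetchall()


def find_by_content(db, file_hash):
    """Every (collection, name) under which the given file content appears."""
    rows = db.execute(
        "SELECT collection_uuid, file_name FROM collection_entries "
        "WHERE file_hash = ? ORDER BY collection_uuid, file_name",
        (file_hash,))
    return rows.fetchall()
```

Because find_by_content matches on the hash alone, it returns the file under every name it has been given, which is the "across collections and renames" query described in the problem statement.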


If manifests are required to be normalized, we could dispense with manifest_text altogether in the Collections table and recreate the manifest text on demand based on CollectionEntries.
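
A sketch of what regenerating manifest text from collection entries might look like follows. It assumes each entry has already been joined to its single-file manifest, that file names use "/" to separate directories, and that stored locators are bare hash+size tokens. The function name rebuild_manifest is hypothetical. The sketch does not merge adjacent segments, deduplicate shared blocks, or escape special characters in names, all of which a full normalizer would need.

```python
import posixpath
from collections import defaultdict


def rebuild_manifest(entries):
    """Rebuild collection manifest text from (file_name, file_manifest) pairs.

    file_manifest is single-file manifest text of the form
    ". <block> [<block> ...] <offset>:<length>:data\n".
    """
    # Group files by stream: "dir/file" goes to stream "./dir".
    streams = defaultdict(list)
    for file_name, file_manifest in entries:
        directory, base = posixpath.split(file_name)
        stream = "./" + directory if directory else "."
        streams[stream].append((base, file_manifest))

    lines = []
    for stream in sorted(streams):
        blocks, file_tokens, base_offset = [], [], 0
        for base, file_manifest in sorted(streams[stream]):
            tokens = file_manifest.split()[1:]  # drop the "." stream name
            file_blocks = [t for t in tokens if ":" not in t]
            segments = [t for t in tokens if ":" in t]
            for seg in segments:
                off, length, _ = seg.split(":", 2)
                # Shift each segment past the blocks already in this stream.
                file_tokens.append(
                    "%d:%s:%s" % (base_offset + int(off), length, base))
            blocks.extend(file_blocks)
            base_offset += sum(int(b.split("+")[1]) for b in file_blocks)
        lines.append(" ".join([stream] + blocks + file_tokens) + "\n")
    return "".join(lines)
```

Sorting streams and file names gives a deterministic result, so two collections with the same entries would produce identical text under this sketch.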

We could add a "blocks" table that links Keep blocks to records in the "files" table (and to collections through an additional join). This would enable reverse lookup of blocks, so the system can determine whether a user is allowed to know that a block already exists. That in turn would permit "should I upload this block or not?" queries, letting arv-put skip sending blocks that are already present on the remote instance.
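
The reverse lookup could be sketched as below, again with SQLite from Python. The blocks table layout, the should_upload helper, and the idea of passing in the set of collection uuids the user can read are all assumptions made for illustration; how permissions are actually resolved is outside this sketch.

```python
import sqlite3


def open_block_index():
    """In-memory tables for the block reverse lookup (illustrative layout)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE collection_entries (
            collection_uuid TEXT NOT NULL,
            file_name       TEXT NOT NULL,
            file_hash       TEXT NOT NULL
        );
        CREATE TABLE blocks (
            block_hash TEXT NOT NULL,
            file_hash  TEXT NOT NULL,
            PRIMARY KEY (block_hash, file_hash)
        );
    """)
    return db


def should_upload(db, block_hash, readable_collections):
    """Return False only if the block is referenced by a file in a collection
    the user can already read; otherwise the client should send the block."""
    readable = sorted(readable_collections)
    if not readable:
        return True
    marks = ",".join("?" * len(readable))  # one placeholder per uuid
    row = db.execute(
        "SELECT 1 FROM blocks b "
        "JOIN collection_entries e ON e.file_hash = b.file_hash "
        "WHERE b.block_hash = ? AND e.collection_uuid IN (%s) LIMIT 1" % marks,
        [block_hash] + readable).fetchone()
    return row is None
```

Answering "upload it" whenever the user has no readable collection containing the block means the query never reveals the existence of a block the user could not otherwise learn about.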