Separating files from collections » History » Version 1

Peter Amstutz, 01/15/2015 09:17 PM

h1. Indexing files in collections
h2. Problem
Users cannot find out which collections contain a given file. This is a hole in our provenance system: a user can easily lose track of where a file came from if she manipulates collections through Workbench rather than by running jobs. A user should be able to query the system for all instances of a given file, across collections and renames.
h2. Proposed Solution
Identify a file's contents by its segments and blocks. Construct a normalized manifest that contains only that file, placed in the "." stream under a known name such as "data". Take the hash of this single-file normalized manifest and use that hash to represent the file.
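
A minimal sketch of that identifier, under stated assumptions: the manifest line is written as stream name, block locators, then @position:size:name@ tokens; MD5 is used as the hash; and hints after the block size (such as signatures) are dropped so they do not affect the result. The function names are hypothetical and none of these choices are fixed by this proposal.

```python
import hashlib
import re


def strip_hints(locator):
    """Keep only the hash and size of a block locator (assumption: hints are ignored)."""
    m = re.match(r"^([0-9a-f]{32})\+(\d+)", locator)
    if m is None:
        raise ValueError("not a block locator: %r" % (locator,))
    return "%s+%s" % (m.group(1), m.group(2))


def single_file_manifest(block_locators, segments, name="data"):
    """Build a one-line manifest holding one file in the "." stream.

    segments is a list of (position, size) pairs within the stream.
    """
    blocks = " ".join(strip_hints(b) for b in block_locators)
    files = " ".join("%d:%d:%s" % (pos, size, name) for pos, size in segments)
    return ". %s %s\n" % (blocks, files)


def file_identifier(block_locators, segments):
    """Hash the single-file manifest; MD5 is an assumption of this sketch."""
    text = single_file_manifest(block_locators, segments)
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

Because the file name is fixed to "data" and the stream to ".", two collection entries that differ only by name or directory would map to the same identifier, which is what lets the lookup follow a file across renames.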
(This does mean that the same file contents, allocated differently so that they produce a different set of blocks, would not be considered the "same" file. An alternate approach to identifying files uniquely would be to calculate the hash of the file contents directly. This is somewhat expensive, but could be accomplished by a background indexing service such as DataManager.)
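
To illustrate the alternate approach, here is a toy sketch that hashes file bytes directly. The @blocks@ dictionary is a hypothetical stand-in for block storage, the locator strings are placeholders, and MD5 is again only an assumption. A real indexer would stream the data rather than join it in memory.

```python
import hashlib


def content_hash(blocks, block_order, segments):
    """Hash the bytes of a file, independent of how they are split into blocks.

    blocks: dict mapping a block name to its bytes (stand-in for block storage)
    block_order: block names in stream order
    segments: (position, size) pairs within the concatenated stream
    """
    stream = b"".join(blocks[name] for name in block_order)
    h = hashlib.md5()
    for pos, size in segments:
        h.update(stream[pos:pos + size])
    return h.hexdigest()


# The same 11 bytes stored as one block and as two blocks.
one_block = {"b1": b"hello world"}
two_blocks = {"b2": b"hello ", "b3": b"world"}

same = (content_hash(one_block, ["b1"], [(0, 11)])
        == content_hash(two_blocks, ["b2", "b3"], [(0, 11)]))
```

Here @same@ is true because only the bytes are hashed, whereas the manifest-based identifier would differ between the two layouts.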
Introduce two new tables, tentatively called "files" and "collection_entries".
The "files" table contains the manifest with the file blocks and segments, keyed on the hash of that manifest.
The "collection_entries" table contains a collection uuid, a file name, and a hash which keys into the "files" table.
We can then search the "collection_entries" table for files that match a given file name pattern, or for the collections that contain a file with the desired content. We could also make metadata assertions about files independently of the collection(s) in which the file is located.
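
The two tables and both kinds of search could be sketched as follows. The column names (@file_hash@, @manifest_text@, @collection_uuid@, @file_name@) and helper functions are hypothetical, and an in-memory SQLite database is used only to keep the example self-contained.

```python
import sqlite3


def make_index():
    """Create the two proposed tables in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE files (
            file_hash     TEXT PRIMARY KEY,  -- hash of the single-file manifest
            manifest_text TEXT NOT NULL      -- blocks and segments of the file
        );
        CREATE TABLE collection_entries (
            collection_uuid TEXT NOT NULL,
            file_name       TEXT NOT NULL,
            file_hash       TEXT NOT NULL REFERENCES files(file_hash)
        );
    """)
    return db


def add_entry(db, collection_uuid, file_name, file_hash, manifest_text):
    """Record that a collection contains a file, storing the file once."""
    db.execute("INSERT OR IGNORE INTO files VALUES (?, ?)",
               (file_hash, manifest_text))
    db.execute("INSERT INTO collection_entries VALUES (?, ?, ?)",
               (collection_uuid, file_name, file_hash))


def collections_containing(db, file_hash):
    """All collections holding this content, whatever the file is called there."""
    rows = db.execute(
        "SELECT DISTINCT collection_uuid FROM collection_entries "
        "WHERE file_hash = ? ORDER BY collection_uuid", (file_hash,))
    return [row[0] for row in rows]


def entries_matching(db, pattern):
    """Entries whose file name matches a SQL LIKE pattern such as '%.fastq'."""
    return db.execute(
        "SELECT collection_uuid, file_name, file_hash FROM collection_entries "
        "WHERE file_name LIKE ? ORDER BY collection_uuid, file_name",
        (pattern,)).fetchall()
```

Keeping the file name in "collection_entries" rather than in "files" is what allows one content hash to appear under different names in different collections.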
h2. Extensions
If manifests are required to be normalized, we could dispense with manifest_text in the Collections table altogether and recreate the manifest text on demand based on "collection_entries".
We could add a "blocks" table which links Keep blocks to records in the "files" table (and to Collections through an additional join). This would enable reverse lookup of blocks to determine whether a user is allowed to know that a block already exists. That in turn would permit "should I upload this block or not?" queries, so that arv-put could skip sending blocks that are already present on the remote instance.
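
A rough sketch of such a query, assuming hypothetical column names and a simplified permission model in which the caller supplies the list of collection uuids the user can read. A block is reported as already present only if it is reachable through one of those collections, so the answer does not reveal blocks the user has no right to know about.

```python
import sqlite3


def make_db():
    """Create minimal versions of the tables involved in the reverse lookup."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE collection_entries (
            collection_uuid TEXT, file_name TEXT, file_hash TEXT);
        CREATE TABLE blocks (
            block_hash TEXT, file_hash TEXT);
    """)
    return db


def should_upload(db, block_hash, readable_collections):
    """Return False only if the block is in a file in a collection the user can read."""
    readable = list(readable_collections)
    if not readable:
        return True
    marks = ",".join("?" * len(readable))  # one placeholder per collection uuid
    row = db.execute(
        "SELECT 1 FROM blocks b "
        "JOIN collection_entries e ON e.file_hash = b.file_hash "
        "WHERE b.block_hash = ? AND e.collection_uuid IN (%s) LIMIT 1" % marks,
        [block_hash] + readable).fetchone()
    return row is None
```

A client could call this once per block before sending data and skip any block for which it returns False.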