Separating files from collections » History » Version 3
Peter Amstutz, 02/12/2015 06:15 PM
h1. Separating files from collections

h2. Problem

Users cannot find out which collections contain a given file. This is a hole in our provenance system: a user can easily lose track of where a file came from if she manipulates collections in Workbench instead of running jobs. A user should be able to query the system for all instances of a given file, across collections and renames.

h2. Proposed Solution

Identify files by the hash of the file contents. Hashing every file is somewhat expensive, but a background service such as DataManager could do it.

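
As a rough illustration of the cost involved, a content hash can be computed in a single streaming pass without holding the file in memory. This is only a sketch: the proposal does not name a hash algorithm, so SHA-256 is an assumption here, and @file_content_hash@ is a hypothetical helper, not an existing API.

```python
import hashlib


def file_content_hash(path, chunk_size=1 << 20):
    """Return the hex digest of a file's contents, read in fixed-size chunks."""
    # Assumption: SHA-256. The proposal only says "hash of the file contents".
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on the bytes read, the same file gets the same key no matter which collection it lives in or what it has been renamed to.
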
Introduce two new tables, tentatively called "files" and "collection_entries".

The "files" table contains the manifest with the file's blocks and segments, keyed on the hash of the file contents.

The "collection_entries" table contains a collection UUID, a file name, and a key into the "files" table.

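
One possible shape for the two tables is sketched below. The column names (@content_hash@, @file_manifest@, @collection_uuid@, @file_name@) and the helper functions are illustrative assumptions rather than a settled schema, and SQLite is used only so the example runs standalone.

```python
import sqlite3

# Hypothetical column names; the proposal only fixes the two table names.
SCHEMA = """
CREATE TABLE files (
    content_hash  TEXT PRIMARY KEY,  -- hash of the file contents
    file_manifest TEXT NOT NULL      -- blocks and segments making up the file
);
CREATE TABLE collection_entries (
    collection_uuid TEXT NOT NULL,
    file_name       TEXT NOT NULL,
    content_hash    TEXT NOT NULL REFERENCES files(content_hash),
    PRIMARY KEY (collection_uuid, file_name)
);
"""


def create_schema(conn):
    conn.executescript(SCHEMA)


def add_entry(conn, collection_uuid, file_name, content_hash, file_manifest):
    # The file row is stored once, however many collections reference it.
    conn.execute(
        "INSERT OR IGNORE INTO files VALUES (?, ?)",
        (content_hash, file_manifest),
    )
    conn.execute(
        "INSERT INTO collection_entries VALUES (?, ?, ?)",
        (collection_uuid, file_name, content_hash),
    )
```

With this layout, adding the same content to a second collection under a different name creates a second "collection_entries" row but no second "files" row.
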
We can then search the "collection_entries" table to find files that match a given file name pattern, or find the collections that contain a file with the desired content. We could also make metadata assertions about files independently of the collection(s) the file is located in.

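
The two searches described above could look roughly like this. It is a self-contained sketch with hypothetical column names and sample data; @GLOB@ is SQLite's shell-style matcher and stands in for whatever pattern syntax the real API would offer.

```python
import sqlite3


def open_demo_db():
    """Build an in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE files (content_hash TEXT PRIMARY KEY, file_manifest TEXT);
        CREATE TABLE collection_entries (
            collection_uuid TEXT, file_name TEXT, content_hash TEXT);
    """)
    conn.executemany("INSERT INTO files VALUES (?, ?)", [
        ("hash-a", "manifest-a"),
        ("hash-b", "manifest-b"),
    ])
    conn.executemany("INSERT INTO collection_entries VALUES (?, ?, ?)", [
        ("collection-1", "reads.fastq", "hash-a"),
        ("collection-2", "renamed.fastq", "hash-a"),  # same content, new name
        ("collection-2", "notes.txt", "hash-b"),
    ])
    return conn


def entries_matching_name(conn, pattern):
    """Return (collection_uuid, file_name) pairs whose name matches pattern."""
    return conn.execute(
        "SELECT collection_uuid, file_name FROM collection_entries "
        "WHERE file_name GLOB ? ORDER BY collection_uuid, file_name",
        (pattern,),
    ).fetchall()


def collections_containing(conn, content_hash):
    """Return the UUIDs of all collections holding a file with this content."""
    rows = conn.execute(
        "SELECT DISTINCT collection_uuid FROM collection_entries "
        "WHERE content_hash = ? ORDER BY collection_uuid",
        (content_hash,),
    )
    return [row[0] for row in rows]
```

Here @collections_containing(conn, "hash-a")@ finds the file in both collections even though it was renamed in the second one, which is the provenance query the Problem section asks for.
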
h2. Extensions

If manifests are required to be normalized, we could dispense with manifest_text altogether in the collections table and recreate the manifest text on demand from "collection_entries".

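
A minimal sketch of recreating manifest text from per-file records follows. It makes several simplifying assumptions that the proposal does not state: every file occupies whole blocks of its own (no block is shared between files), there are no empty files, and file names contain no characters that need escaping. @build_manifest_text@ is a hypothetical helper, and its output only approximates the manifest layout (one line per stream: stream name, block locators, then @position:size:name@ tokens).

```python
import posixpath
from collections import defaultdict


def build_manifest_text(entries):
    """Rebuild manifest text from (file_path, blocks) pairs.

    blocks is a list of (locator, size) tuples for that file. Assumes each
    file uses whole blocks exclusively, which real manifests do not require.
    """
    # Group files by directory; each directory becomes one stream line.
    streams = defaultdict(list)
    for file_path, blocks in entries:
        directory, name = posixpath.split(file_path)
        stream_name = "./" + directory if directory else "."
        streams[stream_name].append((name, blocks))

    lines = []
    for stream_name in sorted(streams):
        locators, file_tokens, offset = [], [], 0
        for name, blocks in sorted(streams[stream_name]):
            size = sum(block_size for _, block_size in blocks)
            locators.extend(locator for locator, _ in blocks)
            # Each file starts where the previous file's blocks ended.
            file_tokens.append("%d:%d:%s" % (offset, size, name))
            offset += size
        lines.append(" ".join([stream_name] + locators + file_tokens) + "\n")
    return "".join(lines)
```

Sorting streams and file names gives a deterministic result, which is what makes it plausible to drop the stored text and regenerate it on demand.
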
When uploading, a client such as arv-put could hash the local file and check whether the file already exists before uploading it.

We could add a "blocks" table which links Keep blocks to records in the "files" table (and to collections through an additional join). This would enable reverse lookup of blocks to determine whether a user is allowed to know that a block already exists. That in turn would permit "should I upload this block or not?" queries, so that arv-put could skip sending blocks that are already present on the remote instance.
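
The reverse lookup could be sketched as below. Everything here is an assumption made for illustration: the column names, the sample rows, and especially the permission model, which is reduced to a set of collection UUIDs the user can read. The "files" table is left out for brevity, so "blocks" joins to "collection_entries" directly on the file's content hash.

```python
import sqlite3


def open_demo_db():
    """Build an in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE collection_entries (
            collection_uuid TEXT, file_name TEXT, content_hash TEXT);
        CREATE TABLE blocks (block_locator TEXT, content_hash TEXT);
    """)
    conn.executemany("INSERT INTO collection_entries VALUES (?, ?, ?)", [
        ("c-public", "a.dat", "hash-a"),
        ("c-private", "b.dat", "hash-b"),
    ])
    conn.executemany("INSERT INTO blocks VALUES (?, ?)", [
        ("blk-1", "hash-a"),
        ("blk-2", "hash-b"),
    ])
    return conn


def block_is_visible(conn, block_locator, readable_collections):
    """True if the block belongs to a file in a collection the user can read."""
    rows = conn.execute(
        "SELECT DISTINCT ce.collection_uuid FROM blocks b "
        "JOIN collection_entries ce ON ce.content_hash = b.content_hash "
        "WHERE b.block_locator = ?",
        (block_locator,),
    )
    return any(uuid in readable_collections for (uuid,) in rows)


def blocks_to_upload(conn, block_locators, readable_collections):
    """Return the blocks the client still has to send, in the given order."""
    return [
        locator for locator in block_locators
        if not block_is_visible(conn, locator, readable_collections)
    ]
```

A block that exists only in a collection the user cannot read is reported the same way as a block that does not exist at all, so the query does not reveal the presence of data the user has no access to.
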