Separating files from collections » History » Version 3

Peter Amstutz, 02/12/2015 06:15 PM

h1. Separating files from collections

h2. Problem

Users cannot find out which collections contain a given file.  This is a gap in our provenance system: a user who manipulates collections through Workbench, rather than by running jobs, can easily lose track of where a file came from.  A user should be able to query the system for all instances of a given file, across collections and renames.

h2. Proposed Solution

Identify files by the hash of the file contents.  Computing this hash is somewhat expensive, but it could be done by a background service such as DataManager.
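As a rough illustration of content-based identification, the sketch below hashes a file in fixed-size chunks so that large files never have to fit in memory. The proposal does not name a hash function; SHA-256 and the helper name file_content_hash are assumptions made only for this example.

```python
import hashlib
import io


def file_content_hash(stream, chunk_size=1 << 20):
    """Return the hex digest of all bytes readable from a binary stream."""
    # Assumption: SHA-256 is used here only because the proposal names no algorithm.
    digest = hashlib.sha256()
    # Read in chunks so that memory use stays bounded for large files.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Two copies with identical bytes get the same key, whatever they are named.
original = io.BytesIO(b"hello world\n")
renamed_copy = io.BytesIO(b"hello world\n")
same_key = file_content_hash(original) == file_content_hash(renamed_copy)
```

Because the key depends only on the bytes, a renamed or copied file maps to the same record, which is what makes lookups across collections and renames possible.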
Introduce two new tables, tentatively called "files" and "collection_entries".
Each row of the "files" table holds the manifest fragment for one file (its blocks and segments), keyed on the hash of the file contents.
The "collection_entries" table contains a collection uuid, a file name, and a key into the "files" table.
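The two tables could look roughly like the following, shown with an in-memory SQLite database purely for illustration. The column names (content_hash, manifest, collection_uuid, file_name), the helper functions, and the placeholder rows are all assumptions; the proposal only fixes the table names tentatively.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    content_hash TEXT PRIMARY KEY,  -- hash of the file contents
    manifest     TEXT NOT NULL      -- blocks and segments of this one file
);
CREATE TABLE collection_entries (
    collection_uuid TEXT NOT NULL,
    file_name       TEXT NOT NULL,
    content_hash    TEXT NOT NULL REFERENCES files (content_hash)
);
""")

# Placeholder rows: the same content appears in two collections under different names.
conn.execute("INSERT INTO files VALUES (?, ?)", ("hash-1", "<blocks and segments>"))
conn.executemany(
    "INSERT INTO collection_entries VALUES (?, ?, ?)",
    [
        ("collection-a", "reads.fastq", "hash-1"),
        ("collection-b", "renamed.fastq", "hash-1"),
    ],
)


def collections_containing(conn, content_hash):
    """List (collection_uuid, file_name) for every instance of one file."""
    rows = conn.execute(
        "SELECT collection_uuid, file_name FROM collection_entries "
        "WHERE content_hash = ? ORDER BY collection_uuid, file_name",
        (content_hash,),
    )
    return rows.fetchall()


def entries_matching(conn, name_pattern):
    """List (collection_uuid, file_name) whose name matches a glob pattern."""
    rows = conn.execute(
        "SELECT collection_uuid, file_name FROM collection_entries "
        "WHERE file_name GLOB ? ORDER BY collection_uuid, file_name",
        (name_pattern,),
    )
    return rows.fetchall()
```

Here collections_containing(conn, "hash-1") returns both entries even though the file was renamed in the second collection, which is the reverse lookup the Problem section asks for.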
We can then search the "collection_entries" table for files that match a given file name pattern, or for the collections that contain a file with the desired content.  We could also make metadata assertions about a file independently of the collection(s) it is located in.

h2. Extensions

If manifests are required to be normalized, we could dispense with manifest_text in the Collections table altogether and recreate the manifest text on demand from the "collection_entries" table.
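A minimal sketch of rebuilding manifest text from per-file records follows. It assumes a simplified manifest layout (one line per stream: the stream name, then block locators, then position:size:name tokens) and assumes that each file occupies its own blocks, so offsets are a running sum of file sizes. The function build_manifest and the shape of its input tuples are hypothetical; real normalization rules may differ.

```python
import hashlib
from collections import defaultdict


def build_manifest(entries):
    """Rebuild manifest text from (stream_name, file_name, locators, size) tuples.

    Assumption: the blocks listed for a file hold exactly that file's bytes.
    """
    streams = defaultdict(list)
    for stream_name, file_name, locators, size in entries:
        streams[stream_name].append((file_name, locators, size))

    lines = []
    for stream_name in sorted(streams):
        blocks, file_tokens, offset = [], [], 0
        # Sort by file name so the output is deterministic.
        for file_name, locators, size in sorted(streams[stream_name], key=lambda e: e[0]):
            blocks.extend(locators)
            escaped = file_name.replace(" ", "\\040")  # spaces separate tokens
            file_tokens.append(f"{offset}:{size}:{escaped}")
            offset += size
        lines.append(" ".join([stream_name] + blocks + file_tokens) + "\n")
    return "".join(lines)


def locator(data):
    """Hypothetical helper: a 'hash+size' locator for one block of bytes."""
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"


entries = [
    (".", "b.txt", [locator(b"bar")], 3),
    (".", "a.txt", [locator(b"foo")], 3),
]
manifest_text = build_manifest(entries)
```

Since the output depends only on the sorted records, the same set of entries always produces the same text, which is the property that would make a stored manifest_text redundant.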
When uploading, the client could hash the local file and check whether that file already exists before sending it.
We could add a "blocks" table which links Keep blocks to records in the "files" table (and to Collections through an additional join).  This would enable a reverse lookup from a block to the collections that contain it, to determine whether a user is allowed to know that the block already exists.  That in turn would permit "should I upload this block or not?" queries, so that arv-put could skip sending blocks that are already present on the remote instance.
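The client side of such a "should I upload this block?" check might be sketched as follows. The function blocks_to_upload is hypothetical and is not part of arv-put; known_blocks stands in for whatever the server would answer from the "blocks" table for blocks this user is allowed to know about. The MD5-plus-size locator form and the 64 MiB default block size are assumptions of the sketch.

```python
import hashlib


def blocks_to_upload(data, known_blocks, block_size=64 * 1024 * 1024):
    """Return (locator, block) pairs for the blocks that still need sending."""
    missing = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        # Assumed locator form: hex MD5 of the block, "+", block length.
        locator = f"{hashlib.md5(block).hexdigest()}+{len(block)}"
        if locator not in known_blocks:
            missing.append((locator, block))
    return missing


# Tiny block size so the example stays readable: blocks are aaaa, bbbb, aaaa.
data = b"aaaabbbbaaaa"
already_present = {f"{hashlib.md5(b'aaaa').hexdigest()}+4"}
to_send = blocks_to_upload(data, already_present, block_size=4)
```

In this example only the bbbb block is returned, since both aaaa blocks match a locator the server already reported as present.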