Tom Clegg, 02/20/2019 06:55 PM
h1. Index of files in collections

Currently, the manifest_text column contains information about the individual files in each collection. However, its utility is limited because the data is not structured in a way that PostgreSQL understands:
* Searching filenames is difficult or impossible because even the "list of filenames" column is too long for PostgreSQL to index properly.
* Searching for collections that reference a given block locator (or locator pattern, which is useful for partitioning keep-balance work) is inefficient.

These problems (and some other opportunities) could be addressed by keeping a separate table of files.

|pdh|dir|filename|bytesize|filehash†|
|abcd1234+123|foo/bar|baz.txt|1234|dcba4321|
|abcd1234+123|foo/bar|waz.txt|1235|efab8912|

† In general, filehash cannot be computed from the manifest alone. This column would presumably allow null ("not known"), and might not exist at all.

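As a rough illustration of the table above, the following sketch creates it in an in-memory SQLite database (a stand-in for PostgreSQL so the example is runnable). The table name @files@, the column types, the primary key, and the filename index are all assumptions, not decisions made on this page. The second sample row stores a null filehash to show the "not known" case.

```python
import sqlite3

# Hypothetical schema: names, types, key, and index are assumptions.
SCHEMA = """
CREATE TABLE files (
    pdh      TEXT    NOT NULL,
    dir      TEXT    NOT NULL,
    filename TEXT    NOT NULL,
    bytesize INTEGER NOT NULL,
    filehash TEXT,              -- NULL means "not known"
    PRIMARY KEY (pdh, dir, filename)
);
CREATE INDEX files_filename_idx ON files (filename);
"""


def create_files_table(conn):
    """Create the sketched files table and its filename index."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_files_table(conn)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?, ?)",
    [
        ("abcd1234+123", "foo/bar", "baz.txt", 1234, "dcba4321"),
        ("abcd1234+123", "foo/bar", "waz.txt", 1235, None),  # hash not known
    ],
)
```
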
New rows would be added to the files table whenever a collection is saved with a PDH that isn't already present.

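The rows for a new PDH would have to be derived from the manifest text. The sketch below is a simplified illustration, assuming each manifest line holds a stream name, block locators, and @position:size:name@ file tokens. The function name @file_rows@ and the use of an empty string for the root directory are assumptions. It sums the segments of a file that appears more than once in a stream, leaves filehash null, and does not handle file tokens whose names contain subdirectories.

```python
from collections import defaultdict


def file_rows(pdh, manifest_text):
    """Return sorted (pdh, dir, filename, bytesize, filehash) rows for one manifest."""
    sizes = defaultdict(int)
    for line in manifest_text.splitlines():
        tokens = line.split(" ")
        if len(tokens) < 3:
            continue  # not a complete stream line
        stream = tokens[0]
        # "." is the root stream; "./foo/bar" becomes "foo/bar".
        directory = stream[2:] if stream.startswith("./") else ""
        for token in tokens[1:]:
            parts = token.split(":", 2)
            # File tokens look like position:size:name; block locators do not.
            if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
                name = parts[2].replace("\\040", " ")  # escaped spaces
                sizes[(directory, name)] += int(parts[1])
    # filehash is left as None because it cannot be derived from the manifest.
    return [(pdh, d, f, size, None) for (d, f), size in sorted(sizes.items())]
```
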
Old rows would be deleted from the files table whenever the last remaining collection with a given PDH is removed.

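The add and delete rules can be sketched as two small functions. This is only an illustration under stated assumptions: SQLite stands in for PostgreSQL, the @collections@ table is reduced to hypothetical @uuid@ and @pdh@ columns, and @save_collection@ and @remove_collection@ are made-up names rather than existing API calls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collections (uuid TEXT PRIMARY KEY, pdh TEXT NOT NULL);
CREATE TABLE files (pdh TEXT, dir TEXT, filename TEXT, bytesize INTEGER, filehash TEXT);
""")


def save_collection(conn, uuid, pdh, rows):
    """Save a collection; index its files only if this PDH is not indexed yet."""
    with conn:  # run both statements in one transaction
        known = conn.execute(
            "SELECT 1 FROM files WHERE pdh = ? LIMIT 1", (pdh,)
        ).fetchone()
        conn.execute("INSERT INTO collections (uuid, pdh) VALUES (?, ?)", (uuid, pdh))
        if known is None:
            conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?)", rows)


def remove_collection(conn, uuid):
    """Remove a collection; drop its file rows if no other collection shares the PDH."""
    with conn:
        row = conn.execute(
            "SELECT pdh FROM collections WHERE uuid = ?", (uuid,)
        ).fetchone()
        if row is None:
            return  # nothing to remove
        conn.execute("DELETE FROM collections WHERE uuid = ?", (uuid,))
        still_used = conn.execute(
            "SELECT 1 FROM collections WHERE pdh = ? LIMIT 1", row
        ).fetchone()
        if still_used is None:
            conn.execute("DELETE FROM files WHERE pdh = ?", row)
```

Doing the check and the write in the same transaction keeps the files table consistent with the collections table if a save or removal fails partway.
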
Once this table is populated, searching collection filenames would be implemented by searching the files table and joining the collections table on PDH.
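
A filename search of this kind might look like the following sketch. It again uses SQLite in place of PostgreSQL, and the @collections@ columns (@uuid@, @pdh@), the sample UUIDs, the third sample file, and the function name @find_collections@ are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collections (uuid TEXT PRIMARY KEY, pdh TEXT NOT NULL);
CREATE TABLE files (pdh TEXT, dir TEXT, filename TEXT, bytesize INTEGER, filehash TEXT);
INSERT INTO collections VALUES ('c1', 'abcd1234+123'), ('c2', 'abcd1234+123'),
                               ('c3', 'ffff0000+45');
INSERT INTO files VALUES ('abcd1234+123', 'foo/bar', 'baz.txt', 1234, 'dcba4321'),
                         ('abcd1234+123', 'foo/bar', 'waz.txt', 1235, 'efab8912'),
                         ('ffff0000+45', 'docs', 'other.txt', 45, NULL);
""")


def find_collections(conn, filename_pattern):
    """Return (uuid, dir, filename) for every collection with a matching file."""
    return conn.execute(
        """
        SELECT DISTINCT c.uuid, f.dir, f.filename
        FROM files f
        JOIN collections c ON c.pdh = f.pdh
        WHERE f.filename LIKE ?
        ORDER BY c.uuid, f.dir, f.filename
        """,
        (filename_pattern,),
    ).fetchall()


matches = find_collections(conn, "baz%")
```

Because two collections share the same PDH here, the single file row for baz.txt yields one match per collection.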