Index of files in collections » History » Version 2
Tom Clegg, 02/20/2019 08:11 PM
h1. Index of files in collections

Currently, the manifest_text column contains information about the individual files in each collection. However, its utility is limited because the data is not structured in a way that PostgreSQL understands:
* Searching filenames is difficult or impossible, because even the "list of filenames" column is too long for PostgreSQL to index properly.
* Searching for collections that reference a given block locator (or locator pattern, which is useful for partitioning keep-balance work) is inefficient.

These problems (and some other opportunities) could be addressed by keeping a separate table of files.

|pdh|dir|filename|bytesize|filehash†|
|abcd1234+123|foo/bar|baz.txt|1234|dcba4321|
|abcd1234+123|foo/bar|waz.txt|1235|efab8912|

† In general, filehash cannot be computed from the manifest alone. This column would presumably allow null ("not known"), and might not exist at all.

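To make the row layout concrete, here is a minimal Python sketch of deriving (pdh, dir, filename, bytesize) rows from a manifest. It assumes the usual manifest layout: one stream per line, consisting of a stream name, then block locators, then @position:size:filename@ file tokens, with special characters escaped as a backslash plus three octal digits. The names @file_rows@ and @unescape@ are hypothetical, and filehash is left out because it cannot be derived from the manifest.

```python
import posixpath
import re
from collections import OrderedDict

# Assumed file token layout: <position>:<size>:<filename>
FILE_TOKEN = re.compile(r"^(\d+):(\d+):(.+)$")


def unescape(token):
    # Assumed escaping: backslash followed by three octal digits (\040 is a space).
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), token)


def file_rows(pdh, manifest_text):
    """Return one (pdh, dir, filename, bytesize) tuple per file in the manifest."""
    sizes = OrderedDict()
    for line in manifest_text.splitlines():
        tokens = line.split(" ")
        if len(tokens) < 2:
            continue  # blank or malformed line
        stream = unescape(tokens[0])
        for token in tokens[1:]:
            m = FILE_TOKEN.match(token)
            if m is None:
                continue  # not a file token, so treat it as a block locator
            name = unescape(m.group(3))
            path = posixpath.normpath(posixpath.join(stream, name))
            # A file may be listed as several segments; add their sizes together.
            sizes[path] = sizes.get(path, 0) + int(m.group(2))
    return [
        (pdh, posixpath.dirname(path), posixpath.basename(path), size)
        for path, size in sizes.items()
    ]


example_manifest = (
    "./foo/bar 0123456789abcdef0123456789abcdef+2469 "
    "0:1234:baz.txt 1234:1235:waz.txt\n"
)
example_rows = file_rows("abcd1234+123", example_manifest)
```

With the example manifest, @example_rows@ holds the two rows shown in the table above, minus the filehash column. The locator and PDH values are placeholders, not real hashes.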
New rows would be added to the files table whenever a collection is saved with a PDH that isn't already present.

Old rows would be deleted from the files table whenever the last remaining collection with a given PDH is removed.

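The add and delete rules above can be sketched as follows. This is only an illustration: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the function names, the simplified collections table, and the @portable_data_hash@ column name are assumptions rather than the actual schema.

```python
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE collections (
            uuid TEXT PRIMARY KEY,
            portable_data_hash TEXT NOT NULL
        );
        CREATE TABLE files (
            pdh TEXT NOT NULL,
            dir TEXT NOT NULL,
            filename TEXT NOT NULL,
            bytesize INTEGER NOT NULL,
            filehash TEXT,  -- nullable: "not known"
            PRIMARY KEY (pdh, dir, filename)
        );
    """)
    return db


def drop_if_unreferenced(db, pdh):
    # Remove file rows only when no remaining collection has this PDH.
    db.execute(
        "DELETE FROM files WHERE pdh = ? AND NOT EXISTS "
        "(SELECT 1 FROM collections WHERE portable_data_hash = ?)",
        (pdh, pdh),
    )


def save_collection(db, uuid, pdh, rows):
    """rows: iterable of (pdh, dir, filename, bytesize) for this PDH."""
    with db:  # one transaction: commit on success, roll back on error
        old = db.execute(
            "SELECT portable_data_hash FROM collections WHERE uuid = ?", (uuid,)
        ).fetchone()
        present = db.execute(
            "SELECT 1 FROM files WHERE pdh = ? LIMIT 1", (pdh,)
        ).fetchone()
        if present is None:
            db.executemany(
                "INSERT INTO files (pdh, dir, filename, bytesize) "
                "VALUES (?, ?, ?, ?)",
                list(rows),
            )
        db.execute(
            "INSERT OR REPLACE INTO collections (uuid, portable_data_hash) "
            "VALUES (?, ?)",
            (uuid, pdh),
        )
        # Updating a collection may leave its previous PDH without any owner.
        if old is not None and old[0] != pdh:
            drop_if_unreferenced(db, old[0])


def delete_collection(db, uuid):
    with db:
        row = db.execute(
            "SELECT portable_data_hash FROM collections WHERE uuid = ?", (uuid,)
        ).fetchone()
        if row is None:
            return
        db.execute("DELETE FROM collections WHERE uuid = ?", (uuid,))
        drop_if_unreferenced(db, row[0])
```

Doing the check and the insert (or delete) in one transaction keeps the files table consistent with the collections table. A real implementation would also need to consider concurrent saves of the same PDH, which this sketch ignores.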
Once this table is populated, searching collection filenames would be implemented by searching the files table and joining it to the collections table on PDH.

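A filename search along these lines might look like the sketch below. It again uses an in-memory SQLite database in place of PostgreSQL, with a reduced files table, placeholder collection UUIDs, and an assumed @portable_data_hash@ column on the collections table. The function name @find_collections@ is hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE collections (uuid TEXT PRIMARY KEY, portable_data_hash TEXT);
    CREATE TABLE files (pdh TEXT, dir TEXT, filename TEXT, bytesize INTEGER);
    INSERT INTO collections VALUES ('coll-a', 'abcd1234+123');
    INSERT INTO collections VALUES ('coll-b', 'abcd1234+123');
    INSERT INTO collections VALUES ('coll-c', 'ffff0000+9');
    INSERT INTO files VALUES ('abcd1234+123', 'foo/bar', 'baz.txt', 1234);
    INSERT INTO files VALUES ('abcd1234+123', 'foo/bar', 'waz.txt', 1235);
    INSERT INTO files VALUES ('ffff0000+9', '', 'other.txt', 1);
""")


def find_collections(db, filename):
    """Return UUIDs of all collections containing a file with this exact name."""
    query = """
        SELECT DISTINCT c.uuid
          FROM files f
          JOIN collections c ON c.portable_data_hash = f.pdh
         WHERE f.filename = ?
         ORDER BY c.uuid
    """
    return [row[0] for row in db.execute(query, (filename,))]


matching = find_collections(db, "baz.txt")
```

Because file rows are keyed by PDH rather than by collection, one matching row can yield several collections, which is why the query selects distinct UUIDs.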
Whatever the index/search mechanism is, it should be able to find "Sample_RMF1U7F_S27_R1_001.fastq.gz" by searching for any of the following strings:

"sample_rmf1u7f_s27_r1_001.fastq.gz" (or "sample*")
"rmf1u7f_s27_r1_001.fastq.gz" (or "rmf1u7f*")
"s27_r1_001.fastq.gz" (...)
"r1_001.fastq.gz"
"001.fastq.gz"
"fastq.gz"
"gz"
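
One way to satisfy this requirement, offered only as a sketch and not as the chosen mechanism, is to index every lowercased suffix of the filename that starts at a token boundary. The code below assumes tokens are runs of ASCII letters and digits, and that a trailing "*" in a query means prefix match. The names @suffix_keys@ and @matches@ are hypothetical.

```python
import re


def suffix_keys(filename):
    """Return the lowercased suffixes of filename that begin at each token."""
    lowered = filename.lower()
    # A token is a run of letters/digits; "_" and "." act as separators.
    starts = [m.start() for m in re.finditer(r"[a-z0-9]+", lowered)]
    return [lowered[i:] for i in starts]


def matches(filename, query):
    """Case-insensitive match of query against the suffix keys of filename."""
    query = query.lower()
    keys = suffix_keys(filename)
    if query.endswith("*"):
        prefix = query[:-1]
        return any(key.startswith(prefix) for key in keys)
    return query in keys


keys = suffix_keys("Sample_RMF1U7F_S27_R1_001.fastq.gz")
```

For the example filename, @keys@ contains exactly the seven strings listed above, so each of them matches as an exact query, and "sample*" or "rmf1u7f*" match as prefix queries. In a database, the same keys could be stored one per row and searched with an ordinary prefix-capable index.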