[Spike] [API] Fully functional filename search
- Indexing on text fields cannot handle medium-size text inputs.
- Indexing on to_tsvector(...) cannot handle certain large text inputs (limit depends on content, not just size). Result: crash when creating the index or when inserting a row, whichever happens last.
- Add a tsvector column. Populate it with to_tsvector(...) where possible. Where not possible, either populate with partial content (to_tsvector(substring(...))), or leave it null and adjust the search query to do an unindexed fulltext search on such rows. A function with an exception clause might work.
- Use something other than Postgresql for text search.
- Index of files in collections
Spike goal: validate that the Index of files in collections approach can return the desired results, and performs well on a production-size database.Suggested implementation:
- Retrieve all collections from a production-size cluster, extract the pdh/dir/file/size info, and insert into a table on a dev database.
- Try various ways of indexing/reformatting the dir/filenames so the example searches run quickly and return useful results.
- Provide table of speeds/results for various approaches.
#3 Updated by Peter Amstutz 3 months ago
Postgres text search has other problems when it comes to segmenting filenames. However I don't think that means we give up postgres, should create our own filename search table that has the behavior we want.
Maintain our own filename index in a new table.
keyword → collection PDH
Perform custom filename tokenizing and support prefix search with "like" (can use B-tree indexes). Split on symbols like "_", "-" and ".", CamelCase (lower&arr;upper transitions). Convert everything to lowercase. For example:
Would turn into:
Searches would be prefix searches, eg a search on "RMF1U7F" would be
keyword like 'rmf1u7f%'