Feature #14573

[Spike] [API] Fully functional filename search

Added by Tom Clegg 3 months ago. Updated 3 days ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
API
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
2.0

Description

See #13752, #14560 for previous attempts.
  • Indexing on text fields cannot handle medium-size text inputs.
  • Indexing on to_tsvector(...) cannot handle certain large text inputs (limit depends on content, not just size). Result: crash when creating the index or when inserting a row, whichever happens last.
Approaches that have been considered:
  • Add a tsvector column. Populate it with to_tsvector(...) where possible. Where not possible, either populate with partial content (to_tsvector(substring(...))), or leave it null and adjust the search query to do an unindexed fulltext search on such rows. A function with an exception clause might work.
  • Use something other than Postgresql for text search.
  • Index of files in collections

Spike goal: validate that the Index of files in collections approach can return the desired results, and performs well on a production-size database.

Suggested implementation:
  • Retrieve all collections from a production-size cluster, extract the pdh/dir/file/size info, and insert into a table on a dev database.
  • Try various ways of indexing/reformatting the dir/filenames so the example searches run quickly and return useful results.
  • Provide table of speeds/results for various approaches.

Related issues

Has duplicate Arvados - Story #13508: Fix postgres search for filenamesDuplicate

Has duplicate Arvados - Story #14611: [Epic] Site-wide search for text, filenames, dataDuplicate

History

#1 Updated by Tom Clegg 3 months ago

  • Tracker changed from Bug to Feature

#2 Updated by Tom Clegg 3 months ago

  • Description updated (diff)

#3 Updated by Peter Amstutz 3 months ago

Postgres text search has other problems when it comes to segmenting filenames. However I don't think that means we give up postgres, should create our own filename search table that has the behavior we want.

https://dev.arvados.org/issues/13752#note-7

Proposed solution:

Maintain our own filename index in a new table.

keyword → collection PDH

Perform custom filename tokenizing and support prefix search with "like" (can use B-tree indexes). Split on symbols like "_", "-" and ".", CamelCase (lower&arr;upper transitions). Convert everything to lowercase. For example:

"Sample_RMF1U7F_S27_R1_001.fastq.gz"

Would turn into:

"sample_rmf1u7f_s27_r1_001.fastq.gz"
"rmf1u7f_s27_r1_001.fastq.gz"
"s27_r1_001.fastq.gz"
"r1_001.fastq.gz"
"001.fastq.gz"
"fastq.gz"
"gz"

Searches would be prefix searches, eg a search on "RMF1U7F" would be keyword like 'rmf1u7f%'

#4 Updated by Peter Amstutz 3 days ago

Table of:

(filename, portable data hash of collection, path in collection, file size, content hash)

#5 Updated by Tom Clegg 3 days ago

  • Description updated (diff)

#6 Updated by Tom Clegg 3 days ago

  • Subject changed from [API] Fully functional filename search to [Spike] [API] Fully functional filename search
  • Description updated (diff)
  • Target version changed from To Be Groomed to Arvados Future Sprints
  • Story points set to 2.0

#7 Updated by Tom Clegg 3 days ago

  • Has duplicate Story #13508: Fix postgres search for filenames added

#8 Updated by Tom Clegg 3 days ago

  • Has duplicate Story #14611: [Epic] Site-wide search for text, filenames, data added

Also available in: Atom PDF