Story #14611

[Epic] Site-wide search for text, filenames, data

Added by Tom Clegg about 2 years ago. Updated almost 2 years ago.

Status: Duplicate
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

Arvados has a "site-wide search" feature, but it often fails to meet users' expectations:
  • Full-text search doesn't find exact strings (#13508) and doesn't index all filenames in large collections (#13752, #14560).
  • Substring search is slow, and doesn't index full rows (this is why full-text search was added).
  • No facility at all for searching file contents.

PostgreSQL's full-text search may be able to address everything short of searching file contents, with a bit more work on our side: for example, using a dictionary/language other than English, or creating a table of filenames instead of searching a huge text field that holds a list of filenames.
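A minimal sketch of the filename-table idea, assuming a hypothetical `collection_filenames` table (the table, column, and index names are illustrative and not part of the Arvados schema). It uses PostgreSQL's built-in `simple` text search configuration, which does no English stemming or stop-word removal, and stores one row per filename so that no single `tsvector` has to hold a whole collection's file list. The `%s` placeholder assumes a psycopg2-style driver.

```python
# Sketch only: table, column, and index names are hypothetical.

CREATE_TABLE = """
CREATE TABLE collection_filenames (
    collection_uuid varchar(255) NOT NULL,
    filename text NOT NULL
)
"""

# Expression index so the search query below can use GIN instead of a scan.
CREATE_INDEX = """
CREATE INDEX collection_filenames_fts_idx
    ON collection_filenames
    USING gin (to_tsvector('simple', filename))
"""

# The WHERE expression must match the indexed expression exactly.
SEARCH = """
SELECT DISTINCT collection_uuid
  FROM collection_filenames
 WHERE to_tsvector('simple', filename) @@ plainto_tsquery('simple', %s)
"""


def filename_rows(collection_uuid, filenames):
    """Return one (collection_uuid, filename) row per distinct filename."""
    return [(collection_uuid, name) for name in sorted(set(filenames))]
```

Whether the default parser tokenizes filenames in a way that satisfies exact-string expectations (#13508) would still need to be checked against real data.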

Another approach would be to use a separate tool to index/search the database, and apply Arvados permissions to those results. This could conceivably index file contents as well as database rows.
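As a rough illustration of "apply Arvados permissions to those results", the sketch below post-filters hits from an external index against the set of object UUIDs the requesting user can read. The hit format and the function name are hypothetical, and how the readable set is obtained (presumably from the API server) is left out.

```python
def filter_readable(hits, readable_uuids):
    """Keep only hits the user may read, preserving the index's rank order.

    hits: list of dicts with at least a "uuid" key (hypothetical format).
    readable_uuids: iterable of UUIDs the requesting user can read.
    """
    readable = set(readable_uuids)
    return [hit for hit in hits if hit["uuid"] in readable]


# Usage with made-up identifiers:
hits = [
    {"uuid": "coll-1", "score": 0.9},
    {"uuid": "coll-2", "score": 0.7},
    {"uuid": "coll-3", "score": 0.4},
]
visible = filter_readable(hits, ["coll-3", "coll-1"])
```

Filtering after the search is simple but can leave a page of results nearly empty when most hits are unreadable; pushing the permission check into the index query is an alternative this sketch does not cover.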


Related issues

Related to Arvados - Story #13508: Fix postgres search for filenames (Duplicate)

Related to Arvados - Bug #14560: [1.3.0] error: ERROR: string is too long for tsvector (2299194 bytes, max 1048575 bytes) (Resolved)

Related to Arvados - Bug #6382: [Workbench] Searching through a collection using regex should accept $ instead of \n (Closed, 06/22/2015)

Is duplicate of Arvados - Feature #14573: [Spike] [API] Fully functional filename search (Resolved)

History

#1 Updated by Tom Clegg about 2 years ago

  • Related to Story #13508: Fix postgres search for filenames added

#3 Updated by Tom Clegg about 2 years ago

  • Related to Bug #14560: [1.3.0] error: ERROR: string is too long for tsvector (2299194 bytes, max 1048575 bytes) added

#4 Updated by Tom Clegg about 2 years ago

  • Related to Bug #6382: [Workbench] Searching through a collection using regex should accept $ instead of \n added

#5 Updated by Peter Amstutz about 2 years ago

I like the idea of a hybrid solution that uses PG full-text search for name, description, and similar fields, and uses a specialized database for indexing collection contents, both filenames and the contents of documents. We need to be careful that we don't start storing reads from fastq files in the full-text database, though.

#6 Updated by Tom Morris almost 2 years ago

  • Target version set to To Be Groomed

#7 Updated by Tom Clegg almost 2 years ago

  • Is duplicate of Feature #14573: [Spike] [API] Fully functional filename search added

#8 Updated by Tom Clegg almost 2 years ago

  • Status changed from New to Duplicate
  • Target version deleted (To Be Groomed)
