Feature #15106

[API] Index 'like' queries and use for search

Added by Peter Amstutz 3 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
06/14/2019
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0
Release relationship:
Auto

Description

Add a migration that creates trigram indexes on columns currently searchable by full text search. This should concatenate columns to create a single expression index (eg name || description || file_names).

Migration should execute "create extension pg_trgm" and provide a useful error message if the extension is not available (should advise the user to install postgres-contrib and/or link to the migration notes).

Update workbench2 search bar to use ["any", "ilike", "%term%"] instead of '@@'. Workbench should perform splitting of the search on whitespace and add a separate search clause for each term.


Subtasks

Task #15264: Review 15106-trgm-text-searchResolvedEric Biagiotti


Related issues

Related to Arvados - Feature #14573: [Spike] [API] Fully functional filename searchResolved

Associated revisions

Revision 442a871e
Added by Eric Biagiotti about 1 month ago

Merge branch 'master' into 15106-trgm-text-search

refs #15106

Arvados-DCO-1.1-Signed-off-by: Eric Biagiotti <>

Revision b4b8e120
Added by Eric Biagiotti about 1 month ago

Merge remote-tracking branch 'origin/master' into 15106-trgm-text-search

refs #15106

Arvados-DCO-1.1-Signed-off-by: Eric Biagiotti <>

Revision a7cdd1fa
Added by Eric Biagiotti about 1 month ago

Merge branch '15106-trgm-text-search'

refs #15106

Arvados-DCO-1.1-Signed-off-by: Eric Biagiotti <>

History

#1 Updated by Peter Amstutz 3 months ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz 3 months ago

  • Subject changed from Indexed like queries to [API] Index 'like' queries for search
  • Status changed from In Progress to New

#3 Updated by Peter Amstutz 3 months ago

  • Related to Feature #14573: [Spike] [API] Fully functional filename search added

#4 Updated by Peter Amstutz 3 months ago

  • Subject changed from [API] Index 'like' queries for search to [API] Index 'like' queries and use for search
  • Description updated (diff)

#5 Updated by Peter Amstutz 3 months ago

  • Description updated (diff)

#6 Updated by Peter Amstutz 3 months ago

  • Description updated (diff)

#7 Updated by Peter Amstutz 3 months ago

  • Description updated (diff)
  • Story points set to 2.0

#8 Updated by Tom Morris 3 months ago

  • Target version changed from To Be Groomed to Arvados Future Sprints

#9 Updated by Tom Morris 2 months ago

  • Target version changed from Arvados Future Sprints to 2019-06-05 Sprint

#10 Updated by Eric Biagiotti 2 months ago

  • Assigned To set to Eric Biagiotti

#11 Updated by Eric Biagiotti about 2 months ago

A few questions/comments before writing the migration:

- Looking at our full text search indexes, I noticed we are using COALESCE. I just wanted to double check whether this is needed for creating the trgm index.

- I also noticed from the examples on the spike that when creating the index, columns are separated by || ' ' ||. I'm not sure what this does. Any explanation would be helpful.

- If I created the trgm index with the following:

create index collection_trgm_idx on public.collections USING gin ((file_names || ' ' || (properties)::text) public.gin_trgm_ops)

Then the following where clause is required to use the trgm index:

select * from collections where (file_names || ' ' || (properties)::text) ilike '%176DEMH%'" 

A regular query like the following will not use the trgm index:

select * from collections where file_names ilike '%176DEMH%'" 

I haven't dug into record_filter in full detail yet, but I'm not sure it will satisfy the above. For example, the "any" attribute uses the searchable_columns function, which doesn't appear to handle jsonb types. Could it be possible that we need another operator similar to @@, or am I way off base?

#12 Updated by Peter Amstutz about 2 months ago

Eric Biagiotti wrote:

A few questions/comments before writing the migration:

- Looking at our full text search indexes, I noticed we are using COALESCE. I just wanted to double check whether this is needed for creating the trgm index.

I think that's because it isn't legal to concatenate a string with NULL. COALESCE specifies that if the column is null, use an empty string instead.

- I also noticed from the examples on the spike that when creating the index, columns are separated by || ' ' ||. I'm not sure what this does. Any explanation would be helpful.

|| is the string concatenation operator. This is just adding a space between each column text.

- If I created the trgm index with the following:

create index collection_trgm_idx on public.collections USING gin ((file_names || ' ' || (properties)::text) public.gin_trgm_ops)

Then the following where clause is required to use the trgm index:

select * from collections where (file_names || ' ' || (properties)::text) ilike '%176DEMH%'"

A regular query like the following will not use the trgm index:

select * from collections where file_names ilike '%176DEMH%'"

The trick here is that we want to build the index without actually creating and storing an additional column, so we make an "expression index". In order to use the index, you have to query on exactly the same expression.

I haven't dug into record_filter in full detail yet, but I'm not sure it will satisfy the above. For example, the "any" attribute uses the searchable_columns function, which doesn't appear to handle jsonb types. Could it be possible that we need another operator similar to @@, or am I way off base?

We're only supporting searches on 'any'. The record_filter code should be able to recognize ["any", "ilike", "%176DEMH%"] and build the appropriate query to use the trigram index.

You should be able to coerce jsonb columns to a string representation. It might be that you are doing it already with properties::text.

#13 Updated by Eric Biagiotti about 2 months ago

Some migrations statistics using the e51c5 data on my laptop.

Table Name # of Rows Index Creation Time (s)
collections 4228696 575.6284
container_requests 873201 219.8006
groups 1789 0.1482
jobs 359324 134.3642
pipeline_instances 10721 31.8514
pipeline_templates 6 0.0074
workflows 19 0.0024

Total Migration Time: ~ 16 minutes

#14 Updated by Eric Biagiotti about 2 months ago

Some questions/comments about needing to be a postgres superuser to execute "CREATE EXTENSION IF NOT EXISTS pg_trgm". This prevents us from creating trgm indexes in a database migration unless pg_trgm is already enabled or we can temporarily grant superuser permissions in the migration. 3 use cases to consider: run-tests, arvbox, and deployment.

  • This project (https://github.com/dimitri/pgextwlist) is meant to make it possible to add extensions as a non-super user. Seems straight forward to add, but I'd want to get input before going down this road. Not sure of potential issues with deployment, etc.
  • For deployment, if we use a post-install script to enable the extension, it would have to run before db:migrate but after db:create.
  • We could use a template database for run-tests/arvbox or any instance when creating the database for the first time.
  • Could we create a new rake task to slide in between db:create and db:migrate? Can this be run as a super user?

#15 Updated by Eric Biagiotti about 2 months ago

  • Target version changed from 2019-06-05 Sprint to 2019-06-19 Sprint

#16 Updated by Eric Biagiotti about 2 months ago

  • Status changed from New to In Progress

#17 Updated by Ward Vandewege about 1 month ago

  • Release set to 22

#18 Updated by Eric Biagiotti about 1 month ago

Latest at 5081dd1ed966f76f58c9e30ad716967f14dc1991

  • Added full_text_trgm function to ArvadosModel to create the string required for creating the trgm index and querying with it. Uses much of the same code from full_text_tsvector, but I didn't change that function since its used in other migrations.
  • Added trigram search to the api for ["any", "ilike", %example%].
  • Added test to make sure trigram indexes exist on the right tables/column names.
  • Updated arvbox to install postgresql-contrib and create a superuser
  • Updated https://dev.arvados.org/projects/arvados/wiki/Hacking_prerequisites to include postgresql-contrib/superuser instructions
  • Added a note to the upgrade docs and updated the install docs to include the necessary steps for the migration to run.

Latest WB2 at 9f12beba0400015a833e2719721bf3369f4f2d94

  • Updated WB2 to use trgm indexing on full text search.

Tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1309/
WB Re-runs: https://ci.curoverse.com/job/developer-run-tests-apps-workbench-functionals/1330/ https://ci.curoverse.com/view/Developer/job/developer-run-tests-apps-workbench-integration/1358/

Some performance metrics from running arv collection list --filter on 3049 collections returning 80 results:

Mode Filter Resulting Query Time
Trigram ["any", "ilike", "%test%"]
SELECT ... FROM collections WHERE (((coalesce(owner_uuid,'') || ' ' || coalesce(modified_by_client_uuid,'') 
... ilike '%test%'))
1.9 ms
No Trigram ["any", "ilike", "%test%"]
SELECT ... FROM collections WHERE ((owner_uuid ilike '%test%' OR ... current_version_uuid ilike '%test%')) 
17.8 ms
Full Text Search ["any", "@@", "%test%"]
SELECT ... FROM collections WHERE (((coalesce(owner_uuid,'') || ' ' || coalesce(modified_by_client_uuid,'') 
... ilike '%test%' OR to_tsvector('english', substr(coalesce(owner_uuid,'') || ' ' || ... 0, 8000)) 
@@ to_tsquery('%test%')))
112.4 ms

Anything special I need to do for packaging? I built and tested packaging locally and it worked fine, but I noticed the centos7 docker file installs some postgres dependencies.

#19 Updated by Peter Amstutz about 1 month ago

if attrs_in == 'any' && operator.casecmp('ilike') && (operand.is_a? String) && operand.match('^[%].*[%]$')
  • Should include "like" (regular case-sensitive match) as well as "ilike" (case-insensitive).
  • What happens if the user request like "any like foo" so operand.match('^[%].*[%]$') doesn't apply? It looks like it will just be ignored? Should that be an error?

#20 Updated by Eric Biagiotti about 1 month ago

Peter Amstutz wrote:

[...]

  • Should include "like" (regular case-sensitive match) as well as "ilike" (case-insensitive).

Fixed in 0719567935227976b0331c9fd231a184bcabdc1a

  • What happens if the user request like "any like foo" so operand.match('^[%].*[%]$') doesn't apply? It looks like it will just be ignored? Should that be an error?

In this case, it falls through and model_class.searchable_columns (ln 139) is used to construct a query similar to row 2 in the table from note 18.

I was under the assumption that we only wanted to use the trigram index with operands that begin and end with '%'. Note that the trigram indexes are built using ArvadosModel.full_text_searchable_columns, which is not necessarily a superset of ArvadosModel.searchable_columns, which is what 'like/ilike' are using by default.

#21 Updated by Peter Amstutz about 1 month ago

Eric Biagiotti wrote:

  • What happens if the user request like "any like foo" so operand.match('^[%].*[%]$') doesn't apply? It looks like it will just be ignored? Should that be an error?

In this case, it falls through and model_class.searchable_columns (ln 139) is used to construct a query similar to row 2 in the table from note 18.

Got it. That will do a full table scan but at least it will do something reasonable.

I was under the assumption that we only wanted to use the trigram index with operands that begin and end with '%'. Note that the trigram indexes are built using ArvadosModel.full_text_searchable_columns, which is not necessarily a superset of ArvadosModel.searchable_columns, which is what 'like/ilike' are using by default.

Yes, since they are all combined into a single string that makes sense. If we need to index for leading/trailing matches we can index individual columns.

#22 Updated by Peter Amstutz about 1 month ago

0719567935227976b0331c9fd231a184bcabdc1a LGTM

(good catch adding .zero? to casecmp)

I kicked off tests:

https://ci.curoverse.com/job/developer-run-tests-services-api/1355/

#23 Updated by Eric Biagiotti about 1 month ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF