Project

General

Profile

Actions

Feature #15106

closed

[API] Index 'like' queries and use for search

Added by Peter Amstutz almost 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Eric Biagiotti
Category:
-
Target version:
Story points:
2.0
Release relationship:
Auto

Description

Add a migration that creates trigram indexes on columns currently searchable by full text search. This should concatenate columns to create a single expression index (eg name || description || file_names).

Migration should execute "create extension pg_trgm" and provide a useful error message if the extension is not available (should advise the user to install postgres-contrib and/or link to the migration notes).

Update workbench2 search bar to use ["any", "ilike", "%term%"] instead of '@@'. Workbench should perform splitting of the search on whitespace and add a separate search clause for each term.


Subtasks 1 (0 open1 closed)

Task #15264: Review 15106-trgm-text-searchResolvedEric Biagiotti06/14/2019Actions

Related issues

Related to Arvados - Feature #14573: [Spike] [API] Fully functional filename searchResolvedPeter AmstutzActions
Actions #1

Updated by Peter Amstutz almost 5 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz almost 5 years ago

  • Subject changed from Indexed like queries to [API] Index 'like' queries for search
  • Status changed from In Progress to New
Actions #3

Updated by Peter Amstutz almost 5 years ago

  • Related to Feature #14573: [Spike] [API] Fully functional filename search added
Actions #4

Updated by Peter Amstutz almost 5 years ago

  • Subject changed from [API] Index 'like' queries for search to [API] Index 'like' queries and use for search
  • Description updated (diff)
Actions #5

Updated by Peter Amstutz almost 5 years ago

  • Description updated (diff)
Actions #6

Updated by Peter Amstutz almost 5 years ago

  • Description updated (diff)
Actions #7

Updated by Peter Amstutz almost 5 years ago

  • Description updated (diff)
  • Story points set to 2.0
Actions #8

Updated by Tom Morris almost 5 years ago

  • Target version changed from To Be Groomed to Arvados Future Sprints
Actions #9

Updated by Tom Morris almost 5 years ago

  • Target version changed from Arvados Future Sprints to 2019-06-05 Sprint
Actions #10

Updated by Eric Biagiotti almost 5 years ago

  • Assigned To set to Eric Biagiotti
Actions #11

Updated by Eric Biagiotti almost 5 years ago

A few questions/comments before writing the migration:

- Looking at our full text search indexes, I noticed we are using COALESCE. I just wanted to double check whether this is needed for creating the trgm index.

- I also noticed from the examples on the spike that when creating the index, columns are separated by || ' ' ||. I'm not sure what this does. Any explanation would be helpful.

- If I created the trgm index with the following:

create index collection_trgm_idx on public.collections USING gin ((file_names || ' ' || (properties)::text) public.gin_trgm_ops)

Then the following where clause is required to use the trgm index:

select * from collections where (file_names || ' ' || (properties)::text) ilike '%176DEMH%'" 

A regular query like the following will not use the trgm index:

select * from collections where file_names ilike '%176DEMH%'" 

I haven't dug into record_filter in full detail yet, but I'm not sure it will satisfy the above. For example, the "any" attribute uses the searchable_columns function, which doesn't appear to handle jsonb types. Could it be possible that we need another operator similar to @@, or am I way off base?

Actions #12

Updated by Peter Amstutz almost 5 years ago

Eric Biagiotti wrote:

A few questions/comments before writing the migration:

- Looking at our full text search indexes, I noticed we are using COALESCE. I just wanted to double check whether this is needed for creating the trgm index.

I think that's because it isn't legal to concatenate a string with NULL. COALESCE specifies that if the column is null, use an empty string instead.

- I also noticed from the examples on the spike that when creating the index, columns are separated by || ' ' ||. I'm not sure what this does. Any explanation would be helpful.

|| is the string concatenation operator. This is just adding a space between each column text.

- If I created the trgm index with the following:

create index collection_trgm_idx on public.collections USING gin ((file_names || ' ' || (properties)::text) public.gin_trgm_ops)

Then the following where clause is required to use the trgm index:

select * from collections where (file_names || ' ' || (properties)::text) ilike '%176DEMH%'"

A regular query like the following will not use the trgm index:

select * from collections where file_names ilike '%176DEMH%'"

The trick here is that we want to build the index without actually creating and storing an additional column, so we make an "expression index". In order to use the index, you have to query on exactly the same expression.

I haven't dug into record_filter in full detail yet, but I'm not sure it will satisfy the above. For example, the "any" attribute uses the searchable_columns function, which doesn't appear to handle jsonb types. Could it be possible that we need another operator similar to @@, or am I way off base?

We're only supporting searches on 'any'. The record_filter code should be able to recognize ["any", "ilike", "%176DEMH%"] and build the appropriate query to use the trigram index.

You should be able to coerce jsonb columns to a string representation. It might be that you are doing it already with properties::text.

Actions #13

Updated by Eric Biagiotti almost 5 years ago

Some migrations statistics using the e51c5 data on my laptop.

Table Name # of Rows Index Creation Time (s)
collections 4228696 575.6284
container_requests 873201 219.8006
groups 1789 0.1482
jobs 359324 134.3642
pipeline_instances 10721 31.8514
pipeline_templates 6 0.0074
workflows 19 0.0024

Total Migration Time: ~ 16 minutes

Actions #14

Updated by Eric Biagiotti almost 5 years ago

Some questions/comments about needing to be a postgres superuser to execute "CREATE EXTENSION IF NOT EXISTS pg_trgm". This prevents us from creating trgm indexes in a database migration unless pg_trgm is already enabled or we can temporarily grant superuser permissions in the migration. 3 use cases to consider: run-tests, arvbox, and deployment.

  • This project (https://github.com/dimitri/pgextwlist) is meant to make it possible to add extensions as a non-super user. Seems straight forward to add, but I'd want to get input before going down this road. Not sure of potential issues with deployment, etc.
  • For deployment, if we use a post-install script to enable the extension, it would have to run before db:migrate but after db:create.
  • We could use a template database for run-tests/arvbox or any instance when creating the database for the first time.
  • Could we create a new rake task to slide in between db:create and db:migrate? Can this be run as a super user?
Actions #15

Updated by Eric Biagiotti almost 5 years ago

  • Target version changed from 2019-06-05 Sprint to 2019-06-19 Sprint
Actions #16

Updated by Eric Biagiotti almost 5 years ago

  • Status changed from New to In Progress
Actions #17

Updated by Ward Vandewege almost 5 years ago

  • Release set to 22
Actions #18

Updated by Eric Biagiotti almost 5 years ago

Latest at 5081dd1ed966f76f58c9e30ad716967f14dc1991

  • Added full_text_trgm function to ArvadosModel to create the string required for creating the trgm index and querying with it. Uses much of the same code from full_text_tsvector, but I didn't change that function since its used in other migrations.
  • Added trigram search to the api for ["any", "ilike", %example%].
  • Added test to make sure trigram indexes exist on the right tables/column names.
  • Updated arvbox to install postgresql-contrib and create a superuser
  • Updated https://dev.arvados.org/projects/arvados/wiki/Hacking_prerequisites to include postgresql-contrib/superuser instructions
  • Added a note to the upgrade docs and updated the install docs to include the necessary steps for the migration to run.

Latest WB2 at 9f12beba0400015a833e2719721bf3369f4f2d94

  • Updated WB2 to use trgm indexing on full text search.

Tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1309/
WB Re-runs: https://ci.curoverse.com/job/developer-run-tests-apps-workbench-functionals/1330/ https://ci.curoverse.com/view/Developer/job/developer-run-tests-apps-workbench-integration/1358/

Some performance metrics from running arv collection list --filter on 3049 collections returning 80 results:

Mode Filter Resulting Query Time
Trigram ["any", "ilike", "%test%"]
SELECT ... FROM collections WHERE (((coalesce(owner_uuid,'') || ' ' || coalesce(modified_by_client_uuid,'') 
... ilike '%test%'))
1.9 ms
No Trigram ["any", "ilike", "%test%"]
SELECT ... FROM collections WHERE ((owner_uuid ilike '%test%' OR ... current_version_uuid ilike '%test%')) 
17.8 ms
Full Text Search ["any", "@@", "%test%"]
SELECT ... FROM collections WHERE (((coalesce(owner_uuid,'') || ' ' || coalesce(modified_by_client_uuid,'') 
... ilike '%test%' OR to_tsvector('english', substr(coalesce(owner_uuid,'') || ' ' || ... 0, 8000)) 
@@ to_tsquery('%test%')))
112.4 ms

Anything special I need to do for packaging? I built and tested packaging locally and it worked fine, but I noticed the centos7 docker file installs some postgres dependencies.

Actions #19

Updated by Peter Amstutz almost 5 years ago

if attrs_in == 'any' && operator.casecmp('ilike') && (operand.is_a? String) && operand.match('^[%].*[%]$')
  • Should include "like" (regular case-sensitive match) as well as "ilike" (case-insensitive).
  • What happens if the user request like "any like foo" so operand.match('^[%].*[%]$') doesn't apply? It looks like it will just be ignored? Should that be an error?
Actions #20

Updated by Eric Biagiotti almost 5 years ago

Peter Amstutz wrote:

[...]

  • Should include "like" (regular case-sensitive match) as well as "ilike" (case-insensitive).

Fixed in 0719567935227976b0331c9fd231a184bcabdc1a

  • What happens if the user request like "any like foo" so operand.match('^[%].*[%]$') doesn't apply? It looks like it will just be ignored? Should that be an error?

In this case, it falls through and model_class.searchable_columns (ln 139) is used to construct a query similar to row 2 in the table from note 18.

I was under the assumption that we only wanted to use the trigram index with operands that begin and end with '%'. Note that the trigram indexes are built using ArvadosModel.full_text_searchable_columns, which is not necessarily a superset of ArvadosModel.searchable_columns, which is what 'like/ilike' are using by default.

Actions #21

Updated by Peter Amstutz almost 5 years ago

Eric Biagiotti wrote:

  • What happens if the user request like "any like foo" so operand.match('^[%].*[%]$') doesn't apply? It looks like it will just be ignored? Should that be an error?

In this case, it falls through and model_class.searchable_columns (ln 139) is used to construct a query similar to row 2 in the table from note 18.

Got it. That will do a full table scan but at least it will do something reasonable.

I was under the assumption that we only wanted to use the trigram index with operands that begin and end with '%'. Note that the trigram indexes are built using ArvadosModel.full_text_searchable_columns, which is not necessarily a superset of ArvadosModel.searchable_columns, which is what 'like/ilike' are using by default.

Yes, since they are all combined into a single string that makes sense. If we need to index for leading/trailing matches we can index individual columns.

Actions #22

Updated by Peter Amstutz almost 5 years ago

0719567935227976b0331c9fd231a184bcabdc1a LGTM

(good catch adding .zero? to casecmp)

I kicked off tests:

https://ci.curoverse.com/job/developer-run-tests-services-api/1355/

Actions #23

Updated by Eric Biagiotti almost 5 years ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF