Feature #15106
closed[API] Index 'like' queries and use for search
Description
Add a migration that creates trigram indexes on columns currently searchable by full text search. This should concatenate columns to create a single expression index (eg name || description || file_names).
Migration should execute "create extension pg_trgm" and provide a useful error message if the extension is not available (should advise the user to install postgres-contrib and/or link to the migration notes).
Update workbench2 search bar to use ["any", "ilike", "%term%"] instead of '@@'. Workbench should perform splitting of the search on whitespace and add a separate search clause for each term.
Updated by Peter Amstutz over 5 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 5 years ago
- Subject changed from Indexed like queries to [API] Index 'like' queries for search
- Status changed from In Progress to New
Updated by Peter Amstutz over 5 years ago
- Related to Feature #14573: [Spike] [API] Fully functional filename search added
Updated by Peter Amstutz over 5 years ago
- Subject changed from [API] Index 'like' queries for search to [API] Index 'like' queries and use for search
- Description updated (diff)
Updated by Peter Amstutz over 5 years ago
- Description updated (diff)
- Story points set to 2.0
Updated by Tom Morris over 5 years ago
- Target version changed from To Be Groomed to Arvados Future Sprints
Updated by Tom Morris over 5 years ago
- Target version changed from Arvados Future Sprints to 2019-06-05 Sprint
Updated by Eric Biagiotti over 5 years ago
A few questions/comments before writing the migration:
- Looking at our full text search indexes, I noticed we are using COALESCE. I just wanted to double check whether this is needed for creating the trgm index.
- I also noticed from the examples on the spike that when creating the index, columns are separated by || ' ' ||
. I'm not sure what this does. Any explanation would be helpful.
- If I created the trgm index with the following:
create index collection_trgm_idx on public.collections USING gin ((file_names || ' ' || (properties)::text) public.gin_trgm_ops)
Then the following where clause is required to use the trgm index:
select * from collections where (file_names || ' ' || (properties)::text) ilike '%176DEMH%'"
A regular query like the following will not use the trgm index:
select * from collections where file_names ilike '%176DEMH%'"
I haven't dug into record_filter in full detail yet, but I'm not sure it will satisfy the above. For example, the "any" attribute uses the searchable_columns function, which doesn't appear to handle jsonb types. Could it be possible that we need another operator similar to @@, or am I way off base?
Updated by Peter Amstutz over 5 years ago
Eric Biagiotti wrote:
A few questions/comments before writing the migration:
- Looking at our full text search indexes, I noticed we are using COALESCE. I just wanted to double check whether this is needed for creating the trgm index.
I think that's because it isn't legal to concatenate a string with NULL. COALESCE specifies that if the column is null, use an empty string instead.
- I also noticed from the examples on the spike that when creating the index, columns are separated by
|| ' ' ||
. I'm not sure what this does. Any explanation would be helpful.
||
is the string concatenation operator. This is just adding a space between each column text.
- If I created the trgm index with the following:
create index collection_trgm_idx on public.collections USING gin ((file_names || ' ' || (properties)::text) public.gin_trgm_ops)
Then the following where clause is required to use the trgm index:
select * from collections where (file_names || ' ' || (properties)::text) ilike '%176DEMH%'"
A regular query like the following will not use the trgm index:
select * from collections where file_names ilike '%176DEMH%'"
The trick here is that we want to build the index without actually creating and storing an additional column, so we make an "expression index". In order to use the index, you have to query on exactly the same expression.
I haven't dug into record_filter in full detail yet, but I'm not sure it will satisfy the above. For example, the "any" attribute uses the searchable_columns function, which doesn't appear to handle jsonb types. Could it be possible that we need another operator similar to @@, or am I way off base?
We're only supporting searches on 'any'. The record_filter code should be able to recognize ["any", "ilike", "%176DEMH%"]
and build the appropriate query to use the trigram index.
You should be able to coerce jsonb columns to a string representation. It might be that you are doing it already with properties::text
.
Updated by Eric Biagiotti over 5 years ago
Some migrations statistics using the e51c5 data on my laptop.
Table Name | # of Rows | Index Creation Time (s) |
collections | 4228696 | 575.6284 |
container_requests | 873201 | 219.8006 |
groups | 1789 | 0.1482 |
jobs | 359324 | 134.3642 |
pipeline_instances | 10721 | 31.8514 |
pipeline_templates | 6 | 0.0074 |
workflows | 19 | 0.0024 |
Total Migration Time: ~ 16 minutes
Updated by Eric Biagiotti over 5 years ago
Some questions/comments about needing to be a postgres superuser to execute "CREATE EXTENSION IF NOT EXISTS pg_trgm". This prevents us from creating trgm indexes in a database migration unless pg_trgm is already enabled or we can temporarily grant superuser permissions in the migration. 3 use cases to consider: run-tests, arvbox, and deployment.
- This project (https://github.com/dimitri/pgextwlist) is meant to make it possible to add extensions as a non-super user. Seems straight forward to add, but I'd want to get input before going down this road. Not sure of potential issues with deployment, etc.
- For deployment, if we use a post-install script to enable the extension, it would have to run before db:migrate but after db:create.
- We could use a template database for run-tests/arvbox or any instance when creating the database for the first time.
- Could we create a new rake task to slide in between db:create and db:migrate? Can this be run as a super user?
Updated by Eric Biagiotti over 5 years ago
- Target version changed from 2019-06-05 Sprint to 2019-06-19 Sprint
Updated by Eric Biagiotti over 5 years ago
- Status changed from New to In Progress
Updated by Eric Biagiotti over 5 years ago
Latest at 5081dd1ed966f76f58c9e30ad716967f14dc1991
- Added full_text_trgm function to ArvadosModel to create the string required for creating the trgm index and querying with it. Uses much of the same code from full_text_tsvector, but I didn't change that function since its used in other migrations.
- Added trigram search to the api for
["any", "ilike", %example%]
. - Added test to make sure trigram indexes exist on the right tables/column names.
- Updated arvbox to install postgresql-contrib and create a superuser
- Updated https://dev.arvados.org/projects/arvados/wiki/Hacking_prerequisites to include postgresql-contrib/superuser instructions
- Added a note to the upgrade docs and updated the install docs to include the necessary steps for the migration to run.
Latest WB2 at 9f12beba0400015a833e2719721bf3369f4f2d94
- Updated WB2 to use trgm indexing on full text search.
Tests: https://ci.curoverse.com/view/Developer/job/developer-run-tests/1309/
WB Re-runs: https://ci.curoverse.com/job/developer-run-tests-apps-workbench-functionals/1330/ https://ci.curoverse.com/view/Developer/job/developer-run-tests-apps-workbench-integration/1358/
Some performance metrics from running arv collection list --filter
on 3049 collections returning 80 results:
Mode | Filter | Resulting Query | Time |
Trigram | ["any", "ilike", "%test%"] | SELECT ... FROM collections WHERE (((coalesce(owner_uuid,'') || ' ' || coalesce(modified_by_client_uuid,'') ... ilike '%test%')) |
1.9 ms |
No Trigram | ["any", "ilike", "%test%"] | SELECT ... FROM collections WHERE ((owner_uuid ilike '%test%' OR ... current_version_uuid ilike '%test%')) |
17.8 ms |
Full Text Search | ["any", "@@", "%test%"] | SELECT ... FROM collections WHERE (((coalesce(owner_uuid,'') || ' ' || coalesce(modified_by_client_uuid,'') ... ilike '%test%' OR to_tsvector('english', substr(coalesce(owner_uuid,'') || ' ' || ... 0, 8000)) @@ to_tsquery('%test%'))) |
112.4 ms |
Anything special I need to do for packaging? I built and tested packaging locally and it worked fine, but I noticed the centos7 docker file installs some postgres dependencies.
Updated by Peter Amstutz over 5 years ago
if attrs_in == 'any' && operator.casecmp('ilike') && (operand.is_a? String) && operand.match('^[%].*[%]$')
- Should include "like" (regular case-sensitive match) as well as "ilike" (case-insensitive).
- What happens if the user request like "any like foo" so operand.match('^[%].*[%]$') doesn't apply? It looks like it will just be ignored? Should that be an error?
Updated by Eric Biagiotti over 5 years ago
Peter Amstutz wrote:
[...]
- Should include "like" (regular case-sensitive match) as well as "ilike" (case-insensitive).
Fixed in 0719567935227976b0331c9fd231a184bcabdc1a
- What happens if the user request like "any like foo" so operand.match('^[%].*[%]$') doesn't apply? It looks like it will just be ignored? Should that be an error?
In this case, it falls through and model_class.searchable_columns (ln 139) is used to construct a query similar to row 2 in the table from note 18.
I was under the assumption that we only wanted to use the trigram index with operands that begin and end with '%'. Note that the trigram indexes are built using ArvadosModel.full_text_searchable_columns, which is not necessarily a superset of ArvadosModel.searchable_columns, which is what 'like/ilike' are using by default.
Updated by Peter Amstutz over 5 years ago
Eric Biagiotti wrote:
- What happens if the user request like "any like foo" so operand.match('^[%].*[%]$') doesn't apply? It looks like it will just be ignored? Should that be an error?
In this case, it falls through and model_class.searchable_columns (ln 139) is used to construct a query similar to row 2 in the table from note 18.
Got it. That will do a full table scan but at least it will do something reasonable.
I was under the assumption that we only wanted to use the trigram index with operands that begin and end with '%'. Note that the trigram indexes are built using ArvadosModel.full_text_searchable_columns, which is not necessarily a superset of ArvadosModel.searchable_columns, which is what 'like/ilike' are using by default.
Yes, since they are all combined into a single string that makes sense. If we need to index for leading/trailing matches we can index individual columns.
Updated by Peter Amstutz over 5 years ago
0719567935227976b0331c9fd231a184bcabdc1a LGTM
(good catch adding .zero? to casecmp)
I kicked off tests:
https://ci.curoverse.com/job/developer-run-tests-services-api/1355/
Updated by Eric Biagiotti over 5 years ago
- Status changed from In Progress to Resolved