Bug #14070
closedAPI DB needs an index on collections name
Description
Our client code that uploads data into keep looks up collections by name to see if they already exist. These queries can be very slow, in particular when the name does not exist.
The query plan for this ends up looking like this:
arvados_api_production=# explain analyze SELECT collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram')) ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=708289.84..708289.85 rows=1 width=893) (actual time=419589.665..419589.665 rows=0 loops=1) -> Sort (cost=708289.84..708289.85 rows=1 width=893) (actual time=419589.663..419589.663 rows=0 loops=1) Sort Key: collections.modified_at DESC, collections.uuid Sort Method: quicksort Memory: 25kB -> Nested Loop Anti Join (cost=0.69..708289.83 rows=1 width=893) (actual time=419589.652..419589.652 rows=0 loops=1) Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text) -> Index Scan using index_collections_on_owner_uuid_and_name on collections (cost=0.69..708288.34 rows=1 width=893) (actual time=419589.651..419589.651 rows=0 loops=1) Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text) -> Materialize (cost=0.00..1.48 rows=1 width=28) (never executed) -> Seq Scan on materialized_permission_view (cost=0.00..1.48 rows=1 width=28) (never executed) Filter: (trashed = 1) Planning time: 0.560 ms Execution time: 419589.734 ms (13 rows)
Once the index is cached in memory it is significantly faster, but still slow:
arvados_api_production=# explain analyze SELECT collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram')) ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=708210.57..708210.58 rows=1 width=893) (actual time=3141.044..3141.044 rows=0 loops=1) -> Sort (cost=708210.57..708210.58 rows=1 width=893) (actual time=3141.043..3141.043 rows=0 loops=1) Sort Key: collections.modified_at DESC, collections.uuid Sort Method: quicksort Memory: 25kB -> Nested Loop Anti Join (cost=0.69..708210.56 rows=1 width=893) (actual time=3141.035..3141.035 rows=0 loops=1) Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text) -> Index Scan using index_collections_on_owner_uuid_and_name on collections (cost=0.69..708209.07 rows=1 width=893) (actual time=3141.034..3141.034 rows=0 loops=1) Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text) -> Materialize (cost=0.00..1.48 rows=1 width=28) (never executed) -> Seq Scan on materialized_permission_view (cost=0.00..1.48 rows=1 width=28) (never executed) Filter: (trashed = 1) Planning time: 1.376 ms Execution time: 3141.144 ms (13 rows)
I believe the reason for this problem is that the only index on name is the joint unique index on owner_uuid and name, in which owner_uuid is listed first. Because owner_uuid is listed before name, the index is sorted primarily by owner_uuid, so to find out if a name does not exist, the entire index must be scanned. As we have nearly 22M collections, this takes a long time.
Adding an index specific to name completely eliminates this problem:
arvados_api_production=# create index "index_collections_on_name" on collections (name); CREATE INDEX arvados_api_production=# explain analyze SELECT collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram')) ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=6.09..6.09 rows=1 width=893) (actual time=0.065..0.065 rows=0 loops=1) -> Sort (cost=6.09..6.09 rows=1 width=893) (actual time=0.064..0.064 rows=0 loops=1) Sort Key: collections.modified_at DESC, collections.uuid Sort Method: quicksort Memory: 25kB -> Nested Loop Anti Join (cost=0.56..6.08 rows=1 width=893) (actual time=0.056..0.056 rows=0 loops=1) Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text) -> Index Scan using index_collections_on_name on collections (cost=0.56..4.58 rows=1 width=893) (actual time=0.055..0.055 rows=0 loops=1) Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text) Filter: (NOT is_trashed) -> Materialize (cost=0.00..1.48 rows=1 width=28) (never executed) -> Seq Scan on materialized_permission_view (cost=0.00..1.48 rows=1 width=28) (never executed) Filter: (trashed = 1) Planning time: 4.513 ms Execution time: 0.131 ms (14 rows)
Adding this index makes a lookup by name 100000x faster in our real-world database.
Updated by Tom Morris about 6 years ago
- Target version deleted (
2018-10-31 sprint)
Updated by Brett Smith almost 2 years ago
- Related to Bug #13972: Listing collections by PDH and name can be very slow added
Updated by Peter Amstutz almost 2 years ago
- Release deleted (
60) - Target version set to Future
I feel like it is highly likely it was fixed but someone should double check.
Updated by Brett Smith almost 2 years ago
Peter Amstutz wrote in #note-5:
I feel like it is highly likely it was fixed but someone should double check.
We have an index on owner_uuid and name, created in 20180919001158_recreate_collection_unique_name_index.rb
as part of #13561, but we don't have an index on just name as suggested in this issue. But maybe it still meets the need? The reported search also checks owner_uuid.
Updated by Peter Amstutz almost 2 years ago
Brett Smith wrote in #note-6:
Peter Amstutz wrote in #note-5:
I feel like it is highly likely it was fixed but someone should double check.
We have an index on owner_uuid and name, created in
20180919001158_recreate_collection_unique_name_index.rb
as part of #13561, but we don't have an index on just name as suggested in this issue. But maybe it still meets the need? The reported search also checks owner_uuid.
I guess the issue here is that if the name doesn't exist at all, the query can terminate much faster. Instead, it is doing a full table scan for a name only to come up empty.
A separate standalone "name" index is probably not a bad idea.
Updated by Peter Amstutz over 1 year ago
- Story points changed from 0.5 to 1.0
- Target version changed from Future to To be scheduled
Should do this for both collections and projects.
Adding a simple index on 'name' is low risk.
Investigate if adding a trigram index makes sense so that 'like' and 'ilike' searches are also efficient.
Updated by Peter Amstutz over 1 year ago
- Target version changed from To be scheduled to Development 2023-04-26 sprint
Updated by Tom Clegg over 1 year ago
Peter Amstutz wrote in #note-9:
Investigate if adding a trigram index makes sense so that 'like' and 'ilike' searches are also efficient.
From https://www.postgresql.org/docs/current/pgtrgm.html
"These index types support the above-described similarity operators, and additionally support trigram-based index searches for LIKE, ILIKE, ~, ~* and = queries. Inequality operators are not supported. Note that those indexes may not be as efficient as regular B-tree indexes for equality operator."
I think the short version is that if we want to speed up name like '%string%'
as well as name = 'string'
, then trigram is better. If we want to optimize for =
and sacrifice un-anchored like
, b-tree is better (smaller index, faster search).
Here's a branch with the trigram option.
14070-name-index @ b4dbb53529d133660403c1761f06cb321cbc70f6 -- developer-run-tests: #3617
Updated by Tom Clegg over 1 year ago
This was reported 2018-08-16 and we added the multi-column trigram indexes 2019-05-23. But a name = 'string'
search on a current database still uses the multi-column (non-trigram) collections_search_index
which is presumably not better than the owner_uuid_and_name
index reported here.
Updated by Tom Clegg over 1 year ago
14070-name-index @ b4dbb53529d133660403c1761f06cb321cbc70f6 -- developer-run-tests: #3619
Updated by Brett Smith over 1 year ago
Tom Clegg wrote in #note-12:
I think the short version is that if we want to speed up
name like '%string%'
as well asname = 'string'
, then trigram is better. If we want to optimize for=
and sacrifice un-anchoredlike
, b-tree is better (smaller index, faster search).Here's a branch with the trigram option.
Without some metrics or other specifics pushing us to prioritize equality performance, I agree improving the performance of more searches seems like the better trade-off. This is good to merge, thanks.
Updated by Tom Clegg over 1 year ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Applied in changeset arvados|604f652987e7200f24eec4e31393238ec3960989.