Project

General

Profile

Actions

Bug #14070

closed

API DB needs an index on collections name

Added by Joshua Randall over 5 years ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Story points:
1.0
Release relationship:
Auto

Description

Our client code that uploads data into keep looks up collections by name to see if they already exist. These queries can be very slow, in particular when the name does not exist.

The query plan for this ends up looking like this:

arvados_api_production=# explain analyze SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0;
                                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=708289.84..708289.85 rows=1 width=893) (actual time=419589.665..419589.665 rows=0 loops=1)
   ->  Sort  (cost=708289.84..708289.85 rows=1 width=893) (actual time=419589.663..419589.663 rows=0 loops=1)
         Sort Key: collections.modified_at DESC, collections.uuid
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop Anti Join  (cost=0.69..708289.83 rows=1 width=893) (actual time=419589.652..419589.652 rows=0 loops=1)
               Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
               ->  Index Scan using index_collections_on_owner_uuid_and_name on collections  (cost=0.69..708288.34 rows=1 width=893) (actual time=419589.651..419589.651 rows=0 loops=1)
                     Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text)
               ->  Materialize  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                           Filter: (trashed = 1)
 Planning time: 0.560 ms
 Execution time: 419589.734 ms
(13 rows)

Once the index is cached in memory it is significantly faster, but still slow:

arvados_api_production=# explain analyze SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0;
                                                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=708210.57..708210.58 rows=1 width=893) (actual time=3141.044..3141.044 rows=0 loops=1)
   ->  Sort  (cost=708210.57..708210.58 rows=1 width=893) (actual time=3141.043..3141.043 rows=0 loops=1)
         Sort Key: collections.modified_at DESC, collections.uuid
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop Anti Join  (cost=0.69..708210.56 rows=1 width=893) (actual time=3141.035..3141.035 rows=0 loops=1)
               Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
               ->  Index Scan using index_collections_on_owner_uuid_and_name on collections  (cost=0.69..708209.07 rows=1 width=893) (actual time=3141.034..3141.034 rows=0 loops=1)
                     Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text)
               ->  Materialize  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                           Filter: (trashed = 1)
 Planning time: 1.376 ms
 Execution time: 3141.144 ms
(13 rows)

I believe the reason for this problem is that the only index on name is the joint unique index on owner_uuid and name, in which owner_uuid is listed first. Because owner_uuid is listed before name, the index is sorted primarily by owner_uuid, so to find out if a name does not exist, the entire index must be scanned. As we have nearly 22M collections, this takes a long time.

Adding an index specific to name completely eliminates this problem:

arvados_api_production=# create index "index_collections_on_name" on collections (name);
CREATE INDEX
arvados_api_production=# explain analyze SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6.09..6.09 rows=1 width=893) (actual time=0.065..0.065 rows=0 loops=1)
   ->  Sort  (cost=6.09..6.09 rows=1 width=893) (actual time=0.064..0.064 rows=0 loops=1)
         Sort Key: collections.modified_at DESC, collections.uuid
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop Anti Join  (cost=0.56..6.08 rows=1 width=893) (actual time=0.056..0.056 rows=0 loops=1)
               Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
               ->  Index Scan using index_collections_on_name on collections  (cost=0.56..4.58 rows=1 width=893) (actual time=0.055..0.055 rows=0 loops=1)
                     Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text)
                     Filter: (NOT is_trashed)
               ->  Materialize  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                           Filter: (trashed = 1)
 Planning time: 4.513 ms
 Execution time: 0.131 ms
(14 rows)

Adding this index makes a lookup by name 100000x faster in our real-world database.


Subtasks 1 (0 open1 closed)

Task #20323: ReviewResolvedTom Clegg04/24/2023Actions

Related issues

Related to Arvados - Bug #13972: Listing collections by PDH and name can be very slowNewActions
Actions #1

Updated by Tom Morris over 5 years ago

  • Target version set to 2018-10-31 sprint
Actions #2

Updated by Tom Morris over 5 years ago

  • Target version deleted (2018-10-31 sprint)
Actions #3

Updated by Peter Amstutz about 1 year ago

  • Release set to 60
Actions #4

Updated by Brett Smith about 1 year ago

  • Related to Bug #13972: Listing collections by PDH and name can be very slow added
Actions #5

Updated by Peter Amstutz about 1 year ago

  • Release deleted (60)
  • Target version set to Future

I feel like it is highly likely it was fixed but someone should double check.

Actions #6

Updated by Brett Smith about 1 year ago

Peter Amstutz wrote in #note-5:

I feel like it is highly likely it was fixed but someone should double check.

We have an index on owner_uuid and name, created in 20180919001158_recreate_collection_unique_name_index.rb as part of #13561, but we don't have an index on just name as suggested in this issue. But maybe it still meets the need? The reported search also checks owner_uuid.

Actions #7

Updated by Brett Smith about 1 year ago

  • Story points set to 0.5
Actions #8

Updated by Peter Amstutz about 1 year ago

Brett Smith wrote in #note-6:

Peter Amstutz wrote in #note-5:

I feel like it is highly likely it was fixed but someone should double check.

We have an index on owner_uuid and name, created in 20180919001158_recreate_collection_unique_name_index.rb as part of #13561, but we don't have an index on just name as suggested in this issue. But maybe it still meets the need? The reported search also checks owner_uuid.

I guess the issue here is that if the name doesn't exist at all, the query can terminate much faster. Instead, it is doing a full table scan for a name only to come up empty.

A separate standalone "name" index is probably not a bad idea.

Actions #9

Updated by Peter Amstutz about 1 year ago

  • Story points changed from 0.5 to 1.0
  • Target version changed from Future to To be scheduled

Should do this for both collections and projects.

Adding a simple index on 'name' is low risk.

Investigate if adding a trigram index makes sense so that 'like' and 'ilike' searches are also efficient.

Actions #10

Updated by Peter Amstutz about 1 year ago

  • Target version changed from To be scheduled to Development 2023-04-26 sprint
Actions #11

Updated by Tom Clegg about 1 year ago

  • Assigned To set to Tom Clegg
Actions #12

Updated by Tom Clegg 12 months ago

Peter Amstutz wrote in #note-9:

Investigate if adding a trigram index makes sense so that 'like' and 'ilike' searches are also efficient.

From https://www.postgresql.org/docs/current/pgtrgm.html

"These index types support the above-described similarity operators, and additionally support trigram-based index searches for LIKE, ILIKE, ~, ~* and = queries. Inequality operators are not supported. Note that those indexes may not be as efficient as regular B-tree indexes for equality operator."

I think the short version is that if we want to speed up name like '%string%' as well as name = 'string', then trigram is better. If we want to optimize for = and sacrifice un-anchored like, b-tree is better (smaller index, faster search).

Here's a branch with the trigram option.

14070-name-index @ b4dbb53529d133660403c1761f06cb321cbc70f6 -- developer-run-tests: #3617

Actions #13

Updated by Tom Clegg 12 months ago

  • Status changed from New to In Progress
Actions #14

Updated by Tom Clegg 12 months ago

This was reported 2018-08-16 and we added the multi-column trigram indexes 2019-05-23. But a name = 'string' search on a current database still uses the multi-column (non-trigram) collections_search_index which is presumably not better than the owner_uuid_and_name index reported here.

Actions #16

Updated by Brett Smith 12 months ago

Tom Clegg wrote in #note-12:

I think the short version is that if we want to speed up name like '%string%' as well as name = 'string', then trigram is better. If we want to optimize for = and sacrifice un-anchored like, b-tree is better (smaller index, faster search).

Here's a branch with the trigram option.

Without some metrics or other specifics pushing us to prioritize equality performance, I agree improving the performance of more searches seems like the better trade-off. This is good to merge, thanks.

Actions #17

Updated by Tom Clegg 12 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #18

Updated by Peter Amstutz 8 months ago

  • Release set to 66
Actions

Also available in: Atom PDF