Project

General

Profile

Actions

Bug #14070

closed

API DB needs an index on collections name

Added by Joshua Randall over 5 years ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Story points:
1.0
Release relationship:
Auto

Description

Our client code that uploads data into keep looks up collections by name to see if they already exist. These queries can be very slow, in particular when the name does not exist.

The query plan for this ends up looking like this:

arvados_api_production=# explain analyze SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0;
                                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=708289.84..708289.85 rows=1 width=893) (actual time=419589.665..419589.665 rows=0 loops=1)
   ->  Sort  (cost=708289.84..708289.85 rows=1 width=893) (actual time=419589.663..419589.663 rows=0 loops=1)
         Sort Key: collections.modified_at DESC, collections.uuid
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop Anti Join  (cost=0.69..708289.83 rows=1 width=893) (actual time=419589.652..419589.652 rows=0 loops=1)
               Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
               ->  Index Scan using index_collections_on_owner_uuid_and_name on collections  (cost=0.69..708288.34 rows=1 width=893) (actual time=419589.651..419589.651 rows=0 loops=1)
                     Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text)
               ->  Materialize  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                           Filter: (trashed = 1)
 Planning time: 0.560 ms
 Execution time: 419589.734 ms
(13 rows)

Once the index is cached in memory it is significantly faster, but still slow:

arvados_api_production=# explain analyze SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0;
                                                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=708210.57..708210.58 rows=1 width=893) (actual time=3141.044..3141.044 rows=0 loops=1)
   ->  Sort  (cost=708210.57..708210.58 rows=1 width=893) (actual time=3141.043..3141.043 rows=0 loops=1)
         Sort Key: collections.modified_at DESC, collections.uuid
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop Anti Join  (cost=0.69..708210.56 rows=1 width=893) (actual time=3141.035..3141.035 rows=0 loops=1)
               Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
               ->  Index Scan using index_collections_on_owner_uuid_and_name on collections  (cost=0.69..708209.07 rows=1 width=893) (actual time=3141.034..3141.034 rows=0 loops=1)
                     Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text)
               ->  Materialize  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                           Filter: (trashed = 1)
 Planning time: 1.376 ms
 Execution time: 3141.144 ms
(13 rows)

I believe the reason for this problem is that the only index on name is the joint unique index on owner_uuid and name, in which owner_uuid is listed first. Because owner_uuid is listed before name, the index is sorted primarily by owner_uuid, so to find out if a name does not exist, the entire index must be scanned. As we have nearly 22M collections, this takes a long time.

Adding an index specific to name completely eliminates this problem:

arvados_api_production=# create index "index_collections_on_name" on collections (name);
CREATE INDEX
arvados_api_production=# explain analyze SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6.09..6.09 rows=1 width=893) (actual time=0.065..0.065 rows=0 loops=1)
   ->  Sort  (cost=6.09..6.09 rows=1 width=893) (actual time=0.064..0.064 rows=0 loops=1)
         Sort Key: collections.modified_at DESC, collections.uuid
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop Anti Join  (cost=0.56..6.08 rows=1 width=893) (actual time=0.056..0.056 rows=0 loops=1)
               Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
               ->  Index Scan using index_collections_on_name on collections  (cost=0.56..4.58 rows=1 width=893) (actual time=0.055..0.055 rows=0 loops=1)
                     Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text)
                     Filter: (NOT is_trashed)
               ->  Materialize  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                           Filter: (trashed = 1)
 Planning time: 4.513 ms
 Execution time: 0.131 ms
(14 rows)

Adding this index makes a lookup by name 100000x faster in our real-world database.


Subtasks 1 (0 open1 closed)

Task #20323: ReviewResolvedTom Clegg04/24/2023Actions

Related issues

Related to Arvados - Bug #13972: Listing collections by PDH and name can be very slowNewActions
Actions

Also available in: Atom PDF