Project

General

Profile

Actions

Bug #14070

open

API DB needs an index on collections name

Added by Joshua Randall over 4 years ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
API
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
0.5

Description

Our client code that uploads data into keep looks up collections by name to see if they already exist. These queries can be very slow, in particular when the name does not exist.

The query plan for this ends up looking like this:

arvados_api_production=# explain analyze SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0;
                                                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=708289.84..708289.85 rows=1 width=893) (actual time=419589.665..419589.665 rows=0 loops=1)
   ->  Sort  (cost=708289.84..708289.85 rows=1 width=893) (actual time=419589.663..419589.663 rows=0 loops=1)
         Sort Key: collections.modified_at DESC, collections.uuid
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop Anti Join  (cost=0.69..708289.83 rows=1 width=893) (actual time=419589.652..419589.652 rows=0 loops=1)
               Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
               ->  Index Scan using index_collections_on_owner_uuid_and_name on collections  (cost=0.69..708288.34 rows=1 width=893) (actual time=419589.651..419589.651 rows=0 loops=1)
                     Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text)
               ->  Materialize  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                           Filter: (trashed = 1)
 Planning time: 0.560 ms
 Execution time: 419589.734 ms
(13 rows)

Once the index is cached in memory it is significantly faster, but still slow:

arvados_api_production=# explain analyze SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0;
                                                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=708210.57..708210.58 rows=1 width=893) (actual time=3141.044..3141.044 rows=0 loops=1)
   ->  Sort  (cost=708210.57..708210.58 rows=1 width=893) (actual time=3141.043..3141.043 rows=0 loops=1)
         Sort Key: collections.modified_at DESC, collections.uuid
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop Anti Join  (cost=0.69..708210.56 rows=1 width=893) (actual time=3141.035..3141.035 rows=0 loops=1)
               Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
               ->  Index Scan using index_collections_on_owner_uuid_and_name on collections  (cost=0.69..708209.07 rows=1 width=893) (actual time=3141.034..3141.034 rows=0 loops=1)
                     Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text)
               ->  Materialize  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                           Filter: (trashed = 1)
 Planning time: 1.376 ms
 Execution time: 3141.144 ms
(13 rows)

I believe the reason for this problem is that the only index on name is the joint unique index on owner_uuid and name, in which owner_uuid is listed first. Because owner_uuid is listed before name, the index is sorted primarily by owner_uuid, so to find out if a name does not exist, the entire index must be scanned. As we have nearly 22M collections, this takes a long time.

Adding an index specific to name completely eliminates this problem:

arvados_api_production=# create index "index_collections_on_name" on collections (name);
CREATE INDEX
arvados_api_production=# explain analyze SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.name = 'irods://seq/26250/26250_3#1.cram'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 100 OFFSET 0;
                                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6.09..6.09 rows=1 width=893) (actual time=0.065..0.065 rows=0 loops=1)
   ->  Sort  (cost=6.09..6.09 rows=1 width=893) (actual time=0.064..0.064 rows=0 loops=1)
         Sort Key: collections.modified_at DESC, collections.uuid
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop Anti Join  (cost=0.56..6.08 rows=1 width=893) (actual time=0.056..0.056 rows=0 loops=1)
               Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
               ->  Index Scan using index_collections_on_name on collections  (cost=0.56..4.58 rows=1 width=893) (actual time=0.055..0.055 rows=0 loops=1)
                     Index Cond: ((name)::text = 'irods://seq/26250/26250_3#1.cram'::text)
                     Filter: (NOT is_trashed)
               ->  Materialize  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                           Filter: (trashed = 1)
 Planning time: 4.513 ms
 Execution time: 0.131 ms
(14 rows)

Adding this index makes a lookup by name 100000x faster in our real-world database.


Related issues

Related to Arvados - Bug #13972: Listing collections by PDH and name can be very slowNew

Actions
Actions #1

Updated by Tom Morris over 4 years ago

  • Target version set to 2018-10-31 sprint
Actions #2

Updated by Tom Morris over 4 years ago

  • Target version deleted (2018-10-31 sprint)
Actions #3

Updated by Peter Amstutz about 2 months ago

  • Release set to 60
Actions #4

Updated by Brett Smith about 2 months ago

  • Related to Bug #13972: Listing collections by PDH and name can be very slow added
Actions #5

Updated by Peter Amstutz about 1 month ago

  • Release deleted (60)
  • Target version set to To be groomed

I feel like it is highly likely it was fixed but someone should double check.

Actions #6

Updated by Brett Smith about 1 month ago

Peter Amstutz wrote in #note-5:

I feel like it is highly likely it was fixed but someone should double check.

We have an index on owner_uuid and name, created in 20180919001158_recreate_collection_unique_name_index.rb as part of #13561, but we don't have an index on just name as suggested in this issue. But maybe it still meets the need? The reported search also checks owner_uuid.

Actions #7

Updated by Brett Smith about 1 month ago

  • Story points set to 0.5
Actions #8

Updated by Peter Amstutz about 1 month ago

Brett Smith wrote in #note-6:

Peter Amstutz wrote in #note-5:

I feel like it is highly likely it was fixed but someone should double check.

We have an index on owner_uuid and name, created in 20180919001158_recreate_collection_unique_name_index.rb as part of #13561, but we don't have an index on just name as suggested in this issue. But maybe it still meets the need? The reported search also checks owner_uuid.

I guess the issue here is that if the name doesn't exist at all, the query can terminate much faster. Instead, it is doing a full table scan for a name only to come up empty.

A separate standalone "name" index is probably not a bad idea.

Actions

Also available in: Atom PDF