Bug #13972

Listing collections by PDH and name can be very slow

Added by Joshua Randall 4 months ago. Updated 4 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
API
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

Requests to list collections by PDH and name (I think these are coming from a-c-r) sometimes take a very long time (90s+):

{
  "method": "GET",
  "path": "/arvados/v1/collections",
  "format": "json",
  "controller": "Arvados::V1::CollectionsController",
  "action": "index",
  "status": 200,
  "duration": 104537.05,
  "view": 0.33,
  "db": 104532.55,
  "request_id": "req-mgxr3wrwt09hytd2jjsc",
  "client_ipaddr": "10.101.0.7",
  "client_auth": "eglyx-gj3su-u157eevzet2a0gt",
  "params": {
    "alt": "json",
    "limit": "1",
    "filters": "[[\"portable_data_hash\", \"=\", \"d41d8cd98f00b204e9800998ecf8427e+0\"], [\"name\", \"like\", \"gatk-4.0.0.0-library-cram-to-gvcfs.cwl dependencies%\"]]" 
  },
  "@timestamp": "2018-08-06T17:46:29.775676929Z",
  "@version": "1",
  "message": "[200] GET /arvados/v1/collections (Arvados::V1::CollectionsController#index)" 
}

root@arvados-api-backend-eglyx-01:/var/www/arvados-api/current/log# grep req-mgxr3wrwt09hytd2jjsc production.log
[req-mgxr3wrwt09hytd2jjsc]   Collection Load (49993.9ms)  SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.portable_data_hash = 'd41d8cd98f00b204e9800998ecf8427e+0') AND (collections.name like 'gatk-4.0.0.0-library-cram-to-gvcfs.cwl dependencies%'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 1 OFFSET 0
[req-mgxr3wrwt09hytd2jjsc]    (54538.7ms)  SELECT COUNT("collections"."id") FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.portable_data_hash = 'd41d8cd98f00b204e9800998ecf8427e+0') AND (collections.name like 'gatk-4.0.0.0-library-cram-to-gvcfs.cwl dependencies%'))
{"method":"GET","path":"/arvados/v1/collections","format":"json","controller":"Arvados::V1::CollectionsController","action":"index","status":200,"duration":104537.05,"view":0.33,"db":104532.55,"request_id":"req-mgxr3wrwt09hytd2jjsc","client_ipaddr":"10.101.0.7","client_auth":"eglyx-gj3su-u157eevzet2a0gt","params":{"alt":"json","limit":"1","filters":"[[\"portable_data_hash\", \"=\", \"d41d8cd98f00b204e9800998ecf8427e+0\"], [\"name\", \"like\", \"gatk-4.0.0.0-library-cram-to-gvcfs.cwl dependencies%\"]]"},"@timestamp":"2018-08-06T17:46:29.775676929Z","@version":"1","message":"[200] GET /arvados/v1/collections (Arvados::V1::CollectionsController#index)"}

Here is how the database explains it:

arvados_api_production=# explain analyze SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.portable_data_hash = 'd41d8cd98f00b204e9800998ecf8427e+0') AND (collections.name like 'gatk-4.0.0.0-library-cram-to-gvcfs.cwl dependencies%'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 1 OFFSET 0;
                                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=110280.76..110280.76 rows=1 width=894) (actual time=33782.855..33782.855 rows=0 loops=1)
   ->  Sort  (cost=110280.76..110280.77 rows=4 width=894) (actual time=33782.855..33782.855 rows=0 loops=1)
         Sort Key: collections.modified_at DESC, collections.uuid
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop Anti Join  (cost=1590.26..110280.74 rows=4 width=894) (actual time=33782.847..33782.847 rows=0 loops=1)
               Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
               ->  Bitmap Heap Scan on collections  (cost=1590.26..110279.19 rows=5 width=894) (actual time=33782.846..33782.846 rows=0 loops=1)
                     Recheck Cond: ((portable_data_hash)::text = 'd41d8cd98f00b204e9800998ecf8427e+0'::text)
                     Filter: ((NOT is_trashed) AND ((name)::text ~~ 'gatk-4.0.0.0-library-cram-to-gvcfs.cwl dependencies%'::text))
                     Rows Removed by Filter: 65732
                     Heap Blocks: exact=46329
                     ->  Bitmap Index Scan on index_collections_on_portable_data_hash  (cost=0.00..1590.26 rows=61293 width=0) (actual time=352.125..352.125 rows=65732 loops=1)
                           Index Cond: ((portable_data_hash)::text = 'd41d8cd98f00b204e9800998ecf8427e+0'::text)
               ->  Materialize  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                           Filter: (trashed = 1)
 Planning time: 18.993 ms
 Execution time: 33782.934 ms
(18 rows)

arvados_api_production=# explain analyze SELECT COUNT("collections"."id") FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.portable_data_hash = 'd41d8cd98f00b204e9800998ecf8427e+0') AND (collections.name like 'gatk-4.0.0.0-library-cram-to-gvcfs.cwl dependencies%'));
                                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=110280.74..110280.75 rows=1 width=4) (actual time=36774.915..36774.915 rows=1 loops=1)
   ->  Hash Anti Join  (cost=1591.75..110280.73 rows=4 width=4) (actual time=36774.911..36774.911 rows=0 loops=1)
         Hash Cond: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
         ->  Bitmap Heap Scan on collections  (cost=1590.26..110279.19 rows=5 width=32) (actual time=36774.909..36774.909 rows=0 loops=1)
               Recheck Cond: ((portable_data_hash)::text = 'd41d8cd98f00b204e9800998ecf8427e+0'::text)
               Filter: ((NOT is_trashed) AND ((name)::text ~~ 'gatk-4.0.0.0-library-cram-to-gvcfs.cwl dependencies%'::text))
               Rows Removed by Filter: 65732
               Heap Blocks: exact=46329
               ->  Bitmap Index Scan on index_collections_on_portable_data_hash  (cost=0.00..1590.26 rows=61293 width=0) (actual time=580.662..580.662 rows=65732 loops=1)
                     Index Cond: ((portable_data_hash)::text = 'd41d8cd98f00b204e9800998ecf8427e+0'::text)
         ->  Hash  (cost=1.48..1.48 rows=1 width=28) (never executed)
               ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     Filter: (trashed = 1)
 Planning time: 0.388 ms
 Execution time: 36775.002 ms
(15 rows)

History

#1 Updated by Joshua Randall 4 months ago

arvados_api_production=# select count(*) from collections where portable_data_hash = 'd41d8cd98f00b204e9800998ecf8427e+0';
 count
-------
 65732
(1 row)

arvados_api_production=# select count(*) from collections where portable_data_hash = 'd41d8cd98f00b204e9800998ecf8427e+0' and is_trashed = false;
 count
-------
 17789
(1 row)

#2 Updated by Joshua Randall 4 months ago

I added another index and this query got 10x faster:

arvados_api_production=# \d index_collections_on_portable_data_hash_is_trashed
Index "public.index_collections_on_portable_data_hash_is_trashed" 
       Column       |          Type          |     Definition
--------------------+------------------------+--------------------
 portable_data_hash | character varying(255) | portable_data_hash
 is_trashed         | boolean                | is_trashed
btree, for table "public.collections" 

arvados_api_production=# explain analyze SELECT  collections."uuid", collections."owner_uuid", collections."created_at", collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed" FROM "collections" WHERE (NOT EXISTS(SELECT 1 FROM materialized_permission_view WHERE trashed = 1 AND (collections.owner_uuid = target_uuid)) AND collections.is_trashed = false) AND ((collections.portable_data_hash = 'd41d8cd98f00b204e9800998ecf8427e+0') AND (collections.name like 'gatk-4.0.0.0-library-cram-to-gvcfs.cwl dependencies%'))  ORDER BY collections.modified_at desc, collections.uuid LIMIT 1 OFFSET 0;
                                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=93374.56..93374.56 rows=1 width=894) (actual time=2944.982..2944.982 rows=0 loops=1)
   ->  Sort  (cost=93374.56..93374.57 rows=4 width=894) (actual time=2944.980..2944.980 rows=0 loops=1)
         Sort Key: collections.modified_at DESC, collections.uuid
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop Anti Join  (cost=1255.44..93374.54 rows=4 width=894) (actual time=2944.970..2944.970 rows=0 loops=1)
               Join Filter: ((collections.owner_uuid)::text = (materialized_permission_view.target_uuid)::text)
               ->  Bitmap Heap Scan on collections  (cost=1255.44..93372.99 rows=5 width=894) (actual time=2944.969..2944.969 rows=0 loops=1)
                     Recheck Cond: ((portable_data_hash)::text = 'd41d8cd98f00b204e9800998ecf8427e+0'::text)
                     Filter: ((NOT is_trashed) AND ((name)::text ~~ 'gatk-4.0.0.0-library-cram-to-gvcfs.cwl dependencies%'::text))
                     Rows Removed by Filter: 17789
                     Heap Blocks: exact=13978
                     ->  Bitmap Index Scan on index_collections_on_portable_data_hash_is_trashed  (cost=0.00..1255.44 rows=51288 width=0) (actual time=21.035..21.035 rows=17789 loops=1)
                           Index Cond: (((portable_data_hash)::text = 'd41d8cd98f00b204e9800998ecf8427e+0'::text) AND (is_trashed = false))
               ->  Materialize  (cost=0.00..1.48 rows=1 width=28) (never executed)
                     ->  Seq Scan on materialized_permission_view  (cost=0.00..1.48 rows=1 width=28) (never executed)
                           Filter: (trashed = 1)
 Planning time: 0.704 ms
 Execution time: 2945.049 ms
(18 rows)

#3 Updated by Tom Morris 4 months ago

  • Description updated (diff)
  • Target version set to To Be Groomed

I updated the original ticket with the contents of Note-1. I'll delete that note too.

Another thing to note here is that the representation of a 16 byte MD5 hash as a 32 character hexadecimal string has much lower information density than its original representation.

#4 Updated by Tom Morris 4 months ago

It looks like the Postgres UUID datatype with a B-tree index, may be a more efficient way to go for storing and indexing MD5s:
https://dba.stackexchange.com/questions/115271/what-is-the-optimal-data-type-for-an-md5-field
https://dba.stackexchange.com/questions/102448/how-should-i-index-a-uuid-in-postgres

Also, I'm wondering if the separate COUNT query is really necessary. I guess if both queries are made fast, it's less of an issue, but it makes me suspicious of the anti-pattern of counting things that don't really need counting.

Also available in: Atom PDF