Bug #8289
closed[API] Avoid making queries unnecessarily slow by adding superfluous "order by" columns
Description
As an example, when the Python SDK's websocket fallback PollClient starts up, it does an API request with order="id asc". API server translates that to a PostgreSQL query like this:
arvados_production=# SELECT "logs".* FROM "logs" WHERE ((logs.event_type in ('create','update','delete')) AND (logs.id > '3464274')) ORDER BY logs.id desc, logs.modified_at desc, logs.uuid LIMIT 1 OFFSET 0; arvados_production=# explain analyze SELECT "logs".* FROM "logs" WHERE ((logs.event_type in ('create','update','delete')) AND (logs.id > '3464274')) ORDER BY logs.id desc, logs.modified_at desc, logs.uuid LIMIT 1 OFFSET 0; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=822081.89..822081.89 rows=1 width=1290) (actual time=4007.367..4007.368 rows=1 loops=1) -> Sort (cost=822081.89..823218.65 rows=454703 width=1290) (actual time=4007.364..4007.364 rows=1 loops=1) Sort Key: id, modified_at, uuid Sort Method: top-N heapsort Memory: 27kB -> Bitmap Heap Scan on logs (cost=69925.24..819808.37 rows=454703 width=1290) (actual time=531.924..3086.689 rows=869534 loops=1) Recheck Cond: ((id > 3464274) AND ((event_type)::text = ANY ('{create,update,delete}'::text[]))) -> BitmapAnd (cost=69925.24..69925.24 rows=454703 width=0) (actual time=530.935..530.935 rows=0 loops=1) -> Bitmap Index Scan on logs_pkey (cost=0.00..29223.51 rows=1433428 width=0) (actual time=249.940..249.940 rows=1954308 loops=1) Index Cond: (id > 3464274) -> Bitmap Index Scan on index_logs_on_event_type (cost=0.00..40474.13 rows=1341921 width=0) (actual time=270.632..270.632 rows=1735755 loops=1) Index Cond: ((event_type)::text = ANY ('{create,update,delete}'::text[])) Total runtime: 4007.417 ms
"id" is a primary key, so the "logs.modified_at desc, logs.uuid" part of the "order by" clause can't possibly make a difference to the results. If we remove them, the query looks like this:
arvados_production=# explain analyze SELECT "logs".* FROM "logs" WHERE ((logs.event_type in ('create','update','delete')) AND (logs.id > '3464274')) ORDER BY logs.id desc LIMIT 1 OFFSET 0; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.00..8.67 rows=1 width=1290) (actual time=0.049..0.050 rows=1 loops=1) -> Index Scan Backward using logs_pkey on logs (cost=0.00..3943415.80 rows=454703 width=1290) (actual time=0.046..0.046 rows=1 loops=1) Index Cond: (id > 3464274) Filter: ((event_type)::text = ANY ('{create,update,delete}'::text[])) Total runtime: 0.078 ms
Updated by Tom Clegg almost 9 years ago
- Description updated (diff)
- Category set to API
Updated by Brett Smith almost 9 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith almost 9 years ago
- Target version changed from Arvados Future Sprints to 2016-02-17 Sprint
Updated by Radhika Chippada almost 9 years ago
Review comments:
I experimented with a few more combinations of order in the query_test and I am not sure if we need more tweaking here.
- If I gave order: ['id asc', 'modified_at asc’], I am ending up with 'logs.id asc, logs.modified_at asc’. Do we want to get rid off the modified_at in this case to ensure clients do not send meaningless yet costly queries?
- If I gave order: ['modified_at asc'], I am getting ‘logs.modified_at asc, logs.modified_at desc, logs.uuid’
- If I used order: ['event_type asc', 'modified_at asc’], I am getting "logs.event_type asc", "logs.modified_at asc", "logs.modified_at desc", "logs.uuid”. In this and the previous case, shouldn’t we get rid off the modified_at before adding the full default list? Or better yet, add only those from the default list that are not specified?
Updated by Tom Clegg almost 9 years ago
Good suggestions, thanks. Updated code and added tests; now at 6f1d385
Updated by Radhika Chippada almost 9 years ago
This is much better. Can we update the comment a bit more. Something like: if id and updated_at are given, updated_at is truncated because after the unique id column ordering, no need to order on updated_at. On the other hand, if log_type and id are given ordering on type and then id is what is happening. And then if I used log_type, id, updated_at, we will use only log_type and id and truncate updated_at since there is no point in using it.
With or without it LGTM. Thanks.
Updated by Tom Clegg almost 9 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:a13c14cfc7fabe4f4da48edd57a086d2d8953a03.