Bug #21814

Bad performance of "shared with me"

Added by Peter Amstutz about 1 month ago. Updated 13 days ago.

Status: In Progress
Priority: Normal
Assigned To:
Category: API
Story points: -

Description

Report from user

[We] are waiting here for more than a few minutes for the list to load.
This is making the "Shared with me" page unusable.

I looked into this, and it appears that the query doesn't benefit from any indexes and does a full table scan.

  1. We should see if we can add an index to make it perform better
    1. Run explain analyze on the primary "shared with me" query
    2. Identify inner queries that could be indexed and/or if the query could be written differently
  2. We should fix workbench2 to not block page loads on computing "items available" (#17074)

Files

21814Analyze.log (11.3 KB), Brett Smith, 06/03/2024 09:41 PM
21814Rails.log (23.7 KB), Brett Smith, 06/03/2024 09:41 PM
21814Queries.sql (7.64 KB), Brett Smith, 06/03/2024 09:41 PM

Subtasks (1 open, 0 closed)

Task #21825: Review (New, assigned to Lucas Di Pentima)

Related issues

Related to Arvados - Feature #17074: Use count=none & keyset paging (In Progress, assigned to Stephen Smith)
#1

Updated by Peter Amstutz about 1 month ago

  • Description updated (diff)
#2

Updated by Peter Amstutz about 1 month ago

#3

Updated by Peter Amstutz about 1 month ago

  • Description updated (diff)
#4

Updated by Peter Amstutz about 1 month ago

  • Target version changed from 439 to Development 2024-06-05 sprint
#5

Updated by Peter Amstutz about 1 month ago

  • Assigned To set to Brett Smith
#6

Updated by Brett Smith about 1 month ago

IMO there's a lot of room for improvement in Workbench's queries before we get to the API server. I loaded one page of "Shared With Me" with the network inspector open, saved the HAR file, and parsed the results below.

  • By far the biggest issue is that Workbench makes a separate "get user" call for each item on the page, apparently to render the owner name, with no caching at all. By contrast, Workbench already grabs all the container request UUIDs and fetches information for them with a single list request. It would be a big improvement if it did the same for object owners.
  • When Workbench lists container requests, it selects a lot of fields that aren't obviously necessary to render this page. Some of these are potentially large, like command, environment, output, runtime_constraints, and scheduling_parameters. Slimming this list down would reduce both response size and rendering time.
  • When Workbench requests the contents, there are some noop name conditions: ["collections.name","ilike","%%"],["container_requests.name","ilike","%%"],["groups.name","ilike","%%"]. I suspect PostgreSQL is smart enough to ignore these, but it would be a nice clean-up to not include these conditions at all.
% jq -r '.log.entries | map(.request) | map(.method + " " + .url)' 21814WorkbenchSharedWithMe.har | while read; do echo -e "${REPLY//%/\\x}"; done
[
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/groups/contents?limit=50&offset=0&exclude_home_project=true&filters=[["uuid","is_a",["arvados#group","arvados#collection","arvados#workflow","arvados#containerRequest"]],["collections.properties.type","not+in",["log","intermediate"]],["container_requests.requesting_container_uuid","=",null],["collections.name","ilike","%%"],["container_requests.name","ilike","%%"],["groups.name","ilike","%%"],["uuid","!=","jutro-j7d0g-publicfavorites"]]&order=collections.modified_at+desc,container_requests.modified_at+desc,groups.modified_at+desc,container_requests.created_at+desc",
  "GET https://jutro.arvadosapi.com/arvados/v1/groups/contents?limit=50&offset=0&exclude_home_project=true&filters=[["uuid","is_a",["arvados#group","arvados#collection","arvados#workflow","arvados#containerRequest"]],["collections.properties.type","not+in",["log","intermediate"]],["container_requests.requesting_container_uuid","=",null],["collections.name","ilike","%%"],["container_requests.name","ilike","%%"],["groups.name","ilike","%%"],["uuid","!=","jutro-j7d0g-publicfavorites"]]&order=collections.modified_at+desc,container_requests.modified_at+desc,groups.modified_at+desc,container_requests.created_at+desc",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/links",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/links",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/containers?limit=10&filters=[["uuid","in",["jutro-dz642-vllvdqn9etb3jwb","jutro-dz642-sy9v9aik24v4mn5","jutro-dz642-odq6e6qq3e7w1o5","jutro-dz642-a10uraavowzg6jl","jutro-dz642-tvg3x6srvocel5b","jutro-dz642-za85ukyklqcfjp6","jutro-dz642-2suwzy91dc4prka","jutro-dz642-mezaf9gijdepcpw","jutro-dz642-3jpldjssokrtzwv","jutro-dz642-sxe81u4b3761uzj"]]]&select=["auth_uuid",+"command",+"container_image",+"cost",+"created_at",+"cwd",+"environment",+"etag",+"exit_code",+"finished_at",+"gateway_address",+"href",+"interactive_session_started",+"kind",+"lock_count",+"locked_by_uuid",+"log",+"modified_at",+"modified_by_client_uuid",+"modified_by_user_uuid",+"output_path",+"output_properties",+"output_storage_classes",+"output",+"owner_uuid",+"priority",+"progress",+"runtime_auth_scopes",+"runtime_constraints",+"runtime_status",+"runtime_user_uuid",+"scheduling_parameters",+"started_at",+"state",+"subrequests_cost",+"uuid"]",
  "POST https://jutro.arvadosapi.com/arvados/v1/links",
  "POST https://jutro.arvadosapi.com/arvados/v1/links",
  "GET https://jutro.arvadosapi.com/arvados/v1/containers?limit=10&filters=[["uuid","in",["jutro-dz642-vllvdqn9etb3jwb","jutro-dz642-sy9v9aik24v4mn5","jutro-dz642-odq6e6qq3e7w1o5","jutro-dz642-a10uraavowzg6jl","jutro-dz642-tvg3x6srvocel5b","jutro-dz642-za85ukyklqcfjp6","jutro-dz642-2suwzy91dc4prka","jutro-dz642-mezaf9gijdepcpw","jutro-dz642-3jpldjssokrtzwv","jutro-dz642-sxe81u4b3761uzj"]]]&select=["auth_uuid",+"command",+"container_image",+"cost",+"created_at",+"cwd",+"environment",+"etag",+"exit_code",+"finished_at",+"gateway_address",+"href",+"interactive_session_started",+"kind",+"lock_count",+"locked_by_uuid",+"log",+"modified_at",+"modified_by_client_uuid",+"modified_by_user_uuid",+"output_path",+"output_properties",+"output_storage_classes",+"output",+"owner_uuid",+"priority",+"progress",+"runtime_auth_scopes",+"runtime_constraints",+"runtime_status",+"runtime_user_uuid",+"scheduling_parameters",+"started_at",+"state",+"subrequests_cost",+"uuid"]",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-3gkvop1l01t5nx6",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-3gkvop1l01t5nx6",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-3gkvop1l01t5nx6",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-3gkvop1l01t5nx6",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-59v8ow4jlft20wf",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-3gkvop1l01t5nx6",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-kd1gw1tlhh7vhb7",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-59v8ow4jlft20wf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-kd1gw1tlhh7vhb7",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-t7t716xn65udys3",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-kd1gw1tlhh7vhb7",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-kd1gw1tlhh7vhb7",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-t7t716xn65udys3",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl" 
]
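The dozens of per-user GETs above could be collapsed into one batched list request per page, the same way Workbench already handles container requests. A minimal sketch of the idea; `fetch_users` is a hypothetical stand-in for the real API client call, not an existing Workbench function:

```python
def batch_owner_lookup(items, fetch_users):
    """Resolve the owners of one page of results with a single request.

    `items` is a list of dicts with an "owner_uuid" key.  `fetch_users`
    stands in for a call like
    GET /arvados/v1/users?filters=[["uuid","in",[...]]]
    and returns a list of user records.
    """
    # Deduplicate while preserving first-seen order.
    owner_uuids = list(dict.fromkeys(item["owner_uuid"] for item in items))
    # One request for the whole page, then an in-memory lookup table.
    users = fetch_users(owner_uuids)
    return {user["uuid"]: user for user in users}
```

For the capture above, this would turn roughly 50 user GETs (for only 8 distinct users) into a single list request.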
#7

Updated by Peter Amstutz about 1 month ago

You are right that there's a bunch of redundant or unnecessary work happening. However, the core issue is that the primary query (the contents endpoint with exclude_home_project=true) takes 30-90 seconds on some clusters, whereas those other issues mostly just add a few extra seconds (they are proportional to the page size, not the total result size).

That said, it is possible that "ilike '%'" isn't properly optimized out, which would cause a table scan. It would be a good idea, on the API server side, to discard any like/ilike conditions whose pattern consists only of the wildcard characters % and _ (with at least one % present), since such patterns match everything.
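A server-side check along these lines could look like the following sketch (assuming standard SQL LIKE semantics, where % matches any string and _ matches any single character; the function names are illustrative, not existing API server code):

```python
def is_noop_pattern(pattern: str) -> bool:
    """Return True if a LIKE/ILIKE pattern can be dropped as a no-op:
    it consists only of the wildcards '%' (any string) and '_' (any one
    character) and contains at least one '%'.  Such patterns match every
    value at least as long as the number of '_' characters, so for
    required columns like names they filter out nothing."""
    return (
        bool(pattern)
        and set(pattern) <= {"%", "_"}
        and "%" in pattern
    )

def strip_noop_filters(filters):
    """Drop [attr, "like"/"ilike", pattern] conditions whose pattern is
    a no-op, leaving every other filter condition untouched."""
    return [
        f for f in filters
        if not (f[1].lower() in ("like", "ilike") and is_noop_pattern(f[2]))
    ]
```

With this, the three `ilike '%%'` conditions Workbench currently sends would never reach the SQL query at all.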

#8

Updated by Brett Smith 29 days ago

Attached are:

  • The Rails log of all the queries involved in serving /groups/contents.
  • An SQL file to show the query plan for each distinct query in there.
  • The results of running psql --no-psqlrc --file 21814Queries.sql --host localhost arvados_test arvados against a test server database (this is important, see below).

I notice this stanza repeats across a lot of the queries:

        SubPlan 1
          ->  Seq Scan on trashed_groups  (cost=0.00..28.00 rows=400 width=32)
                Filter: (trash_at <= statement_timestamp())
        SubPlan 2
          ->  Seq Scan on users  (cost=0.00..3.36 rows=28 width=28)
                Filter: ((uuid)::text <> 'zzzzz-tpzed-d9tiejq69daie8f'::text)
        SubPlan 4
          ->  Seq Scan on groups gp  (cost=0.00..3.54 rows=16 width=32)
                Filter: ((group_class)::text <> 'project'::text)

users.uuid and groups.group_class have indexes. I assume they are sequentially scanned because the test database is so small that this caveat from the PostgreSQL documentation applies:

The planner's cost estimates are not linear and so it might choose a different plan for a larger or smaller table. An extreme example is that on a table that only occupies one disk page, you'll nearly always get a sequential scan plan whether indexes are available or not. The planner realizes that it's going to take one disk page read to process the table in any case, so there's no value in expending additional page reads to look at an index.

This may apply to some of the higher-level sequential scans as well, though that's harder to tell while the subplans themselves do sequential scans.

To get better data, it would be good to run this SQL on a real database and get the results. Ideally the affected user's; if that's not possible, the largest Curii cluster's.

I am fairly confident an index on trashed_groups.trash_at would help. That doesn't seem to exist, and because the column gets queried repeatedly, it causes repeated sequential scans.

#9

Updated by Peter Amstutz 28 days ago

  • Status changed from New to In Progress
#12

Updated by Peter Amstutz 27 days ago

  • Target version changed from Development 2024-06-05 sprint to Development 2024-06-19 sprint
#13

Updated by Brett Smith 26 days ago

Running the plan explainer on a real database shows that just doing #17074 would improve Workbench's performance by about two orders of magnitude. Most of the work happens when retrieving collections: that query is fairly complicated and ends up looking at most of the rows. We can improve that on the API server end, but the LIMIT built into the query means the cost becomes a much, much bigger problem when doing the count. Here's the "main" query to fetch collections:

EXPLAIN SELECT collections."uuid", collections."owner_uuid", collections."created_at",  collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed", collections."version", collections."current_version_uuid", collections."preserve_version", collections."file_count", collections."file_size_total" FROM "collections" WHERE (NOT (((collections.owner_uuid IN (SELECT group_uuid FROM trashed_groups WHERE trash_at <= statement_timestamp())) OR collections.trash_at <= statement_timestamp() IS TRUE)) AND collections.uuid = collections.current_version_uuid) AND (collections.owner_uuid IN (SELECT users.uuid FROM users WHERE users.uuid != ('zzzzz-tpzed-user1user1user1')) OR EXISTS(SELECT 1 FROM groups as gp where gp.uuid=collections.owner_uuid and gp.group_class != 'project')) AND ((1=0 OR 1=1 OR 1=0 OR 1=0) AND (collections.properties->>'type' NOT IN ('log','intermediate') OR collections.properties->>'type' IS NULL) AND (collections.name ilike '%%') AND (collections.uuid <> 'zzzzz-j7d0g-publicfavorites' OR collections.uuid IS NULL)) ORDER BY collections.modified_at desc LIMIT 9 OFFSET 0;

QUERY PLAN
Limit  (cost=103.90..12317.51 rows=9 width=499)
  ->  Index Scan Backward using index_collections_on_modified_at_and_uuid on collections  (cost=103.90..7865666.06 rows=5796 width=499)
        Filter: ((NOT (hashed SubPlan 1)) AND ((name)::text ~~* '%%'::text) AND (((uuid)::text <> 'zzzzz-j7d0g-publicfavorites'::text) OR (uuid IS NULL)) AND ((uuid)::text = (current_version_uuid)::text) AND ((trash_at <= statement_timestamp()) IS NOT TRUE) AND (((properties ->> 'type'::text) <> ALL ('{log,intermediate}'::text[])) OR ((properties ->> 'type'::text) IS NULL)) AND ((hashed SubPlan 2) OR (hashed SubPlan 4)))
        SubPlan 1
          ->  Seq Scan on trashed_groups  (cost=0.00..17.61 rows=707 width=28)
                Filter: (trash_at <= statement_timestamp())
        SubPlan 2
          ->  Index Only Scan using index_users_on_uuid on users  (cost=0.28..76.50 rows=3040 width=28)
                Filter: ((uuid)::text <> 'zzzzz-tpzed-user1user1user1'::text)
        SubPlan 4
          ->  Seq Scan on groups gp  (cost=0.00..1314.24 rows=182 width=32)
                Filter: ((group_class)::text <> 'project'::text)

Note that the index scan has a huge total cost, but it's capped by the parent limit node. When you do the count, you lift the limit and have to pay the entire cost:

EXPLAIN SELECT COUNT(*) FROM "collections" WHERE (NOT (((collections.owner_uuid IN (SELECT group_uuid FROM trashed_groups WHERE trash_at <= statement_timestamp())) OR collections.trash_at <= statement_timestamp() IS TRUE)) AND collections.uuid = collections.current_version_uuid) AND (collections.owner_uuid IN (SELECT users.uuid FROM users WHERE users.uuid != ('zzzzz-tpzed-user1user1user1')) OR EXISTS(SELECT 1 FROM groups as gp where gp.uuid=collections.owner_uuid and gp.group_class != 'project')) AND ((1=0 OR 1=1 OR 1=0 OR 1=0) AND (collections.properties->>'type' NOT IN ('log','intermediate') OR collections.properties->>'type' IS NULL) AND (collections.name ilike '%%') AND (collections.uuid <> 'zzzzz-j7d0g-publicfavorites' OR collections.uuid IS NULL));

QUERY PLAN
Aggregate  (cost=7826420.24..7826420.25 rows=1 width=8)
  ->  Seq Scan on collections  (cost=103.47..7826405.75 rows=5796 width=0)
        Filter: ((NOT (hashed SubPlan 1)) AND ((name)::text ~~* '%%'::text) AND (((uuid)::text <> 'zzzzz-j7d0g-publicfavorites'::text) OR (uuid IS NULL)) AND ((uuid)::text = (current_version_uuid)::text) AND ((trash_at <= statement_timestamp()) IS NOT TRUE) AND (((properties ->> 'type'::text) <> ALL ('{log,intermediate}'::text[])) OR ((properties ->> 'type'::text) IS NULL)) AND ((hashed SubPlan 2) OR (hashed SubPlan 4)))
        SubPlan 1
          ->  Seq Scan on trashed_groups  (cost=0.00..17.61 rows=707 width=28)
                Filter: (trash_at <= statement_timestamp())
        SubPlan 2
          ->  Index Only Scan using index_users_on_uuid on users  (cost=0.28..76.50 rows=3040 width=28)
                Filter: ((uuid)::text <> 'zzzzz-tpzed-user1user1user1'::text)
        SubPlan 4
          ->  Seq Scan on groups gp  (cost=0.00..1314.24 rows=182 width=32)
                Filter: ((group_class)::text <> 'project'::text)

On this database, just this one query is two orders of magnitude more expensive than any other run by a /groups/contents request.
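This is exactly what #17074 avoids: with count=none the server never runs the COUNT(*) query, and keyset paging replaces OFFSET with a filter on the last-seen sort key, so every request stays under a LIMIT-capped plan like the first one above. A rough illustration of the client-side loop; `fetch_page` is a hypothetical stand-in for the contents request, and a real implementation would break ties on modified_at with the uuid column:

```python
def iter_keyset_pages(fetch_page, page_size=50):
    """Iterate results with keyset paging.  Instead of OFFSET (which makes
    the server re-scan all skipped rows) and COUNT(*) (which pays the full
    uncapped plan cost), each request filters on the last-seen sort key.

    `fetch_page(modified_before, limit)` stands in for a request like
    GET /groups/contents?count=none&limit=N
        &filters=[["modified_at","<",modified_before]]
        &order=modified_at desc
    and returns a list of records sorted by modified_at descending.
    """
    modified_before = None
    while True:
        page = fetch_page(modified_before, page_size)
        if not page:
            return
        yield from page
        # The next page starts strictly after the last row we saw.
        modified_before = page[-1]["modified_at"]
```

Each iteration corresponds to one LIMIT-capped index scan on the server, and no request ever triggers the expensive aggregate above.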

#14

Updated by Peter Amstutz 13 days ago

  • Target version changed from Development 2024-06-19 sprint to Development 2024-07-03 sprint