Bug #21814


How to efficiently page group contents queries of more than one object type (e.g. "shared with me")

Added by Peter Amstutz 7 months ago. Updated 4 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Story points: -
Release:
Release relationship: Auto

Description

Report from user

[We] are waiting here for more than a few minutes for the list to load.
This is making the “Shared with me” view unusable.

I looked into this, and it appears that the query doesn't benefit from any indexes and does a full table scan.

  1. We should see if we can add an index to make it perform better:
    1. Run EXPLAIN ANALYZE on the primary "shared with me" query (see the sketch below).
    2. Identify inner queries that could be indexed, and/or whether the query could be written differently.
  2. We should fix workbench2 to not block page loads on computing "items available" (#17074)
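
For step 1.1, a minimal sketch of running the analysis from the API server's Rails console; the SQL shown is a placeholder, not the real generated query, which should be captured from the Rails log:

    # Paste the query captured from the Rails log in place of this placeholder.
    sql = <<~SQL
      SELECT collections.uuid FROM collections
      ORDER BY collections.modified_at DESC LIMIT 50 OFFSET 0
    SQL
    # EXPLAIN ANALYZE runs the query and reports the actual plan and timings.
    ActiveRecord::Base.connection.execute("EXPLAIN ANALYZE #{sql}").each do |row|
      puts row["QUERY PLAN"]
    end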

Files

21814Analyze.log (11.3 KB) Brett Smith, 06/03/2024 09:41 PM
21814Rails.log (23.7 KB) Brett Smith, 06/03/2024 09:41 PM
21814Queries.sql (7.64 KB) Brett Smith, 06/03/2024 09:41 PM

Subtasks 1 (0 open, 1 closed)

Task #22018: Review 21814-contents-paging (Resolved, Tom Clegg, 08/13/2024)

Related issues 1 (0 open, 1 closed)

Related to Arvados - Feature #17074: Use count=none when fetching pages (Resolved, Stephen Smith)
#1

Updated by Peter Amstutz 7 months ago

  • Description updated (diff)

#2

Updated by Peter Amstutz 7 months ago

#3

Updated by Peter Amstutz 7 months ago

  • Description updated (diff)

#4

Updated by Peter Amstutz 7 months ago

  • Target version changed from 439 to Development 2024-06-05 sprint

#5

Updated by Peter Amstutz 7 months ago

  • Assigned To set to Brett Smith

#6

Updated by Brett Smith 7 months ago

IMO there's a lot of room for improvement in Workbench's queries before we get to the API server. I loaded one page of "Shared With Me" with the network inspector open, saved the HAR file, and parsed the results below.

  • By far the biggest issue is that Workbench makes a separate "get user" call for each item on the page, apparently to render the owner name, with no caching at all. By contrast, Workbench already grabs the container UUIDs from the listed container requests and gets information for them all with a single list request. It would be a big improvement if it could do the same for object owners (see the sketch after the request log below).
  • When Workbench lists container requests, it selects a lot of fields that aren't obviously necessary to render this page. Some of these are potentially large, like command, environment, output, runtime_constraints, and scheduling_parameters. Slimming this list down would reduce both the response size and the rendering time.
  • When Workbench requests the contents, there are some no-op name conditions: ["collections.name","ilike","%%"],["container_requests.name","ilike","%%"],["groups.name","ilike","%%"]. I suspect PostgreSQL is smart enough to ignore these, but it would be a nice clean-up to not include these conditions at all.
% jq -r '.log.entries | map(.request) | map(.method + " " + .url)' 21814WorkbenchSharedWithMe.har | while read; do echo -e "${REPLY//%/\\x}"; done
[
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/groups/contents?limit=50&offset=0&exclude_home_project=true&filters=[["uuid","is_a",["arvados#group","arvados#collection","arvados#workflow","arvados#containerRequest"]],["collections.properties.type","not+in",["log","intermediate"]],["container_requests.requesting_container_uuid","=",null],["collections.name","ilike","%%"],["container_requests.name","ilike","%%"],["groups.name","ilike","%%"],["uuid","!=","jutro-j7d0g-publicfavorites"]]&order=collections.modified_at+desc,container_requests.modified_at+desc,groups.modified_at+desc,container_requests.created_at+desc",
  "GET https://jutro.arvadosapi.com/arvados/v1/groups/contents?limit=50&offset=0&exclude_home_project=true&filters=[["uuid","is_a",["arvados#group","arvados#collection","arvados#workflow","arvados#containerRequest"]],["collections.properties.type","not+in",["log","intermediate"]],["container_requests.requesting_container_uuid","=",null],["collections.name","ilike","%%"],["container_requests.name","ilike","%%"],["groups.name","ilike","%%"],["uuid","!=","jutro-j7d0g-publicfavorites"]]&order=collections.modified_at+desc,container_requests.modified_at+desc,groups.modified_at+desc,container_requests.created_at+desc",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/links",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/links",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/containers?limit=10&filters=[["uuid","in",["jutro-dz642-vllvdqn9etb3jwb","jutro-dz642-sy9v9aik24v4mn5","jutro-dz642-odq6e6qq3e7w1o5","jutro-dz642-a10uraavowzg6jl","jutro-dz642-tvg3x6srvocel5b","jutro-dz642-za85ukyklqcfjp6","jutro-dz642-2suwzy91dc4prka","jutro-dz642-mezaf9gijdepcpw","jutro-dz642-3jpldjssokrtzwv","jutro-dz642-sxe81u4b3761uzj"]]]&select=["auth_uuid",+"command",+"container_image",+"cost",+"created_at",+"cwd",+"environment",+"etag",+"exit_code",+"finished_at",+"gateway_address",+"href",+"interactive_session_started",+"kind",+"lock_count",+"locked_by_uuid",+"log",+"modified_at",+"modified_by_client_uuid",+"modified_by_user_uuid",+"output_path",+"output_properties",+"output_storage_classes",+"output",+"owner_uuid",+"priority",+"progress",+"runtime_auth_scopes",+"runtime_constraints",+"runtime_status",+"runtime_user_uuid",+"scheduling_parameters",+"started_at",+"state",+"subrequests_cost",+"uuid"]",
  "POST https://jutro.arvadosapi.com/arvados/v1/links",
  "POST https://jutro.arvadosapi.com/arvados/v1/links",
  "GET https://jutro.arvadosapi.com/arvados/v1/containers?limit=10&filters=[["uuid","in",["jutro-dz642-vllvdqn9etb3jwb","jutro-dz642-sy9v9aik24v4mn5","jutro-dz642-odq6e6qq3e7w1o5","jutro-dz642-a10uraavowzg6jl","jutro-dz642-tvg3x6srvocel5b","jutro-dz642-za85ukyklqcfjp6","jutro-dz642-2suwzy91dc4prka","jutro-dz642-mezaf9gijdepcpw","jutro-dz642-3jpldjssokrtzwv","jutro-dz642-sxe81u4b3761uzj"]]]&select=["auth_uuid",+"command",+"container_image",+"cost",+"created_at",+"cwd",+"environment",+"etag",+"exit_code",+"finished_at",+"gateway_address",+"href",+"interactive_session_started",+"kind",+"lock_count",+"locked_by_uuid",+"log",+"modified_at",+"modified_by_client_uuid",+"modified_by_user_uuid",+"output_path",+"output_properties",+"output_storage_classes",+"output",+"owner_uuid",+"priority",+"progress",+"runtime_auth_scopes",+"runtime_constraints",+"runtime_status",+"runtime_user_uuid",+"scheduling_parameters",+"started_at",+"state",+"subrequests_cost",+"uuid"]",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-3gkvop1l01t5nx6",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-3gkvop1l01t5nx6",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-3gkvop1l01t5nx6",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-3gkvop1l01t5nx6",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-59v8ow4jlft20wf",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-3gkvop1l01t5nx6",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-kd1gw1tlhh7vhb7",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-59v8ow4jlft20wf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-a4qnxq3pcfcgtkz",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-kd1gw1tlhh7vhb7",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-parp8dft47kpewf",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "OPTIONS https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-t7t716xn65udys3",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-jxva24vzgik1bqn",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-kd1gw1tlhh7vhb7",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-kd1gw1tlhh7vhb7",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-t7t716xn65udys3",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl",
  "GET https://jutro.arvadosapi.com/arvados/v1/users/jutro-tpzed-9z6foyez9ydn2hl" 
]
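
For the first bullet, a rough sketch of the batching pattern in Ruby (api_list_users is a hypothetical stand-in for whatever list-request helper the client uses; this is not actual Workbench or SDK code):

    # Collect the distinct owner UUIDs on the page and resolve them with one
    # filtered list request instead of one "get user" call per item.
    def owner_names(items, api_list_users)
      uuids = items.map { |i| i[:owner_uuid] }.uniq
      users = api_list_users.call(
        filters: [["uuid", "in", uuids]],
        select: ["uuid", "full_name"])
      by_uuid = users.to_h { |u| [u[:uuid], u[:full_name]] }
      # Map each item's UUID to its owner's display name for rendering.
      items.to_h { |i| [i[:uuid], by_uuid[i[:owner_uuid]]] }
    end

This turns the dozens of per-item user requests shown above into a single list call per page.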
#7

Updated by Peter Amstutz 7 months ago

You are right that there's a bunch of redundant or unnecessary work happening. However, the core issue is that the primary query (the contents endpoint with exclude_home_project=true) takes 30-90 seconds on some clusters, whereas those other issues mostly just add a few extra seconds (they scale with the page size, not the total result size).

That said, it is possible that "ilike '%'" isn't properly optimized out, which would cause a table scan. It would be a good idea, on the API server side, to discard any like or ilike condition whose pattern consists only of % and _ wildcard characters (as long as at least one % is present); see the sketch below.
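
A minimal sketch of that clean-up, assuming filters arrive as [attribute, operator, operand] triples as in the Arvados filter syntax; the helper name is illustrative:

    # Matches patterns made only of % and _ wildcards, with at least one %;
    # such a pattern matches every non-null value, so the condition is a no-op.
    NOOP_WILDCARDS = /\A[%_]*%[%_]*\z/

    def discard_noop_like_filters(filters)
      filters.reject do |_attr, operator, operand|
        ["like", "ilike"].include?(operator.to_s.downcase) &&
          operand.is_a?(String) && operand.match?(NOOP_WILDCARDS)
      end
    end

    discard_noop_like_filters([["collections.name", "ilike", "%%"],
                               ["uuid", "!=", "zzzzz-j7d0g-publicfavorites"]])
    # => [["uuid", "!=", "zzzzz-j7d0g-publicfavorites"]]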

#8

Updated by Brett Smith 7 months ago

Attached are:

  • The Rails log of all the queries involved in serving /groups/contents.
  • An SQL file to show the query plan for each distinct query in there.
  • The results of running psql --no-psqlrc --file 21814Queries.sql --host localhost arvados_test arvados against a test server database (this is important, see below).

I notice this stanza repeats across a lot of the queries:

        SubPlan 1
          ->  Seq Scan on trashed_groups  (cost=0.00..28.00 rows=400 width=32)
                Filter: (trash_at <= statement_timestamp())
        SubPlan 2
          ->  Seq Scan on users  (cost=0.00..3.36 rows=28 width=28)
                Filter: ((uuid)::text <> 'zzzzz-tpzed-d9tiejq69daie8f'::text)
        SubPlan 4
          ->  Seq Scan on groups gp  (cost=0.00..3.54 rows=16 width=32)
                Filter: ((group_class)::text <> 'project'::text)

users.uuid and groups.group_class have indexes. I assume they're being sequentially scanned because the test database is so small that this caveat from the PostgreSQL documentation applies:

The planner's cost estimates are not linear and so it might choose a different plan for a larger or smaller table. An extreme example is that on a table that only occupies one disk page, you'll nearly always get a sequential scan plan whether indexes are available or not. The planner realizes that it's going to take one disk page read to process the table in any case, so there's no value in expending additional page reads to look at an index.

This may apply to some of the higher-level sequential scans as well, though that is hard to tell while the subplans themselves are doing sequential scans.

To get better data, it would be good to run this SQL on a real database and examine the results: ideally the affected user's, or failing that, the largest Curii cluster's.

I am fairly confident an index on trashed_groups.trash_at would help. It doesn't seem to exist, and because the table gets queried repeatedly, the gap causes repeated sequential scans.
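
A sketch of that index as a Rails migration, assuming trashed_groups is an ordinary table; the class name and Rails version tag are illustrative:

    class AddIndexToTrashedGroupsTrashAt < ActiveRecord::Migration[5.2]
      def change
        # Lets the repeated "trash_at <= statement_timestamp()" subplan use an
        # index scan instead of sequentially scanning trashed_groups each time.
        add_index :trashed_groups, :trash_at
      end
    end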

#9

Updated by Peter Amstutz 7 months ago

  • Status changed from New to In Progress

#12

Updated by Peter Amstutz 7 months ago

  • Target version changed from Development 2024-06-05 sprint to Development 2024-06-19 sprint

#13

Updated by Brett Smith 7 months ago

Running the plan explainer on a real database shows that just doing #17074 will improve Workbench's performance by about two orders of magnitude. Most of the work happens when retrieving collections. This query is pretty complicated and ends up looking at most of the rows. We can improve that on the API server end, but the LIMIT built into the query caps its cost, so the problem becomes much, much bigger when doing the count, which has no such cap. Here's the "main" query to fetch collections:

EXPLAIN SELECT collections."uuid", collections."owner_uuid", collections."created_at",  collections."modified_by_client_uuid", collections."modified_by_user_uuid", collections."modified_at", collections."name", collections."description", collections."properties", collections."portable_data_hash", collections."replication_desired", collections."replication_confirmed", collections."replication_confirmed_at", collections."storage_classes_desired", collections."storage_classes_confirmed", collections."storage_classes_confirmed_at", collections."delete_at", collections."trash_at", collections."is_trashed", collections."version", collections."current_version_uuid", collections."preserve_version", collections."file_count", collections."file_size_total" FROM "collections" WHERE (NOT (((collections.owner_uuid IN (SELECT group_uuid FROM trashed_groups WHERE trash_at <= statement_timestamp())) OR collections.trash_at <= statement_timestamp() IS TRUE)) AND collections.uuid = collections.current_version_uuid) AND (collections.owner_uuid IN (SELECT users.uuid FROM users WHERE users.uuid != ('zzzzz-tpzed-user1user1user1')) OR EXISTS(SELECT 1 FROM groups as gp where gp.uuid=collections.owner_uuid and gp.group_class != 'project')) AND ((1=0 OR 1=1 OR 1=0 OR 1=0) AND (collections.properties->>'type' NOT IN ('log','intermediate') OR collections.properties->>'type' IS NULL) AND (collections.name ilike '%%') AND (collections.uuid <> 'zzzzz-j7d0g-publicfavorites' OR collections.uuid IS NULL)) ORDER BY collections.modified_at desc LIMIT 9 OFFSET 0;

QUERY PLAN
Limit  (cost=103.90..12317.51 rows=9 width=499)
  ->  Index Scan Backward using index_collections_on_modified_at_and_uuid on collections  (cost=103.90..7865666.06 rows=5796 width=499)
        Filter: ((NOT (hashed SubPlan 1)) AND ((name)::text ~~* '%%'::text) AND (((uuid)::text <> 'zzzzz-j7d0g-publicfavorites'::text) OR (uuid IS NULL)) AND ((uuid)::text = (current_version_uuid)::text) AND ((trash_at <= statement_timestamp()) IS NOT TRUE) AND (((properties ->> 'type'::text) <> ALL ('{log,intermediate}'::text[])) OR ((properties ->> 'type'::text) IS NULL)) AND ((hashed SubPlan 2) OR (hashed SubPlan 4)))
        SubPlan 1
          ->  Seq Scan on trashed_groups  (cost=0.00..17.61 rows=707 width=28)
                Filter: (trash_at <= statement_timestamp())
        SubPlan 2
          ->  Index Only Scan using index_users_on_uuid on users  (cost=0.28..76.50 rows=3040 width=28)
                Filter: ((uuid)::text <> 'zzzzz-tpzed-user1user1user1'::text)
        SubPlan 4
          ->  Seq Scan on groups gp  (cost=0.00..1314.24 rows=182 width=32)
                Filter: ((group_class)::text <> 'project'::text)

Note that the index scan has a huge total cost, but it's capped by the parent limit node. When you do the count, you lift the limit and have to pay the entire cost:

EXPLAIN SELECT COUNT(*) FROM "collections" WHERE (NOT (((collections.owner_uuid IN (SELECT group_uuid FROM trashed_groups WHERE trash_at <= statement_timestamp())) OR collections.trash_at <= statement_timestamp() IS TRUE)) AND collections.uuid = collections.current_version_uuid) AND (collections.owner_uuid IN (SELECT users.uuid FROM users WHERE users.uuid != ('zzzzz-tpzed-user1user1user1')) OR EXISTS(SELECT 1 FROM groups as gp where gp.uuid=collections.owner_uuid and gp.group_class != 'project')) AND ((1=0 OR 1=1 OR 1=0 OR 1=0) AND (collections.properties->>'type' NOT IN ('log','intermediate') OR collections.properties->>'type' IS NULL) AND (collections.name ilike '%%') AND (collections.uuid <> 'zzzzz-j7d0g-publicfavorites' OR collections.uuid IS NULL));

QUERY PLAN
Aggregate  (cost=7826420.24..7826420.25 rows=1 width=8)
  ->  Seq Scan on collections  (cost=103.47..7826405.75 rows=5796 width=0)
        Filter: ((NOT (hashed SubPlan 1)) AND ((name)::text ~~* '%%'::text) AND (((uuid)::text <> 'zzzzz-j7d0g-publicfavorites'::text) OR (uuid IS NULL)) AND ((uuid)::text = (current_version_uuid)::text) AND ((trash_at <= statement_timestamp()) IS NOT TRUE) AND (((properties ->> 'type'::text) <> ALL ('{log,intermediate}'::text[])) OR ((properties ->> 'type'::text) IS NULL)) AND ((hashed SubPlan 2) OR (hashed SubPlan 4)))
        SubPlan 1
          ->  Seq Scan on trashed_groups  (cost=0.00..17.61 rows=707 width=28)
                Filter: (trash_at <= statement_timestamp())
        SubPlan 2
          ->  Index Only Scan using index_users_on_uuid on users  (cost=0.28..76.50 rows=3040 width=28)
                Filter: ((uuid)::text <> 'zzzzz-tpzed-user1user1user1'::text)
        SubPlan 4
          ->  Seq Scan on groups gp  (cost=0.00..1314.24 rows=182 width=32)
                Filter: ((group_class)::text <> 'project'::text)

On this database, just this one query is two orders of magnitude more expensive than any other run by a /groups/contents request.

#14

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-06-19 sprint to Development 2024-07-03 sprint

#15

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-07-03 sprint to Development 2024-07-24 sprint

#16

Updated by Peter Amstutz 6 months ago

  • Target version changed from Development 2024-07-24 sprint to Development 2024-08-07 sprint

#17

Updated by Peter Amstutz 5 months ago

  • Assigned To changed from Brett Smith to Stephen Smith

#18

Updated by Peter Amstutz 5 months ago

  • Subject changed from Bad performance of "shared with me" to How to efficiently page the "shared with me" query

#19

Updated by Peter Amstutz 5 months ago

  • Assigned To changed from Stephen Smith to Tom Clegg
  • Subject changed from How to efficiently page the "shared with me" query to How to efficiently page group contents queries of more than one object type (e.g. "shared with me")

#20

Updated by Stephen Smith 5 months ago

Some additional investigation needed:

In theory (according to Peter) using count=none alone should lighten the load across the board, but for some reason I only see a noticeable impact on the first page, which takes it from ~5s to ~0.2s. Once an offset is added (even just offset=1), it performs about the same as when count=exact.

Example queries:
https://tordo.arvadosapi.com/arvados/v1/groups/contents?limit=50&offset=1&exclude_home_project=true&count=none

(As Brett suggested above, I removed the empty filters, which takes 5s down to 2s; but that only works when you're not using filters, and there's still a noticeable difference between 2s and 0.2s.)

https://tordo.arvadosapi.com/arvados/v1/groups/contents?limit=50&offset=0&exclude_home_project=true&count=none

Offset 0 completes in 0.2 seconds.

#21

Updated by Brett Smith 5 months ago

I'm happy to run my analysis with offset=1 and see if that offers any more insight. I would expect it to run a little slower, because of the way the API server currently works: it keeps loading results until it has data for the current page, then returns that page and discards everything it loaded before. So the further out your offset, the more data it loads and discards before it can return a result. But I would expect that to be a small time increase with offset=1, not an order of magnitude, so I'll dig into that.
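
To make that mechanism concrete, here is a simplified model of the multi-table paging (illustrative Ruby, not the actual controller code; matching_rows is a hypothetical stand-in for the per-table query):

    def contents_page(tables, offset, limit)
      items = []
      tables.each do |table|
        break if items.size >= limit
        rows = table.matching_rows  # hypothetical: loads every row the user can see
        items.concat(rows.drop(offset).take(limit - items.size))
        # Whatever part of the offset this table absorbed must not carry over.
        offset = [offset - rows.size, 0].max
      end
      items
    end

Computing the carried-over offset requires knowing how many rows matched in each table, which is why offsets and counts are entangled here.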

#22

Updated by Tom Clegg 5 months ago

This was added to source:services/api/app/controllers/arvados/v1/groups_controller.rb#L212 in #17589#note-20:

    if params['count'] == 'none' and @offset != 0 and (params['last_object_class'].nil? or params['last_object_class'].empty?)
      # can't use offset without getting counts, so
      # fall back to count=exact behavior.
      params['count'] = 'exact'
      set_count_none = true
    end

It looks like it's time for a different solution to that previous problem.

#23

Updated by Peter Amstutz 5 months ago

  • Assigned To changed from Tom Clegg to Peter Amstutz

#24

Updated by Peter Amstutz 5 months ago

21814-contents-paging @ 64dd84f22415c55e69403eacbfec431b5568921c

developer-run-tests: #4367

  • All agreed upon points are implemented / addressed.
    • Avoids getting full table counts unless necessary; worst case is fetching (offset*2 + limit) rows
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • no
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • Passes existing count=none + offset test
  • Documentation has been updated.
    • n/a
  • Behaves appropriately at the intended scale (describe intended scale).
    • Should greatly improve scale by making paging work much more efficiently with group contents and count=none
  • Considered backwards and forwards compatibility issues between client and server.
    • Does not change the behavior of the API, but is more efficient.
    • I removed an undocumented legacy parameter 'last_object_class', see below
  • Follows our coding standards and GUI style guidelines.
    • yes

The branch now works by fetching with count=none first. Only if no rows are returned and offset > 0 does it get a full row count, to determine how to properly adjust the offset for the query into the next table (a sketch of this logic follows below).

Because 'last_object_class' interacted with count=none and offset in a way whose original intent was hard to determine, I removed this legacy, undocumented parameter. It was only ever used by Workbench 1, was never documented, and has no tests. The option was never included in the Go SDK's 'GroupContentsOptions' struct, so I believe Controller has already been ignoring it for years.
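
For illustration, a minimal sketch of that offset logic with hypothetical names (count_rows stands in for a count-only query, e.g. SELECT COUNT(*) with the same conditions; this is not the actual branch code):

    def next_table_offset(rows, offset, count_rows)
      if rows.any? || offset.zero?
        # This table satisfied the whole offset; later tables start from the top.
        0
      else
        # count=none returned nothing with offset > 0: run one count-only query
        # to learn how much of the offset this table consumed, and carry the rest.
        [offset - count_rows.call, 0].max
      end
    end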

#25

Updated by Tom Clegg 4 months ago

This looks like the right solution.

I think this comment is a little misleading/confusing:

      # count=none, offset > 0, and nothing was returned, we don't
      # know if this meant there were no matches, or offset caused it
      # to run off the end

If I'm following correctly, one of those possibilities is a special case of the other, and we don't actually care which is true because they are handled exactly the same way. Does this alternative explanation sound accurate?

      # The appropriate @offset for querying the next table depends on
      # how many matching rows in this table were skipped due to the
      # current @offset. If we retrieved any items (or @offset is
      # already zero), then clearly exactly @offset rows were skipped,
      # and the correct @offset for the next table is zero.
      # Otherwise, we need to count all matching rows in the current
      # table, and subtract that from @offset. If our previous query
      # used count=none, we will need an additional query to get that
      # count.

I think @offset = 0 in "just get the count" is superfluous, because object_list() already uses except(:offset) to get the count.

I tried simply changing set_count_none = true to ... = false in main, and all of our GroupsController tests passed. I think this means we need some new tests to exercise the new code.

#26

Updated by Peter Amstutz 4 months ago

Tom Clegg wrote in #note-25:

> This looks like the right solution.
>
> I think this comment is a little misleading/confusing:
>
> [...]
>
> If I'm following correctly, one of those possibilities is a special case of the other, and we don't actually care which is true because they are handled exactly the same way. Does this alternative explanation sound accurate?
>
> [...]

Your explanation is accurate; I updated the comment.

> I think @offset = 0 in "just get the count" is superfluous, because object_list() already uses except(:offset) to get the count.

Even better: I pulled out a query that gets just the count, which both avoids fumbling with @offset and @limit and avoids a completely unnecessary query for items (which, even though we already know it returns no results, still takes time).

> I tried simply changing set_count_none = true to ... = false in main, and all of our GroupsController tests passed. I think this means we need some new tests to exercise the new code.

Did you run "count none works with offset" in api/test/integration/groups_test.rb?

21814-contents-paging @ 64e0404403e98e9562e14f43c3cf54bb3c94bb85

developer-run-tests: #4375

#27

Updated by Tom Clegg 4 months ago

> Even better: I pulled out a query that gets just the count

Yes, even better.

> Did you run "count none works with offset" in api/test/integration/groups_test.rb?

Ah, no, I only ran the controller tests.

If that's our only test for the paging logic, maybe it's worth adding a check that the first query (offset=0) actually returns multiple object types. (It does now, but fixtures could change and silently neuter the test; it would also make it more obvious what the test case is trying to test.) A sketch of such an assertion follows.
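
Something along these lines, assuming the test parses the response body into json_response as other API server tests do (treat the helper name as an assumption):

    # Guard against fixture drift: the offset=0 page must actually exercise
    # paging across more than one object type for the test to mean anything.
    kinds = json_response["items"].map { |item| item["kind"] }.uniq
    assert_operator kinds.size, :>, 1,
                    "expected the offset=0 page to span multiple object types, got #{kinds}"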

Other than that, LGTM, thanks.

#28

Updated by Peter Amstutz 4 months ago

  • Status changed from In Progress to Resolved

#29

Updated by Peter Amstutz 4 months ago

  • Release set to 70
