Permission table slowness
Reported that permission updates on a large / busy cluster get slower and slower over time. Doing a "VACUUM FULL" fixes it but the slowness comes back after a few days.
Currently speculating that it may be due to the fact that every request that updates permission, trashed projects, or frozen projects currently creates a temporary table.
I think the queries that use temporary tables can be rewritten to use a CTE fairly easily, so that's worth a try.
Updated by Peter Amstutz 5 months ago
21030-update-perm-cte @ 4727334c7ef28cdb7fb2df89211c09eb5d51bef4
Note each item completed with additional detail if necessary. If an item is irrelevant to a specific branch, briefly explain why.
- All agreed upon points are implemented / addressed.
- The main bottleneck on permission updates has been identified, and the solution has been confirmed to improve performance on real-world user clusters.
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- Code is tested and passing, both automated and manual, what manual testing was done is described
- The changes to remove temporary tables are pass all tests
- The new index has been manually confirmed to improve performance by looking at query plans & query execution times
- Documentation has been updated.
- Behaves appropriately at the intended scale (describe intended scale).
- By avoiding a sequential scan, this both greatly improves performance and should hopefully also avoid the performance deterioration of accumulated dead rows between table vacuums.
- Considered backwards and forwards compatibility issues between client and server.
- Follows our coding standards and GUI style guidelines.
This branch has two improvements in response to reported performance problems.
The first improvement is to eliminate the creation of temporary tables. This turned out to not be the main source of performance problems, however research indicated that it is not best practice to use temporary tables for frequently executed queries, and confirmed that the pg_attributes table was accumulating dead rows due to the use of temporary tables. The temporary tables have been replaced by Computed Table Expressions (CTEs) which behave equivalently for our purposes but apparently have less overhead.
The second improvement, which does make a large performance improvement, is a new index on materialized permissions. Here's the comment in the code:
# The inner part of compute_permission_subgraph has a query clause like this: # # where u.perm_origin_uuid = m.target_uuid AND m.traverse_owned # AND (m.user_uuid = m.target_uuid or m.target_uuid not like '_____-tpzed-_______________') # # This will end up doing a sequential scan on # materialized_permissions, which can easily have millions of # rows, unless we fully index the table for this query. In one test, # this brought the compute_permission_subgraph query from over 6 # seconds down to 250ms.
A third detail is that we confirmed that the "SET LOCAL enable_mergejoin to false;" workaround is still needed on Postgres 14, so you'll see in the branch history that I took it out and then put it back.