Bug #18339
closedSweepTrashedObjects scaling issues
Description
Customer reported problem in which a large number of projects, which had been trashed two weeks previously, became eligible for deletion all at once, and the resulting sweep processes crushed the database.
- If the last sweep hasn't finished, currently it starts a new, overlapping sweep operation. Instead, it should probably do something like try to take a lock (perhaps by on a empty, special-purpose table), and then if it can't acquire the lock, don't do anything (it checks the sweep task periodically, so it'll eventually run again when lock is no longer held).
- The default value for TrashSweepInterval is 60s. Deleting trashed projects is a very low priority process, so the default interval could be much lower (eg 5m or 15m or 60m)
- The delete operations should run in a transaction, to reduce overhead from auto-commit. Although, if it takes any locks they will be held for the duration of the transaction.
- Destroying a group also destroys any permission links pointing to it. Ruby on Rails is running the before_destroy hook for the permission link, which does an update_permissions operation, which takes the table lock on materialized_permissions. We can set
Thread.current[:suppress_update_permissions] = true
to prevent permissions from being recomputed, we just need to be confident that direct updates to the permission table are correct.
Related issues
Updated by Tom Clegg almost 3 years ago
- Status changed from New to In Progress
- Description updated (diff)
Updated by Tom Clegg almost 3 years ago
Maybe this is a good time to establish a better pattern for background tasks.
Propose:- Every controller process runs a "periodic trash sweep" goroutine
- Use pg_advisory_lock() to ensure only one such goroutine is active even when there are multiple controller processes
- RailsAPI has an admin-only synchronous "/sys/sweep_trashed_objects" endpoint instead of a background task
One side effect is that sweep_trashed_objects will run even when the cluster is idle (vs. current implementation which only runs when collection lookups happen).
Updated by Tom Clegg almost 3 years ago
18339-sweep-trash-lock @ aa3d4030686f5db784dcaf2d7f28225eb98c4267 -- developer-run-tests: #2800
Updated by Lucas Di Pentima almost 3 years ago
Some questions & comments:
- File sys_controller.rb
- Related comment: Looking at the old code that was kept on this branch, I'm seeing at line 22 that only project groups are selected for deletion, so other non project groups that have been being trashed will live forever on the trash unless they're untrashed and deleted again after #18340 is merged and released? Am I thinking this right? If so, maybe we'll need to add a cleanup migration?
- Lines 26-29 take care of moving trash-at-the-future objects to the trash, should that still need to be tied with the trash sweeping behavior? If the site config is set to garbage collect every N hours, the auto-trashing will be affected and may not be what the users expect? (I know this is the way it worked before, but still think is worth asking, maybe for another ticket)
- Would it be interesting to expose the
/sys/sweep_trashed_objects
to the site admin so that they can make manual sweeps?
Other than that, LGTM.
Updated by Tom Clegg almost 3 years ago
Lucas Di Pentima wrote:
[...] non project groups that have been being trashed will live forever on the trash unless they're untrashed and deleted again after #18340 is merged and released? Am I thinking this right? If so, maybe we'll need to add a cleanup migration?
It seems easiest, and possibly more efficient, to remove the where({group_class: 'project'}).
conditions. Then, pre-#18340-trashed zombie projects will be deleted without a migration.
...and add an upgrade note in case anyone is currently relying on zombie roles.
- Lines 26-29 take care of moving trash-at-the-future objects to the trash, should that still need to be tied with the trash sweeping behavior? If the site config is set to garbage collect every N hours, the auto-trashing will be affected and may not be what the users expect? (I know this is the way it worked before, but still think is worth asking, maybe for another ticket)
I think this is why the default sweep interval is 60s. The config comment does mention that auto-trash is affected, although I suppose determining the user-visible consequences of a longer interval is left as an exercise to the reader™... do we need to clarify this?
- Would it be interesting to expose the
/sys/sweep_trashed_objects
to the site admin so that they can make manual sweeps?
I suppose it's good form to add it to the discovery doc, at least.
Updated by Tom Clegg almost 3 years ago
- delete zombie role groups
- add upgrade note
- add test cases (missed adding the file in previous commit)
Updated by Lucas Di Pentima almost 3 years ago
One last observation:
- I think there's one missing place that's enforcing only projects to be deleted:
services/api/app/controllers/sys_controller.rb:L41
Other than that, LGTM!
Updated by Peter Amstutz almost 3 years ago
The upgrade notes should probably note the implications of the old behavior (role groups you thought were deleted, were not) and what the cleanup will do (delete them for real).
Updated by Tom Clegg almost 3 years ago
18339-sweep-trash-lock @ 8a33f7899da36343e687febb63678d90e83b7d63 -- developer-run-tests: #2807
Updated by Tom Clegg almost 3 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados-private:commit:arvados|e6769d20505e2c8c74b2d7e3f9c2f33f2a2db092.
Updated by Ward Vandewege over 2 years ago
- Related to Bug #18762: rails background tasks scaling issues added