Bug #20183
closedMemory usage: Move after_commit UpdatePriority.run_update_thread to controller
Description
Currently, the UpdatePriority.run_update_thread hook appears to be associated with unusually high memory usage in passenger workers processes.
Proposed fix is to handle these asynchronous updates using a rails handler called by a controller thread, similar to trash sweeps.
After a container create/update/delete call, controller should- acquire a (new) dblock.DBLocker,
UpdateContainerPriority
, waiting if necessary - check for "priority>0 but should be 0" and "priority=0 but should be >0" using queries like the ones in the current ruby code source:services/api/lib/update_priority.go
- if any need fixing, call rails to fix them in batches, using a (new) system-only endpoint
update_priority
- release
UpdateContainerPriority
lock
This should be done in a singleton goroutine that runs separately from the http request handlers, such that no more than 1 check-and-correct operation is in progress at any given time, and the http client doing a container update doesn't need to wait for it to finish.
When anything needs fixing, controller should log the timing and success/failure (according to rails) of each batch.
Related issues
Updated by Tom Clegg over 1 year ago
- Description updated (diff)
- Subject changed from Move after_commit UpdatePriority.run_update_thread to controller to Memory usage: Move after_commit UpdatePriority.run_update_thread to controller
Updated by Tom Clegg over 1 year ago
- Target version changed from Future to Development 2023-03-15 sprint
- Assigned To set to Tom Clegg
- Status changed from New to In Progress
Updated by Tom Clegg over 1 year ago
- Related to Feature #20192: Move AuditLogs.tidy_in_background from Rails to controller added
Updated by Tom Clegg over 1 year ago
20183-update-priority-thread @ ba41b025030022f6becf48f569d74c36f2a45ce4 -- developer-run-tests: #3532
I think I found an explanation for the pathological behavior we were seeing (a priority-updater thread seemingly stuck using lots of CPU and RAM).
In the Rails implementation the "find containers that have priority=0 but need priority>0" query was returning all containers with active (committed, priority>0) requests. However, it is possible for all such requests to have parent containers (requesting_container_uuid) with priority=0, in which case Container.update_priority!() leaves the container priority at 0.
With a big enough set of N containers, this could effectively cause infinite recursion: the updater thread called update_priority!() on each, which left priority=0 but still triggered N after_commit hooks, at least one of which (assuming the entire batch of these futile updates took >5s to process) started a new updater thread, which called update_priority!() on all N containers, etc.
Updated by Peter Amstutz over 1 year ago
20183-update-priority-thread @ cd171959303236ba9a85e6524e6c40310203325b
A prose explanation of what is happening/intended to happen in containerPriorityUpdate
would be helpful, it is pretty dense with the joins.
Here's my try:
Get containers, where the container is queued/locked/running, the container's current priority is zero, the associated container request priority is greater than zero, and the parent priority is greater than zero (or there is no parent).
For these containers, the priority is wrong, so tell the rails API to sync the priority for this container. Rails API sets the priority on the container to match either the parent container (if exists) or the container request (if not). This is propagated down to all child container requests, which recursively sync their priority with the parent the same way.
WHERE containers.state IN ('Queued', 'Locked', 'Running')
AND containers.priority = 0
AND container_requests.uuid IS NOT NULL
AND (parent.uuid IS NULL OR parent.priority > 0)
What situation is container_requests.uuid IS NOT NULL
needed? The JOIN container_requests ON container_requests.container_uuid=containers.uuid
is an inner join so it won't return any container requests that don't have corresponding containers, I think.
I noticed another detail:
This is the query in update_priority!
p = ContainerRequest.
where('container_uuid=? and priority>0', uuid).
This is the corresponding part of the query in containerPriorityUpdate:
JOIN container_requests
ON container_requests.container_uuid=containers.uuid
AND container_requests.state = 'Committed' AND container_requests.priority > 0
I don't know if they should both check for 'Committed' or the containerPriorityUpdate should remove it, but it is slightly inconsistent.
So if I'm not missing anything, I think this would be clearer and align better with what update_priority!
is doing:
err := db.QueryRowxContext(ctx, ` SELECT containers.uuid from containers JOIN container_requests ON container_requests.container_uuid=containers.uuid AND container_requests.priority > 0 LEFT JOIN containers parent ON parent.uuid = container_requests.requesting_container_uuid WHERE containers.state IN ('Queued', 'Locked', 'Running') AND containers.priority = 0 AND (parent.uuid IS NULL OR parent.priority > 0) LIMIT 1`).Scan(&uuid)
One last thought, since priority propagates downwards, would it make sense to add ORDER BY created_at
so it checks the parent containers before the child containers?
Updated by Peter Amstutz over 1 year ago
Since Tom is out of office today but I want to build a release candidate, I am going to merge this, none of my comments strictly concern correctness and I'm satisfied with the tests. Will leave the ticket open, however.
Updated by Peter Amstutz over 1 year ago
- Status changed from In Progress to Feedback
Updated by Tom Clegg over 1 year ago
- Status changed from Feedback to In Progress
20183-update-priority-thread @ d9b8d0295f1fa1c1690b9fe80ba38f6f665de4e9 -- developer-run-tests: #3544
retry fuse developer-run-tests-services-fuse: #3590
- add comments
- make the rails/controller sql queries agree (see commit message)
- order by created_at
Updated by Peter Amstutz over 1 year ago
Tom Clegg wrote in #note-9:
20183-update-priority-thread @ d9b8d0295f1fa1c1690b9fe80ba38f6f665de4e9 -- developer-run-tests: #3544
retry fuse developer-run-tests-services-fuse: #3590
- add comments
- make the rails/controller sql queries agree (see commit message)
- order by created_at
This LGTM thanks!
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-03-15 sprint to Development 2023-03-29 Sprint
Updated by Tom Clegg over 1 year ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|5dd128f5a57e704e3b3ea5225130ca85bd3bb84c.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-03-29 Sprint to Development 2023-03-15 sprint