Project

General

Profile

Actions

Bug #20183

closed

Memory usage: Move after_commit UpdatePriority.run_update_thread to controller

Added by Tom Clegg over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Story points:
-
Release relationship:
Auto

Description

Currently, the UpdatePriority.run_update_thread hook appears to be associated with unusually high memory usage in passenger workers processes.

Proposed fix is to handle these asynchronous updates using a rails handler called by a controller thread, similar to trash sweeps.

After a container create/update/delete call, controller should
  • acquire a (new) dblock.DBLocker, UpdateContainerPriority, waiting if necessary
  • check for "priority>0 but should be 0" and "priority=0 but should be >0" using queries like the ones in the current ruby code source:services/api/lib/update_priority.go
  • if any need fixing, call rails to fix them in batches, using a (new) system-only endpoint update_priority
  • release UpdateContainerPriority lock

This should be done in a singleton goroutine that runs separately from the http request handlers, such that no more than 1 check-and-correct operation is in progress at any given time, and the http client doing a container update doesn't need to wait for it to finish.

When anything needs fixing, controller should log the timing and success/failure (according to rails) of each batch.


Subtasks 1 (0 open1 closed)

Task #20194: Review 20183-update-priority-threadResolvedPeter Amstutz03/06/2023Actions

Related issues

Related to Arvados - Feature #20192: Move AuditLogs.tidy_in_background from Rails to controllerNewActions
Actions #1

Updated by Tom Clegg over 1 year ago

  • Description updated (diff)
  • Subject changed from Move after_commit UpdatePriority.run_update_thread to controller to Memory usage: Move after_commit UpdatePriority.run_update_thread to controller
Actions #2

Updated by Tom Clegg over 1 year ago

  • Target version changed from Future to Development 2023-03-15 sprint
  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress
Actions #3

Updated by Tom Clegg over 1 year ago

  • Related to Feature #20192: Move AuditLogs.tidy_in_background from Rails to controller added
Actions #4

Updated by Tom Clegg over 1 year ago

20183-update-priority-thread @ ba41b025030022f6becf48f569d74c36f2a45ce4 -- developer-run-tests: #3532

I think I found an explanation for the pathological behavior we were seeing (a priority-updater thread seemingly stuck using lots of CPU and RAM).

In the Rails implementation the "find containers that have priority=0 but need priority>0" query was returning all containers with active (committed, priority>0) requests. However, it is possible for all such requests to have parent containers (requesting_container_uuid) with priority=0, in which case Container.update_priority!() leaves the container priority at 0.

With a big enough set of N containers, this could effectively cause infinite recursion: the updater thread called update_priority!() on each, which left priority=0 but still triggered N after_commit hooks, at least one of which (assuming the entire batch of these futile updates took >5s to process) started a new updater thread, which called update_priority!() on all N containers, etc.

Actions #5

Updated by Peter Amstutz over 1 year ago

20183-update-priority-thread @ cd171959303236ba9a85e6524e6c40310203325b

A prose explanation of what is happening/intended to happen in containerPriorityUpdate would be helpful, it is pretty dense with the joins.

Here's my try:

Get containers, where the container is queued/locked/running, the container's current priority is zero, the associated container request priority is greater than zero, and the parent priority is greater than zero (or there is no parent).

For these containers, the priority is wrong, so tell the rails API to sync the priority for this container. Rails API sets the priority on the container to match either the parent container (if exists) or the container request (if not). This is propagated down to all child container requests, which recursively sync their priority with the parent the same way.

WHERE containers.state IN ('Queued', 'Locked', 'Running')
AND containers.priority = 0
AND container_requests.uuid IS NOT NULL
AND (parent.uuid IS NULL OR parent.priority > 0)

What situation is container_requests.uuid IS NOT NULL needed? The JOIN container_requests ON container_requests.container_uuid=containers.uuid is an inner join so it won't return any container requests that don't have corresponding containers, I think.

I noticed another detail:

This is the query in update_priority!

p = ContainerRequest.
where('container_uuid=? and priority>0', uuid).

This is the corresponding part of the query in containerPriorityUpdate:

JOIN container_requests
ON container_requests.container_uuid=containers.uuid
AND container_requests.state = 'Committed' AND container_requests.priority > 0

I don't know if they should both check for 'Committed' or the containerPriorityUpdate should remove it, but it is slightly inconsistent.

So if I'm not missing anything, I think this would be clearer and align better with what update_priority! is doing:

        err := db.QueryRowxContext(ctx, `
            SELECT containers.uuid from containers
            JOIN container_requests
             ON container_requests.container_uuid=containers.uuid
             AND container_requests.priority > 0
            LEFT JOIN containers parent
             ON parent.uuid = container_requests.requesting_container_uuid
            WHERE containers.state IN ('Queued', 'Locked', 'Running')
             AND containers.priority = 0
             AND (parent.uuid IS NULL OR parent.priority > 0)
            LIMIT 1`).Scan(&uuid)

One last thought, since priority propagates downwards, would it make sense to add ORDER BY created_at so it checks the parent containers before the child containers?

Actions #6

Updated by Peter Amstutz over 1 year ago

Since Tom is out of office today but I want to build a release candidate, I am going to merge this, none of my comments strictly concern correctness and I'm satisfied with the tests. Will leave the ticket open, however.

Actions #7

Updated by Peter Amstutz over 1 year ago

  • Release set to 57
Actions #8

Updated by Peter Amstutz over 1 year ago

  • Status changed from In Progress to Feedback
Actions #9

Updated by Tom Clegg over 1 year ago

  • Status changed from Feedback to In Progress

20183-update-priority-thread @ d9b8d0295f1fa1c1690b9fe80ba38f6f665de4e9 -- developer-run-tests: #3544

retry fuse developer-run-tests-services-fuse: #3590

  • add comments
  • make the rails/controller sql queries agree (see commit message)
  • order by created_at
Actions #10

Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote in #note-9:

20183-update-priority-thread @ d9b8d0295f1fa1c1690b9fe80ba38f6f665de4e9 -- developer-run-tests: #3544

retry fuse developer-run-tests-services-fuse: #3590

  • add comments
  • make the rails/controller sql queries agree (see commit message)
  • order by created_at

This LGTM thanks!

Actions #11

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-03-15 sprint to Development 2023-03-29 Sprint
Actions #12

Updated by Tom Clegg over 1 year ago

  • Status changed from In Progress to Resolved
Actions #13

Updated by Peter Amstutz over 1 year ago

  • Target version changed from Development 2023-03-29 Sprint to Development 2023-03-15 sprint
Actions

Also available in: Atom PDF