Project

General

Profile

Actions

Bug #20472

closed

Too many ProbesPerSecond kills the database

Added by Peter Amstutz 10 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Start date:
05/04/2023
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

I set the dispatcher ProbesPerSecond to 10 and it rapidly started to generate a request backlog on the API server. Seems like it is still fighting over the container table lock.

When I set it back to 5 per second, the backlog went away.

What the probes doing that causes it to slam the database like that?

Do container table updates need a separate limiter like the logs endpoint so they don't kill the API server?


Subtasks 1 (0 open1 closed)

Task #20478: Review 20472-priority-updateResolvedPeter Amstutz05/04/2023

Actions
Actions #1

Updated by Peter Amstutz 10 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 10 months ago

Possibly related, I'm seeing requests like this, which start at half a second but trend upward to 4 seconds or more in timeToStatus:

May  2 23:01:14 ip-172-25-135-50 arvados-controller37525: {"ClusterID":"xxxxx","PID":37525,"RequestID":"req-s7fhgp22ogj91n6ek50f","level":"info","msg":"response","remoteAddr":"127.0.0.1:53290","reqBytes":48,"reqForwardedFor":"172.25.133.12","reqHost":"xngs1.rdcloud.bms.com:443","reqMethod":"PUT","reqPath":"arvados/v1/containers/xngs1-dz642-sn7gk7p6rx2p51g","reqQuery":"","respBytes":369734,"respStatus":"OK","respStatusCode":200,"time":"2023-05-02T23:01:14.469931456Z","timeToStatus":0.534452,"timeTotal":0.535109,"timeWriteBody":0.000657}

Actions #3

Updated by Peter Amstutz 10 months ago

Before we added the big lock we had been running with 400 nodes at one point and with the big lock we can't go over 200, and the failure mode seems to be that the entire system comes to a screeching halt when the request queue fills up.

We need to figure out a better way to do this.

Could we update priorities using a recursive query, similar to the way that the permission updates work?

Actions #4

Updated by Peter Amstutz 10 months ago

Container priority update notes.

Container priority is based on max(priority) of container requests.

If a container request is owned by a container, then the applicable priority is inherited from the parent container.

If a container request is a root, the the container priority is computed like this:

(cr.priority << 50) - (cr.created_at.to_time.to_f * 1000).to_i

what that's saying is that the smaller value of created_at (earlier in time) will get a bigger value so it runs first

Take the max and propagate it downward.

The thing that is a bit complicated is that for each container, there could be more than one container request controlling priority. So we propagate to child containers and they go through the same calculation of looking at the parent requests.

The priority of a child container request is completely irrelevant, only the root requests matter, right?

-- this means if I set the priority to 0 on a container request, right now the UI will suggest it is in "cancelling" state but it won't have any impact on the priority of the underlying container

This means we can't actually cancel individual workflow steps in the current code, which explains some of the weird behavior I've been seeing.

TBC

Actions #5

Updated by Peter Amstutz 10 months ago

New update strategy

Use a recursive query to construct a table with the updated container priorities, then use that to update the containers table.

  1. Start from a container request and initial propagated priority
  2. Get container requests with the same container_uuid (excluding the one we are starting from)
    1. Make a temporary table of (CR, priority)
      1. CR we visited is associated with the propagated priority
      2. other CR with requesting_container_uuid gets the priority of the requesting container
      3. other CR with no requesting_container_uuid gets computed priority based on the CR priority
    2. Get max(priority)
    3. Emit query row (container_uuid, max(priority))
    4. Get container requests with matching requesting_container_uuid (child requests)
  3. Recurse on each container request
Actions #6

Updated by Peter Amstutz 10 months ago

  • Assigned To set to Peter Amstutz
  • Status changed from New to In Progress
  • Category set to API
Actions #7

Updated by Peter Amstutz 10 months ago

20472-priority-update @ e881f30542392c8328909139a94609230aee9975

  • Use a recursive query to compute and update container priorities instead of Ruby code
Actions #9

Updated by Peter Amstutz 10 months ago

20472-priority-update @ db14d5a612dd8db80a97d954f3364f23f21b005d

This just adds a new test, which didn't uncover any issues.

developer-run-tests: #3637

Actions #10

Updated by Peter Amstutz 10 months ago

This error is odd.

App 23218 output: [req-gyzur3lc1uytnwszsubh]   ESC[1mESC[36mContainer Load (0.8ms)ESC[0m  ESC[1mESC[34mSELECT  containers."uuid", containers."priority" FROM "containers" WHERE (NOT ((container
s.owner_uuid IN (SELECT group_uuid FROM trashed_groups WHERE trash_at <= statement_timestamp())))) AND (containers.uuid='zzzzz-dz642-k21k06fyd1rvefg') ORDER BY "containers"."id" ASC LIMIT $1 O
FFSET $2ESC[0m  [["LIMIT", 1], ["OFFSET", 0]]

App 23218 output: [req-gyzur3lc1uytnwszsubh]   ESC[1mESC[36mupdate_priorities (17.0ms)ESC[0m  ESC[1mESC[33m
App 23218 output: update containers set priority=computed.upd_priority from (select pri_container_uuid, upd_priority from update_priorities($1) order by pri_container_uuid) as computed
App 23218 output:  where containers.uuid = computed.pri_container_uuid and priority != computed.upd_priority
App 23218 output: ESC[0m  [[nil, "zzzzz-dz642-k21k06fyd1rvefg"]]
App 23218 output: #<ActiveRecord::RecordNotFound: Couldn't find Container without an ID>
App 23218 output: Error 1683217405+4bbf95ec: 404
App 23218 output: {"method":"POST","path":"/arvados/v1/containers/zzzzz-dz642-k21k06fyd1rvefg/update_priority","format":"html","controller":"Arvados::V1::ContainersController","action":"update_priority","status":404,"duration":24.4,"view":0.4,"db":17.78,"ClusterID":"zzzzz","request_id":"req-gyzur3lc1uytnwszsubh","client_ipaddr":"127.0.0.1","client_auth":"zzzzz-gj3su-000000000000000","params":{"select":"[\"uuid\",\"priority\"]"},"@timestamp":"2023-05-04T16:23:25.999084961Z","@version":"1","message":"[404] POST /arvados/v1/containers/zzzzz-dz642-k21k06fyd1rvefg/update_priority (Arvados::V1::ContainersController#update_priority)"}

Actions #11

Updated by Peter Amstutz 10 months ago

20472-priority-update @ d018aa211d3efc93c711248851db379b19fa60a1

  • Put back some code in find_objects_for_index that turns out still needed to be there.

developer-run-tests: #3638

Actions #12

Updated by Tom Clegg 10 months ago

Some comments deserve to be updated/removed

       # Lock containers table to avoid deadlock in cascading priority update (see #20240)

2 of these:

       # If no attributes are being updated besides these, there are no
       # cascading changes to other rows/tables, so we should just use
       # row locking.

Is the "when container_requests.priority = 0" case impossible/superfluous here due to "where ... container_requests.priority > 0" in the query?

select coalesce(max(case when container_requests.priority = 0 then 0
                         when containers.uuid = inherited_from then inherited
                         when containers.priority is not NULL then containers.priority
                         else container_requests.priority * 1125899906842624::bigint - (extract(epoch from container_requests.created_at)*1000)::bigint
                    end), 0) from
    container_requests left outer join containers on container_requests.requesting_container_uuid = containers.uuid
    where container_requests.container_uuid = for_container_uuid and container_requests.state = 'Committed' and container_requests.priority > 0;

I found the "update" in the sql function name update_priorities a little confusing -- maybe container_tree_priority would make it a little clearer how it relates to container_tree and container_priority?

Do we also need to call row_lock_for_priority_update when assigning a container_uuid to a new or existing CR?

Actions #13

Updated by Peter Amstutz 10 months ago

Tom Clegg wrote in #note-12:

Some comments deserve to be updated/removed

Is the "when container_requests.priority = 0" case impossible/superfluous here due to "where ... container_requests.priority > 0" in the query?

yes. Removed.

I found the "update" in the sql function name update_priorities a little confusing -- maybe container_tree_priority would make it a little clearer how it relates to container_tree and container_priority?

Now called "container_tree_priorities"

Do we also need to call row_lock_for_priority_update when assigning a container_uuid to a new or existing CR?

Yes, I think you're right. I checked and using "SELECT ORDER BY FOR UPDATE" is the correct way to ensure the locks are acquired in a specific order.

20472-priority-update @ d30cad367216981766344fea34c3d439bafbc8f5

developer-run-tests: #3640

re-run wb1 test: developer-run-tests-apps-workbench-integration: #3929

Actions #14

Updated by Tom Clegg 10 months ago

LGTM, thanks

Actions #15

Updated by Peter Amstutz 10 months ago

  • Status changed from In Progress to Resolved
Actions #16

Updated by Peter Amstutz 6 months ago

  • Release set to 66
Actions

Also available in: Atom PDF