Bug #20447

closed

Container table lock contention

Added by Peter Amstutz 11 months ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Story points: -
Release relationship: Auto

Description

I need to look at postgres status to see what is going on, but I have a theory:

  1. We put a "big lock" around the containers table: all write operations have to take an exclusive lock on the table. Unfortunately this includes container operations that don't affect priorities, though it may be possible to narrow this. (#20240)
  2. This means all container operations now have to wait to get the lock
  3. We also added a feature whereby each time a "running containers probe" happens, it updates the "cost" on the API server (#19967)
  4. This means write operations on containers now happen much more frequently than just when containers change state
  5. As a result, requests involving containers are forced to wait in line, filling up the request queue and making everything slow.

On the plus side, the dispatcher's behavior of backing off when it sees 500 errors seems to be successfully keeping the system load from spiraling out of control.

This also suggests that a short-term fix for system load is to increase ProbeInterval.
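As a rough back-of-the-envelope check on why a longer ProbeInterval should help, the sketch below estimates how much of the time the table lock is held by cost updates alone. It assumes one cost update per running container per probe and a fixed lock hold time per update. The function name, the parameters, and all the numbers are hypothetical, not measurements from the affected cluster.

```python
def lock_utilization(running_containers: int,
                     probe_interval_s: float,
                     lock_hold_s: float) -> float:
    """Estimate the fraction of time the table lock is held by cost updates.

    Assumes one cost update per running container per probe, each holding
    the exclusive table lock for lock_hold_s seconds. A result at or above
    1.0 means updates arrive faster than they can be serialized, so the
    request queue keeps growing.
    """
    if probe_interval_s <= 0:
        raise ValueError("probe_interval_s must be positive")
    writes_per_second = running_containers / probe_interval_s
    return writes_per_second * lock_hold_s


# Illustrative numbers only: 500 running containers, 20 ms per locked update.
for interval in (5.0, 10.0, 60.0):
    print(interval, lock_utilization(500, interval, 0.02))
```

Under these assumptions the estimate scales inversely with the probe interval, which is consistent with a longer interval reducing load without removing the contention itself.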

Update:

Some supporting evidence:

  1. After Lucas adjusted ProbeInterval this morning, the concurrent requests are down.
  2. I was able to connect to the database and look at active queries. After changing ProbeInterval it is still the case that about 30%-40% of pending queries are "LOCK TABLE containers IN EXCLUSIVE mode"
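For anyone repeating this measurement, here is a minimal sketch of how the share of lock statements could be computed from a list of query strings. The SQL shown is only one plausible way to get that list (it uses the standard `pg_stat_activity` view) and is not necessarily the query that was run here. The function name and the sample data are hypothetical.

```python
# One plausible way to list active queries; an assumption, not the
# query actually used in this ticket.
ACTIVE_QUERIES_SQL = "SELECT query FROM pg_stat_activity WHERE state = 'active'"


def lock_statement_fraction(queries: list[str]) -> float:
    """Return the fraction of queries that explicitly lock the containers table."""
    if not queries:
        return 0.0
    needle = "lock table containers in exclusive mode"
    # Normalize case and whitespace so formatting differences don't matter.
    hits = sum(1 for q in queries if needle in " ".join(q.lower().split()))
    return hits / len(queries)


# Made-up sample: 4 of 10 pending queries are the table lock.
sample = ["LOCK TABLE containers IN EXCLUSIVE mode"] * 4 + ["SELECT 1"] * 6
print(lock_statement_fraction(sample))
```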

Subtasks 1 (0 open, 1 closed)

Task #20460: Review 20447-less-table-locking | Resolved | Peter Amstutz | 05/01/2023
Actions #1

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #5

Updated by Brett Smith 11 months ago

  • Description updated (diff)
Actions #6

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #8

Updated by Peter Amstutz 11 months ago

  • Category set to API
Actions #9

Updated by Peter Amstutz 11 months ago

  • Subject changed from Container table busy to Container table lock contention
Actions #11

Updated by Peter Amstutz 11 months ago

  • Release set to 63
Actions #12

Updated by Peter Amstutz 11 months ago

  • Assigned To set to Tom Clegg
Actions #13

Updated by Tom Clegg 11 months ago

  • Status changed from New to In Progress
Actions #15

Updated by Peter Amstutz 11 months ago

I'm wondering if we could avoid taking the big lock on container request create/update, or at least defer it until the actual priority update happens?

Actions #16

Updated by Tom Clegg 11 months ago

I think locking the table after doing anything else that can conflict with a table lock (like "select for update") will end up causing deadlock.
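The deadlock concern can be illustrated with a toy wait-for graph. This is only a model, not Arvados code. It relies on two PostgreSQL facts: `SELECT ... FOR UPDATE` takes a ROW SHARE lock on the table, and EXCLUSIVE mode conflicts with ROW SHARE. The transaction names and helper functions are hypothetical.

```python
# Conflicting table-level lock mode pairs (held, wanted); only the modes
# used in this toy model are listed.
CONFLICTS = {
    ("ROW SHARE", "EXCLUSIVE"),
    ("EXCLUSIVE", "ROW SHARE"),
    ("EXCLUSIVE", "EXCLUSIVE"),
}


def wait_for_graph(held: dict[str, set[str]], wanted: dict[str, str]) -> dict[str, set[str]]:
    """Map each waiting transaction to the transactions blocking it."""
    return {
        txn: {
            other for other, modes in held.items()
            if other != txn and any((m, mode) in CONFLICTS for m in modes)
        }
        for txn, mode in wanted.items()
    }


def has_cycle(graph: dict[str, set[str]]) -> bool:
    """Depth-first search for a cycle, which corresponds to a deadlock."""
    visiting, done = set(), set()

    def visit(node: str) -> bool:
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        found = any(visit(n) for n in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(n) for n in graph)


# Row locks first: A and B each hold ROW SHARE, then both ask for EXCLUSIVE.
late_lock = wait_for_graph({"A": {"ROW SHARE"}, "B": {"ROW SHARE"}},
                           {"A": "EXCLUSIVE", "B": "EXCLUSIVE"})
# Table lock first: A already holds EXCLUSIVE, B simply queues behind it.
early_lock = wait_for_graph({"A": {"EXCLUSIVE"}, "B": set()},
                            {"B": "EXCLUSIVE"})
print(has_cycle(late_lock), has_cycle(early_lock))
```

In the first scenario each transaction waits on the other, so neither can proceed. In the second, the later transaction just waits its turn.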

I've updated the CR controller with a similar attribute whitelist, though, since name/description/etc updates don't cause cascading priority updates.

20447-less-table-locking @ b68d2d12f4dff73d371297688d84f32289c06907 -- developer-run-tests: #3625

I think the main thing here is for f4667c534 to remove the table lock in the case of cost updates, which happen frequently on every running container, whereas other container and CR updates typically happen O(1) times per container.
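The idea of skipping the table lock for updates that cannot affect priorities can be sketched as follows. The real change lives in the API server and may look quite different. The attribute sets below are assumptions based only on the examples mentioned in this thread (cost for containers, name and description for container requests), and the function name is hypothetical.

```python
# Hypothetical lists of attributes whose updates cannot trigger cascading
# priority updates, taken from the examples in this thread.
LOCK_FREE_ATTRS = {
    "container": frozenset({"cost"}),
    "container_request": frozenset({"name", "description"}),
}


def needs_table_lock(kind: str, changed_attrs: set[str]) -> bool:
    """Return True unless every changed attribute is on the lock-free list."""
    allowed = LOCK_FREE_ATTRS.get(kind, frozenset())
    return not changed_attrs <= allowed


print(needs_table_lock("container", {"cost"}))              # frequent cost update
print(needs_table_lock("container", {"cost", "priority"}))  # may cascade
print(needs_table_lock("container_request", {"name"}))
```

The decision is made before any row is touched, so an update that does need the lock can still take it first, in line with the deadlock concern above.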

Actions #17

Updated by Peter Amstutz 11 months ago

Tom Clegg wrote in #note-16:

> I think locking the table after doing anything else that can conflict with a table lock (like "select for update") will end up causing deadlock.
>
> I've updated the CR controller with a similar attribute whitelist, though, since name/description/etc updates don't cause cascading priority updates.
>
> 20447-less-table-locking @ b68d2d12f4dff73d371297688d84f32289c06907 -- developer-run-tests: #3625
>
> I think the main thing here is for f4667c534 to remove the table lock in the case of cost updates, which happen frequently on every running container, whereas other container and CR updates typically happen O(1) times per container.

You are probably right. Let's merge this and we can collect more data to see if it solves the main performance issues.

Actions #18

Updated by Tom Clegg 11 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved