Bug #21540 (closed): occasional container_requests deadlock

Added by Peter Amstutz 11 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Story points: -
Release:
Release relationship: Auto

Description

arvados.cwl-runner: [container example] error submitting container

<HttpError 422 when requesting https://zzzzz.example.com/arvados/v1/container_requests/zzzzz-xvhdp-u047nzgnc3jdkd4?alt=json returned

"//railsapi.internal/arvados/v1/container_requests/zzzzz-xvhdp-u047nzgnc3jdkd4: 422 Unprocessable Entity: #<ActiveRecord::Deadlocked: PG::TRDeadlockDetected: ERROR: deadlock detected

DETAIL: Process 13792 waits for AccessExclusiveLock on tuple (801533,5) of relation 16562 of database 16400; blocked by process 31312.

Process 31312 waits for ShareLock on transaction 508183640; blocked by process 19685.

Process 19685 waits for ShareLock on transaction 508183565; blocked by process 13792.

HINT: See server log for query details. : select 1 from containers where containers.uuid in ( select pri_container_uuid from container_tree($1) UNION select container_requests.requesting_container_uuid from container_requests where container_requests.container_uuid = $1 and container_requests.state = 'Committed' and container_requests.requesting_container_uuid is not NULL ) order by containers.uuid for update > (req-4d3u0781f6rol0q7xux1)">

This is almost certainly a lock ordering issue. We should:

  1. Try to figure out what circumstances it happens and if it is fixable/avoidable
  2. Handle ActiveRecord::Deadlocked exceptions as 500 Internal Server Error so they are retried by the client (see the sketch below)
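
For illustration, a minimal sketch of point 2 in a generic Rails controller. This is an assumption about where such a handler could live, not the actual Arvados error-handling code, and the response shape is made up:

    # Hypothetical sketch: report ActiveRecord::Deadlocked as a retryable
    # 5xx error instead of a 422, so well-behaved clients retry.
    class ApplicationController < ActionController::Base
      rescue_from ActiveRecord::Deadlocked do |e|
        # 422 tells the client its request is invalid and should not be
        # retried; 500 signals a transient server-side failure.
        render json: {errors: ["Database deadlock, please retry: #{e.message}"]},
               status: 500
      end
    end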

Subtasks 1 (0 open, 1 closed)

Task #22262: Review 21540-priority-update-deadlock (Resolved, Tom Clegg, 10/30/2024)

Related issues 2 (1 open, 1 closed)

Related to Arvados - Bug #21547: return certain database errors as 500 so they can be retried (New, Peter Amstutz)
Related to Arvados - Bug #22165: Priority update possibly locking container requests by accident (Resolved, Peter Amstutz)
Actions #1

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-03-13 sprint
Actions #4

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-03-13 sprint to Future
  • Subject changed from occasional container_requests deadlock to convert
Actions #5

Updated by Peter Amstutz 11 months ago

  • Subject changed from convert to occasional container_requests deadlock
Actions #6

Updated by Peter Amstutz 11 months ago

  • Related to Bug #21547: return certain database errors as 500 so they can be retried added
Actions #7

Updated by Peter Amstutz 3 months ago

  • Target version changed from Future to Development 2024-11-06 sprint
Actions #9

Updated by Peter Amstutz 3 months ago

  • Assigned To set to Peter Amstutz
Actions #10

Updated by Peter Amstutz 3 months ago

  • Related to Bug #22165: Priority update possibly locking container requests by accident added
Actions #12

Updated by Peter Amstutz 3 months ago

  • Status changed from New to In Progress
Actions #13

Updated by Peter Amstutz 3 months ago

21540-priority-update-deadlock @ 78e23aba707bf8d8324c045c1fdf120c2e9bc082

developer-run-tests: #4531

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • Now catches and retries ActiveRecord::Deadlocked errors thrown in row_lock_for_priority_update
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • no new tests, because I don't know how to trigger this condition
  • Code is tested and passing, both automated and manual, what manual testing was done is described.
    • passes existing tests
  • New or changed UI/UX and has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • n/a
  • Behaves appropriately at the intended scale (describe intended scale).
    • Should help with scaling since busier clusters presumably would see this error happen more often.
  • Considered backwards and forwards compatibility issues between client and server.
    • no issue
  • Follows our coding standards and GUI style guidelines.
    • yes

I made the following changes:

  • I tweaked the transaction to start with `select containers.uuid` instead of `select 1` on the off chance that `select 1` would cause the optimizer to ignore the underlying row.
  • I changed `for update` to `for update of containers` to make sure postgres only locked container rows and not container_request rows.
  • I wrapped the statement with SAVEPOINT/ROLLBACK/retry to handle the ActiveRecord::Deadlocked exception. I did this because I can't confirm the previous two changes actually have any effect (my reading of the postgres docs is that it probably should have been doing the right thing all along, but it's hard to tell without going deep into the database). Because this statement only has one job (acquire the set of locks so the records can be updated later), it is safe to roll back and retry acquiring the locks; see the sketch below.
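
For reference, a hedged sketch of what these three changes could look like together. The SQL is adapted from the query in the description with the two tweaks applied, but the Ruby wrapper, method signature, and retry count are illustrative, not the actual code on the branch:

    # Illustrative only: the ordered-lock query with the two SQL tweaks
    # (select containers.uuid, for update of containers), wrapped in a
    # savepoint so a deadlock rolls back just this statement and the lock
    # acquisition can be retried.
    def row_lock_for_priority_update(container_uuid, attempts = 3)
      conn = ActiveRecord::Base.connection
      uuid = conn.quote(container_uuid)
      begin
        # requires_new: true makes Rails emit SAVEPOINT / ROLLBACK TO
        # SAVEPOINT inside the enclosing transaction.
        ActiveRecord::Base.transaction(requires_new: true) do
          conn.execute <<~SQL
            select containers.uuid from containers
              where containers.uuid in (
                select pri_container_uuid from container_tree(#{uuid})
                UNION
                select container_requests.requesting_container_uuid
                  from container_requests
                  where container_requests.container_uuid = #{uuid}
                    and container_requests.state = 'Committed'
                    and container_requests.requesting_container_uuid is not NULL)
              order by containers.uuid for update of containers
          SQL
        end
      rescue ActiveRecord::Deadlocked
        attempts -= 1
        retry if attempts > 0
        raise
      end
    end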
Actions #15

Updated by Peter Amstutz 3 months ago

21540-priority-update-deadlock @ 124ffe90c721838f8d6004b7066abadf6011e5d5

developer-run-tests: #4540

The failing test revealed a bug in the initial branch. That's what tests are for!

Because the SAVEPOINT requires an enclosing transaction, the branch caused a failing test, which revealed that there were code paths where this code wasn't running inside a transaction. It's possible that not being in a transaction is a root cause of the deadlocks.

If the ordered locking is done outside a transaction, the locks are released immediately, and the priority update query then re-takes the locks in arbitrary order, potentially deadlocking with overlapping operations, which is exactly what the ordered locking is intended to prevent.
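
As a rough illustration of that failure mode (the method names are placeholders, not the actual call sites):

    # Hypothetical illustration only: placeholder method names and a
    # made-up container UUID.
    uuid = 'zzzzz-dz642-000000000000000'

    # Wrong: with no explicit transaction, the SELECT ... FOR UPDATE runs
    # in its own implicit transaction, so its row locks are released as
    # soon as the statement completes.
    row_lock_for_priority_update(uuid)   # locks already gone here
    update_priorities(uuid)              # re-locks rows in arbitrary order

    # Intended: one transaction holds the ordered locks until the priority
    # updates have run.
    ActiveRecord::Base.transaction do
      row_lock_for_priority_update(uuid)
      update_priorities(uuid)
    end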

Actions #16

Updated by Peter Amstutz 3 months ago

I think it hit a flaky workbench2 test. Try again:

developer-run-tests: #4542

Actions #17

Updated by Peter Amstutz 3 months ago

  • Release set to 70
Actions #18

Updated by Tom Clegg 3 months ago

This all LGTM except I'm not sure about this one:

I changed `for update` to `for update of containers` to make sure postgres only locked container rows and not container_request rows.

But the comment (still) says:

  # [...] This ensures we have locked
  # everything that gets touched by either a priority update or state
  # update.

I'm wary because it seems like the original intent was to lock the CR rows too. Perhaps because otherwise a concurrent CR update (like cancel) could cascade to the same container rows being updated, and cause deadlock...?

Actions #19

Updated by Peter Amstutz 3 months ago

Tom Clegg wrote in #note-18:

This all LGTM except I'm not sure about this one:

I changed `for update` to `for update of containers` to make sure postgres only locked container rows and not container_request rows.

But the comment (still) says:

[...]

I'm wary because it seems like the original intent was to lock the CR rows too. Perhaps because otherwise a concurrent CR update (like cancel) could cascade to the same container rows being updated, and cause deadlock...?

  1. Updating the priority on a container request doesn't cascade changes to other container requests, only containers
  2. It does read container requests but it doesn't update them
  3. The original intent (when I wrote it) was not to lock container requests. Otherwise the select statement would have made an effort to apply ordered locking to container requests as well
  4. The priority update cascade happens after the container request record is updated, in `after_save`. When this happens, the transaction already holds the lock on the container request row from the update (see the sketch below).
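
Roughly, as a hypothetical sketch of that ordering (model and method names are simplified placeholders, not the actual Arvados source):

    # Hypothetical sketch of points 1-4 above.
    class ContainerRequest < ActiveRecord::Base
      # after_save runs inside the same transaction as the UPDATE that
      # saved this container request, so its row is already locked here.
      after_save :propagate_priority

      def propagate_priority
        # Take ordered locks on the affected container rows, then cascade
        # the priority change to containers only; container_request rows
        # are read but never updated by this step.
        row_lock_for_priority_update(container_uuid)
        update_priorities(container_uuid)
      end
    end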
Actions #20

Updated by Peter Amstutz 3 months ago

(Also, setting priority 0 on the container doesn't directly affect the container request; it only signals the dispatcher to make another request that puts the container into Cancelled state, at which point the container request is finalized.)

Actions #21

Updated by Tom Clegg 3 months ago

That all makes sense, thanks. LGTM.

Actions #22

Updated by Peter Amstutz 3 months ago

  • Status changed from In Progress to Resolved