Bug #21540
closedoccasional container_requests deadlock
Description
arvados.cwl-runner: [container example] error submitting container <HttpError 422 when requesting https://zzzzz.example.com/arvados/v1/container_requests/zzzzz-xvhdp-u047nzgnc3jdkd4?alt=json returned "//railsapi.internal/arvados/v1/container_requests/zzzzz-xvhdp-u047nzgnc3jdkd4: 422 Unprocessable Entity: #<ActiveRecord::Deadlocked: PG::TRDeadlockDetected: ERROR: deadlock detected DETAIL: Process 13792 waits for AccessExclusiveLock on tuple (801533,5) of relation 16562 of database 16400; blocked by process 31312. Process 31312 waits for ShareLock on transaction 508183640; blocked by process 19685. Process 19685 waits for ShareLock on transaction 508183565; blocked by process 13792. HINT: See server log for query details. : select 1 from containers where containers.uuid in ( select pri_container_uuid from container_tree($1) UNION select container_requests.requesting_container_uuid from container_requests where container_requests.container_uuid = $1 and container_requests.state = 'Committed' and container_requests.requesting_container_uuid is not NULL ) order by containers.uuid for update > (req-4d3u0781f6rol0q7xux1)">
This is almost certainly a lock ordering issue. We should:
- Try to figure out what circumstances it happens and if it is fixable/avoidable
- Handle ActiveRecord::Deadlocked exceptions as 500 Internal Server Error so they are retried by the client
Updated by Peter Amstutz 11 months ago
- Target version changed from Development 2024-03-27 sprint to Development 2024-03-13 sprint
Updated by Peter Amstutz 11 months ago
- Target version changed from Development 2024-03-13 sprint to Future
- Subject changed from occasional container_requests deadlock to convert
Updated by Peter Amstutz 11 months ago
- Subject changed from convert to occasional container_requests deadlock
Updated by Peter Amstutz 11 months ago
- Related to Bug #21547: return certain database errors as 500 so they can be retried added
Updated by Peter Amstutz 3 months ago
- Target version changed from Future to Development 2024-11-06 sprint
Updated by Peter Amstutz 3 months ago
- Related to Bug #22165: Priority update possibly locking container requests by accident added
Updated by Peter Amstutz 3 months ago
21540-priority-update-deadlock @ 5c3bcec834b1c62f8685fff103c12da9197ea1cd
Updated by Peter Amstutz 3 months ago
21540-priority-update-deadlock @ 78e23aba707bf8d8324c045c1fdf120c2e9bc082
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
- Now catches and retries
ActiveRecord::Deadlocked
errors thrown inrow_lock_for_priority_update
- Now catches and retries
- Anything not implemented (discovered or discussed during work) has a follow-up story.
- no new tests, because I don't know how to trigger this condition
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- passes existing tests
- New or changed UX/UX and has gotten feedback from stakeholders.
- n/a
- Documentation has been updated.
- n/a
- Behaves appropriately at the intended scale (describe intended scale).
- Should help with scaling since busier clusters presumably would see this error happen more often.
- Considered backwards and forwards compatibility issues between client and server.
- no issue
- Follows our coding standards and GUI style guidelines.
- yes
I made the following changes:
- I tweaked the transaction to start with `select containers.uuid` instead of `select 1` on the off chance that `select 1` would cause the optimizer to ignore the underlying row.
- I changed `for update` to `for update of containers` to make sure postgres only locked container rows and not container_request rows.
- I wrapped the statement with SAVEPOINT/ROLLBACK/retry to handle the ActiveRecord::Deadlocked exception. I did this because I can't confirm the previous two changes actually have any effect (my reading of the postgres docs is that it probably should have been doing the right thing all along, but it's hard to tell without going deep into the database). Because this statement only has one job (acquire the set of locks so records can be updated later) it is safe to rollback and retry getting the locks.
Updated by Peter Amstutz 3 months ago
21540-priority-update-deadlock @ 124ffe90c721838f8d6004b7066abadf6011e5d5
The failing test revealed a bug in the initial branch. That's what tests are for!
Because the SAVEPOINT meant that it required a transaction and that caused a failing test, which means there were code paths where it wasn't in a transaction, it's possible that not being in a transaction could be a root cause of the deadlocks.
If it is not in a transaction when it does the ordered locking, then the locks would be released immediately, and the update priority update query would re-take the locks in arbitrary order and potentially deadlock with overlapping operations, which is exactly what the ordered locking is intended to prevent.
Updated by Peter Amstutz 3 months ago
I think it hit a flaky workbench2 test. Try again:
Updated by Tom Clegg 3 months ago
This all LGTM except I'm not sure about this one:
I changed `for update` to `for update of containers` to make sure postgres only locked container rows and not container_request rows.
But the comment (still) says:
# [...] This ensures we have locked # everything that gets touched by either a priority update or state # update.
I'm wary because it seems like the original intent was to lock the CR rows too. Perhaps because otherwise a concurrent CR update (like cancel) could cascade to the same container rows being updated, and cause deadlock...?
Updated by Peter Amstutz 3 months ago
Tom Clegg wrote in #note-18:
This all LGTM except I'm not sure about this one:
I changed `for update` to `for update of containers` to make sure postgres only locked container rows and not container_request rows.
But the comment (still) says:
[...]
I'm wary because it seems like the original intent was to lock the CR rows too. Perhaps because otherwise a concurrent CR update (like cancel) could cascade to the same container rows being updated, and cause deadlock...?
- Updating the priority on a container request doesn't cascade changes to other container requests, only containers
- It does read container requests but it doesn't update them
- The original intent (when I wrote it) was not to lock container requests. Otherwise the select statement would have made an effort to apply ordered locking to container requests as well
- The priority update cascade happens after the container request record is updated, in `after_save`. When this happens the transaction already has the container request locked from the update.
Updated by Peter Amstutz 3 months ago
(also, setting priority 0 on the container doesn't directly affect the container request, it only signals to the dispatcher to make another request which puts the container into cancelled state, at which point the container request is finalized).
Updated by Peter Amstutz 3 months ago
- Status changed from In Progress to Resolved