Bug #21540
closed
occasional container_requests deadlock
Added by Peter Amstutz about 1 year ago.
Updated 6 months ago.
Release relationship: Auto
Description
arvados.cwl-runner: [container example] error submitting container
<HttpError 422 when requesting https://zzzzz.example.com/arvados/v1/container_requests/zzzzz-xvhdp-u047nzgnc3jdkd4?alt=json returned
"//railsapi.internal/arvados/v1/container_requests/zzzzz-xvhdp-u047nzgnc3jdkd4: 422 Unprocessable Entity: #<ActiveRecord::Deadlocked: PG::TRDeadlockDetected: ERROR: deadlock detected
DETAIL: Process 13792 waits for AccessExclusiveLock on tuple (801533,5) of relation 16562 of database 16400; blocked by process 31312.
Process 31312 waits for ShareLock on transaction 508183640; blocked by process 19685.
Process 19685 waits for ShareLock on transaction 508183565; blocked by process 13792.
HINT: See server log for query details. : select 1 from containers where containers.uuid in ( select pri_container_uuid from container_tree($1) UNION select container_requests.requesting_container_uuid from container_requests where container_requests.container_uuid = $1 and container_requests.state = 'Committed' and container_requests.requesting_container_uuid is not NULL ) order by containers.uuid for update > (req-4d3u0781f6rol0q7xux1)">
This is almost certainly a lock ordering issue. We should:
- Try to figure out what circumstances it happens and if it is fixable/avoidable
- Handle ActiveRecord::Deadlocked exceptions as 500 Internal Server Error so they are retried by the client
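For illustration, here is a minimal sketch of what the second bullet might look like in a Rails controller. The class and handler names are made up for this example and this is not the actual Arvados API server code (that work was split out into #21547):

```ruby
# Illustrative only -- not the actual Arvados API server code.
# Surface database deadlocks as 500 instead of 422 so that clients which
# retry on 5xx responses will resubmit the request.
class ApplicationController < ActionController::Base
  rescue_from ActiveRecord::Deadlocked, with: :render_deadlock

  private

  def render_deadlock(exception)
    render json: {errors: [exception.message]}, status: 500
  end
end
```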
- Description updated (diff)
- Description updated (diff)
- Target version changed from Development 2024-03-27 sprint to Development 2024-03-13 sprint
- Target version changed from Development 2024-03-13 sprint to Future
- Subject changed from occasional container_requests deadlock to convert
- Subject changed from convert to occasional container_requests deadlock
- Related to Bug #21547: return certain database errors as 500 so they can be retried added
- Target version changed from Future to Development 2024-11-06 sprint
- Assigned To set to Peter Amstutz
- Related to Bug #22165: Priority update possibly locking container requests by accident added
- Status changed from New to In Progress
21540-priority-update-deadlock @ 78e23aba707bf8d8324c045c1fdf120c2e9bc082
developer-run-tests: #4531 
- All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
  - Now catches and retries `ActiveRecord::Deadlocked` errors thrown in `row_lock_for_priority_update`
- Anything not implemented (discovered or discussed during work) has a follow-up story.
  - no new tests, because I don't know how to trigger this condition
- Code is tested and passing, both automated and manual, what manual testing was done is described.
- New or changed UX/UI and has gotten feedback from stakeholders.
- Documentation has been updated.
- Behaves appropriately at the intended scale (describe intended scale).
  - Should help with scaling since busier clusters presumably would see this error happen more often.
- Considered backwards and forwards compatibility issues between client and server.
- Follows our coding standards and GUI style guidelines.
I made the following changes:
- I tweaked the transaction to start with `select containers.uuid` instead of `select 1` on the off chance that `select 1` would cause the optimizer to ignore the underlying row.
- I changed `for update` to `for update of containers` to make sure postgres only locked container rows and not container_request rows.
- I wrapped the statement with SAVEPOINT/ROLLBACK/retry to handle the ActiveRecord::Deadlocked exception (see the sketch after this list). I did this because I can't confirm the previous two changes actually have any effect (my reading of the postgres docs is that it probably should have been doing the right thing all along, but it's hard to tell without going deep into the database). Because this statement only has one job (acquire the set of locks so records can be updated later), it is safe to roll back and retry getting the locks.
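Here is a minimal sketch of the SAVEPOINT/ROLLBACK/retry pattern from the last bullet. This is not the real `row_lock_for_priority_update` implementation; the SQL is abbreviated from the query in the description, and the retry limit is invented:

```ruby
# Sketch only.  Assumes we are already inside an enclosing transaction,
# which SAVEPOINT requires.
def row_lock_for_priority_update(container_uuid)
  conn = ActiveRecord::Base.connection
  retries = 0
  begin
    conn.execute("SAVEPOINT row_lock_for_priority_update")
    # Ordered lock acquisition: select the container rows in uuid order and
    # lock only container rows ("for update of containers").
    conn.execute(
      "select containers.uuid from containers " \
      "where containers.uuid in (select pri_container_uuid from container_tree(#{conn.quote(container_uuid)})) " \
      "order by containers.uuid for update of containers")
    conn.execute("RELEASE SAVEPOINT row_lock_for_priority_update")
  rescue ActiveRecord::Deadlocked
    # The deadlock aborted only the work since the savepoint; roll back to it
    # and try taking the locks again.
    conn.execute("ROLLBACK TO SAVEPOINT row_lock_for_priority_update")
    retries += 1
    retry if retries < 5
    raise
  end
end
```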
21540-priority-update-deadlock @ 124ffe90c721838f8d6004b7066abadf6011e5d5
developer-run-tests: #4540 
The failing test revealed a bug in the initial branch. That's what tests are for!
Because using SAVEPOINT requires being inside a transaction, and that requirement caused a test failure, there must be code paths where this code was not running inside a transaction. Not being in a transaction at that point could itself be a root cause of the deadlocks.
If the ordered locking happens outside a transaction, the locks are released immediately, and the priority update query then re-takes locks in arbitrary order and can deadlock with overlapping operations, which is exactly what the ordered locking is intended to prevent.
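To illustrate the point (using the hypothetical wrapper from the sketch above, not the actual call sites): `SELECT ... FOR UPDATE` row locks are only held until the end of the enclosing transaction, so the ordered locking only helps if the subsequent updates run inside that same transaction.

```ruby
# Illustration only.  Outside a transaction the ordered locks would be gone
# by the time the priority update runs.
Container.transaction do
  row_lock_for_priority_update(container_uuid)  # ordered SELECT ... FOR UPDATE
  # ...priority update statements run here, while the ordered locks are
  # still held by this transaction...
end
```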
This all LGTM except I'm not sure about this one:
I changed `for update` to `for update of containers` to make sure postgres only locked container rows and not container_request rows.
But the comment (still) says:
# [...] This ensures we have locked
# everything that gets touched by either a priority update or state
# update.
I'm wary because it seems like the original intent was to lock the CR rows too. Perhaps because otherwise a concurrent CR update (like cancel) could cascade to the same container rows being updated, and cause deadlock...?
Tom Clegg wrote in #note-18:
This all LGTM except I'm not sure about this one:
I changed `for update` to `for update of containers` to make sure postgres only locked container rows and not container_request rows.
But the comment (still) says:
[...]
I'm wary because it seems like the original intent was to lock the CR rows too. Perhaps because otherwise a concurrent CR update (like cancel) could cascade to the same container rows being updated, and cause deadlock...?
- Updating the priority on a container request doesn't cascade changes to other container requests, only containers
- It does read container requests but it doesn't update them
- The original intent (when I wrote it) was not to lock container requests. Otherwise the select statement would have made an effort to apply ordered locking to container requests as well
- The priority update cascade happens after the container request record is updated, in `after_save`. When this happens the transaction already has the container request locked from the update.
(Also, setting priority 0 on the container doesn't directly affect the container request; it only signals the dispatcher to make a separate request that puts the container into the Cancelled state, at which point the container request is finalized.)
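As a hedged illustration of the ordering described above (this is not the real `ContainerRequest` model, and the method name is borrowed from the sketches earlier):

```ruby
# Illustration only.  The UPDATE on the container_requests row happens first
# and takes that row's lock inside the surrounding transaction; the
# after_save cascade then only needs ordered locks on containers, and never
# writes container_requests.
class ContainerRequest < ApplicationRecord
  after_save :update_priority

  def update_priority
    row_lock_for_priority_update(container_uuid)
    # ...cascade the new priority to the locked container rows...
  end
end
```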
That all makes sense, thanks. LGTM.
- Status changed from In Progress to Resolved
- Related to Bug #22476: Intermittent database deadlock with high load updating containers or container_requests added