Bug #22476


Intermittent database deadlock with high load updating containers or container_requests

Added by Peter Amstutz 10 days ago. Updated 4 days ago.

Status: In Progress
Priority: Normal
Assigned To:
Category: API
Target version:
Story points: -
Release relationship: Auto

Description

It seems that #21540 wasn't fixed in 3.0 like we thought.


Related issues (1): 0 open, 1 closed

Related to Arvados - Bug #21540: occasional container_requests deadlock (Resolved, Peter Amstutz)
Actions #1

Updated by Peter Amstutz 10 days ago

  • Position changed from -935406 to -935394
  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz 10 days ago

  • Related to Bug #21540: occasional container_requests deadlock added
Actions #3

Updated by Peter Amstutz 10 days ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 10 days ago

Container operations:

  • Container created to fulfill container request
    • takes row_lock_for_priority_update
  • Container log updated
    • takes row_lock_for_priority_update

Suspicious activity:

  • Updating container requests calls row_lock_for_priority_update @object.container_uuid but does not check whether @object.container_uuid is nil. I suspected that calling it with nil could select incorrect rows, but calling container_tree with null just seems to return a single null row.
  • There's a badly stalled query: select portable_data_hash from collections where uuid=$1

This stalled query came from the keep server, so probably keep-web?

After I killed the stalled query, things seem to have started moving again.

This is very odd.

Actions #5

Updated by Peter Amstutz 10 days ago

I'm trying to figure out where this query came from, and why it would cause a bunch of container update events to pile up.

Actions #6

Updated by Peter Amstutz 10 days ago

arvados/services/keep-web/handler.go:1127

err = db.QueryRowContext(ctx, `select portable_data_hash from collections where uuid=$1`, collection.UUID).Scan(&currentPDH)

This corresponds exactly to the suspect query.

I believe that queries in the Go API need to be closed explicitly.

From my reading of Scan(), I don't think it does that:

https://pkg.go.dev/database/sql#Rows.Scan
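For illustration, here is a sketch (not the actual keep-web code) of the two database/sql patterns in question: the multi-row QueryContext form, whose *sql.Rows pins a connection until Close() is called or the result set is fully iterated, versus the single-row QueryRowContext form used in handler.go.

package sketch

import (
	"context"
	"database/sql"
)

// Multi-row form: the returned *sql.Rows holds a database connection until
// Close() is called or the result set is fully iterated. Forgetting Close()
// here would pin the connection and whatever the server holds for it.
func lookupPDHMultiRow(ctx context.Context, db *sql.DB, uuid string) (string, error) {
	rows, err := db.QueryContext(ctx, `select portable_data_hash from collections where uuid=$1`, uuid)
	if err != nil {
		return "", err
	}
	defer rows.Close()
	var pdh string
	for rows.Next() {
		if err := rows.Scan(&pdh); err != nil {
			return "", err
		}
	}
	return pdh, rows.Err()
}

// Single-row form, matching handler.go:1127. Whether Scan() here releases the
// underlying result set is exactly the question being investigated.
func lookupPDHSingleRow(ctx context.Context, db *sql.DB, uuid string) (string, error) {
	var pdh string
	err := db.QueryRowContext(ctx, `select portable_data_hash from collections where uuid=$1`, uuid).Scan(&pdh)
	return pdh, err
}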

This gives me a tentative theory:

  1. keep-web checked the PDH of a log collection and did not close the query. This leaves a shared read lock on the row.
  2. API server went to update that log collection, and got stuck waiting for the lock on the collection. The same request was also holding a container lock.
  3. At some point, an update on the parent container request needed to take a bunch of locks, and got stuck on that container lock.
  4. As a result, lots of containers can't update, because their locks are held.
  5. API server gets severely backlogged because it has a bunch of stuck requests in its queue.

I'm not 100% sure of every step of this, but it at least suggests some areas for further study and bug fixing.

Actions #7

Updated by Peter Amstutz 10 days ago

"keep-web is holding share locks on collections for indefinite periods" is also a plausible underlying cause for other weird behavior, e.g. #20605

Actions #8

Updated by Peter Amstutz 10 days ago

Ok, the evidence for this theory isn't holding up.

a) I believe I was misreading pg_stat_activity. It does include the last query and when that query started, but if the database process state is 'idle' then it isn't actually waiting for anything. I wasn't checking the 'state' column.

b) The single-row Row.Scan() does close the query:

https://cs.opensource.google/go/go/+/refs/tags/go1.23.5:src/database/sql/sql.go;drc=6de5a7180cc6459235895c76c792a7f15be5625d;l=3480
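Regarding (a), here is a diagnostic sketch (hypothetical function; assumes the lib/pq driver and the wait_event columns available in Postgres 9.6 or later) that lists backends that are actually waiting on something, instead of idle backends showing a stale last query:

package sketch

import (
	"context"
	"database/sql"
	"log"
)

// listWaitingBackends prints backends that are in a non-idle state and are
// waiting on something (lock, I/O, ...), along with the age of their current
// query. An 'idle' backend still shows its last query in pg_stat_activity,
// but it is not waiting for anything.
func listWaitingBackends(ctx context.Context, db *sql.DB) error {
	rows, err := db.QueryContext(ctx, `
		select pid, state, wait_event_type, wait_event,
		       extract(epoch from now() - query_start) as query_age_s, query
		  from pg_stat_activity
		 where state <> 'idle' and wait_event_type is not null`)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var pid int
		var ageSeconds float64
		var state, waitType, waitEvent, query string
		if err := rows.Scan(&pid, &state, &waitType, &waitEvent, &ageSeconds, &query); err != nil {
			return err
		}
		log.Printf("pid=%d state=%s wait=%s/%s age=%.0fs query=%q", pid, state, waitType, waitEvent, ageSeconds, query)
	}
	return rows.Err()
}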

Actions #9

Updated by Peter Amstutz 10 days ago

It is still puzzling because terminating the process was correlated with the database starting to move along, but it may have been coincidence.

Actions #10

Updated by Peter Amstutz 10 days ago

Here's another theory:

arvados/lib/controller/localdb/container.go

containerPriorityUpdate

This does a bulk update of containers but doesn't do lock ordering, so it could interfere with other queries touching multiple containers.
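For comparison, a hedged sketch of what deterministic lock ordering could look like in a bulk update (hypothetical function and SQL, assuming the lib/pq driver; this is not the actual containerPriorityUpdate code). Locking the rows in a fixed order means two concurrent bulk updates acquire locks in the same sequence, rather than each grabbing half of the other's rows:

package sketch

import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

// Hypothetical bulk priority update with deterministic lock ordering; the
// real containerPriorityUpdate logic is not reproduced here.
func bulkUpdatePriority(ctx context.Context, db *sql.DB, uuids []string, priority int) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()
	// Lock the rows in sorted uuid order so concurrent bulk updates take
	// their row locks in the same sequence.
	if _, err := tx.ExecContext(ctx,
		`select 1 from containers where uuid = any($1) order by uuid for update`,
		pq.Array(uuids)); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`update containers set priority = $2 where uuid = any($1)`,
		pq.Array(uuids), priority); err != nil {
		return err
	}
	return tx.Commit()
}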

Actions #11

Updated by Peter Amstutz 10 days ago

I can't find any evidence in the logs that containerPriorityUpdate is doing anything that coincides with the deadlocks happening.

Actions #12

Updated by Peter Amstutz 9 days ago

Container updates take locks on the parent container and all child containers.

Container request updates do the same thing (except that they look up container_uuid first).

There are a handful of updates that don't do that.

Periodic updates include: cost, state, runtime_status, log PDH.

Changes to log and state can propagate outward, so they take locks.

Collection.update_cr_logs is a little bit suspicious.

Updating the log involves updating the collection and updating the container request.

It looks like updating the collection locks it for versioning. We don't actually want versioning when handling logs and output collections at this stage, so it might be a good idea to disable that, but I don't have a theory for why that would cause a deadlock.

Actions #13

Updated by Peter Amstutz 7 days ago

handle_completed locks self, but I think this is redundant with the lock we took earlier.

Retrying a request locks container requests.

Actions #14

Updated by Peter Amstutz 7 days ago

/var/www/arvados-api/current/lib/update_priorities.rb:15:in `row_lock_for_priority_update'
/var/www/arvados-api/current/lib/update_priorities.rb:54:in `block in update_priorities'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8.4/lib/active_record/connection_adapters/abstract/database_statements.rb:314:in `transaction'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8.4/lib/active_record/transactions.rb:209:in `transaction'
/var/www/arvados-api/current/lib/update_priorities.rb:52:in `update_priorities'
/var/www/arvados-api/current/app/models/container_request.rb:585:in `update_priority'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activesupport-7.0.8.4/lib/active_support/callbacks.rb:400:in `block in make_lambda'
...
api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8.4/lib/active_record/connection_adapters/abstract/database_statements.rb:316:in `transaction'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8.4/lib/active_record/transactions.rb:350:in `with_transaction_returning_status'
/var/www/arvados-api/shared/vendor_bundle/ruby/2.7.0/gems/activerecord-7.0.8.4/lib/active_record/persistence.rb:780:in `update!'
/var/www/arvados-api/current/app/controllers/application_controller.rb:125:in `update'
/var/www/arvados-api/current/app/controllers/arvados/v1/container_requests_controller.rb:59:in `update'
/var/www/arvados-

Actions #16

Updated by Peter Amstutz 7 days ago

422 Unprocessable Entity: ActiveRecord::Deadlocked: PG::TRDeadlockDetected: ERROR: deadlock detected

DETAIL: Process 5562 waits for AccessExclusiveLock on tuple (2307275,5) of relation 16562 of database 16400; blocked by process 18123.

Process 18123 waits for ShareLock on transaction 958216840; blocked by process 4847.

Process 4847 waits for ShareLock on transaction 958217012; blocked by process 5562.

HINT: See server log for query details.

Actions #17

Updated by Peter Amstutz 7 days ago

https://stackoverflow.com/questions/51070813/postgresql-deadlock-detected-select-for-update-in-transaction

One suggestion is to use the primary key in "select for update"; right now it is using uuid. There is also something about Postgres taking share locks before trying to take exclusive locks.
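A sketch of that suggestion (hypothetical SQL against the containers table; the real row_lock_for_priority_update is Ruby and is not shown here): resolve the rows via uuid, but order and lock them by the integer primary key.

package sketch

import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

// Hypothetical: take the row locks ordered by the integer primary key rather
// than by uuid. Sketch of the Stack Overflow suggestion only.
func lockContainersByPrimaryKey(ctx context.Context, tx *sql.Tx, uuids []string) error {
	_, err := tx.ExecContext(ctx,
		`select id from containers where uuid = any($1) order by id for update`,
		pq.Array(uuids))
	return err
}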

Actions #18

Updated by Peter Amstutz 7 days ago

Another possible issue here is that part of the fix in #21540 was to retry deadlock errors internally in the API server, but since that waits a random 1-10 seconds it can tie up a request handler for quite a while, causing a traffic jam. Catching certain database errors so they are retryable on the client (#21547) might handle this more gracefully.

Actions #19

Updated by Peter Amstutz 7 days ago

After a bunch of research, it seems that "waits for ShareLock on transaction" means that one transaction is waiting on another transaction (which holds a lock) rather than just the individual lock.

So I have a tentative theory that catching the deadlock and rolling back to a savepoint could be much worse than aborting the transaction, if it means transaction A continues to wait on transaction B (because rolling back would release the row lock, but not necessarily release the transaction lock).

Actions #20

Updated by Peter Amstutz 6 days ago

The prevailing advice when struggling with transaction deadlocks seems to be "suck it up and retry the query".
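A minimal sketch of that advice as a retry loop with short jittered backoff, assuming the lib/pq driver (Postgres reports deadlocks as SQLSTATE 40P01). This is a hypothetical helper, not existing Arvados code:

package sketch

import (
	"context"
	"database/sql"
	"errors"
	"math/rand"
	"time"

	"github.com/lib/pq"
)

// retryOnDeadlock re-runs the whole transaction when Postgres reports a
// deadlock (SQLSTATE 40P01), backing off briefly with jitter rather than
// tying up a handler for 1-10 seconds.
func retryOnDeadlock(ctx context.Context, db *sql.DB, attempts int, fn func(*sql.Tx) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = func() error {
			tx, err := db.BeginTx(ctx, nil)
			if err != nil {
				return err
			}
			defer tx.Rollback()
			if err := fn(tx); err != nil {
				return err
			}
			return tx.Commit()
		}()
		var pqErr *pq.Error
		if err == nil || !errors.As(err, &pqErr) || pqErr.Code != "40P01" {
			return err
		}
		// Deadlock detected: wait 50-250ms, then retry from the top.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(50+rand.Intn(200)) * time.Millisecond):
		}
	}
	return err
}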

Actions #21

Updated by Peter Amstutz 4 days ago

  • Release set to 75
Actions #22

Updated by Peter Amstutz 4 days ago

  • Target version changed from Development 2025-01-29 to Development 2025-02-12