Bug #21547

closed

return certain database errors as 500 so they can be retried

Added by Peter Amstutz about 1 year ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: API
Target version:
Story points: -
Release relationship: Auto

Description

Certain database errors are transient. For those, we should tell the client to retry the request by returning a 500 Internal Server Error instead of 422, which is the default behavior.

#<ActiveRecord::Deadlocked: PG::TRDeadlockDetected: ERROR: deadlock detected>

Rationale: The observed deadlocks in Arvados are conflicts between two statements (a lock ordering issue), so unwinding and retrying is a reasonable solution.

#<ActiveRecord::StatementInvalid: PG::UnableToSend>

Rationale: This appears to be raised when the API server can't connect to the database.

Here's the list of PostgreSQL errors known to the PG gem:

https://github.com/ged/ruby-pg/blob/daec80f91b9519509ca1694a231f11a75cb43f7f/ext/errorcodes.def#L598

https://github.com/ged/ruby-pg/blob/daec80f91b9519509ca1694a231f11a75cb43f7f/ext/pg_errors.c#L88

Some other possible Exceptions to retry:

ConnectionBad
ConnectionException
ConnectionDoesNotExist
ConnectionFailure
TooManyConnections
CannotConnectNow
IdleSessionTimeout
ObjectInUse
LockNotAvailable
AdminShutdown
CrashShutdown

(There are many connection-related errors and I don't know the difference between them, but I included them all because they seem very likely to occur through no fault of the client.)
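As a rough illustration of the proposal above, the error names listed in this description can be treated as a retryable set and mapped to a status code. This is not the Arvados implementation: the API server uses ActiveRecord, so the real code would be Ruby, and this is only a language-neutral sketch written in Python. The names `RETRYABLE_DB_ERRORS` and `status_for_db_error` are hypothetical.

```python
# Hypothetical sketch: the error names come from the list in this description;
# the set and the function are illustrative and are not Arvados code.
RETRYABLE_DB_ERRORS = frozenset({
    "TRDeadlockDetected",
    "UnableToSend",
    "ConnectionBad",
    "ConnectionException",
    "ConnectionDoesNotExist",
    "ConnectionFailure",
    "TooManyConnections",
    "CannotConnectNow",
    "IdleSessionTimeout",
    "ObjectInUse",
    "LockNotAvailable",
    "AdminShutdown",
    "CrashShutdown",
})


def status_for_db_error(error_name: str) -> int:
    """Return 500 for errors treated as transient, else the default 422."""
    # Accept either a bare name or a Ruby-style qualified name such as
    # "PG::TRDeadlockDetected" by keeping only the last component.
    short_name = error_name.rsplit("::", 1)[-1]
    return 500 if short_name in RETRYABLE_DB_ERRORS else 422
```

Under this sketch, `status_for_db_error("PG::TRDeadlockDetected")` gives 500, while a name outside the set, such as `PG::UniqueViolation`, keeps the default 422.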


Subtasks 1 (0 open, 1 closed)

Task #21554: Review 21547-retryable-db-error (Resolved, Peter Amstutz, 01/28/2025)

Related issues 1 (0 open, 1 closed)

Related to Arvados - Bug #21540: occasional container_requests deadlock (Resolved, Peter Amstutz)
#1

Updated by Peter Amstutz about 1 year ago

  • Related to Bug #21540: occasional container_requests deadlock added
#2

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
#3

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
#4

Updated by Peter Amstutz about 1 year ago

  • Assigned To set to Peter Amstutz
#5

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-03-13 sprint to Development 2024-03-27 sprint
#6

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2024-03-27 sprint to Development 2024-04-10 sprint
#7

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2024-04-10 sprint to Development 2024-04-24 sprint
#8

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2024-04-24 sprint to Development 2024-05-08 sprint
#9

Updated by Peter Amstutz 12 months ago

  • Target version changed from Development 2024-05-08 sprint to Development 2024-05-22 sprint
#10

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-05-22 sprint to Development 2024-06-05 sprint
#11

Updated by Peter Amstutz 11 months ago

  • Target version changed from Development 2024-06-05 sprint to Future
#12

Updated by Peter Amstutz 2 months ago

  • Target version changed from Future to Development 2025-01-29
#13

Updated by Peter Amstutz 2 months ago

  • Status changed from New to In Progress
#17

Updated by Peter Amstutz 2 months ago

21547-retryable-db-error @ f24f6d7167c32dadc80f436fdbb4806d88808e0c

developer-run-tests: #4627

  • All agreed upon points are implemented / addressed. Describe changes from pre-implementation design.
    • Retries database errors. I didn't go as far as checking the PostgreSQL error directly; instead I used the generic ActiveRecord errors. I believe that is good enough, and the implementation is much simpler.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • n/a
  • Code is tested and passing, both automated and manual; what manual testing was done is described.
    • Tom helpfully contributed a test.
  • New or changed UI/UX has gotten feedback from stakeholders.
    • n/a
  • Documentation has been updated.
    • n/a
  • Behaves appropriately at the intended scale (describe intended scale).
    • Should improve scale by making Arvados more robust to certain types of database errors.
  • Considered backwards and forwards compatibility issues between client and server.
    • Returns a 500 error, which is in our list of retryable errors (_HTTP_CAN_RETRY = set([408, 409, 423, 500, 502, 503, 504]))
  • Follows our coding standards and GUI style guidelines.
    • yes
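
To show why returning 500 matters on the client side, here is a minimal sketch of a retry loop driven by the `_HTTP_CAN_RETRY` set quoted above. The helper `call_with_retries`, its parameters, and the `send_request` callable are hypothetical and are not the Arvados SDK's API.

```python
import time

# Status codes quoted in the checklist above.
_HTTP_CAN_RETRY = set([408, 409, 423, 500, 502, 503, 504])


def call_with_retries(send_request, max_tries=3, delay=0.0):
    """Call send_request() until it returns a non-retryable status.

    send_request is a hypothetical zero-argument callable returning a
    (status, body) tuple. The last response is returned once max_tries
    attempts have been used.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        status, body = send_request()
        # Stop on a non-retryable status, or when attempts are exhausted.
        if status not in _HTTP_CAN_RETRY or attempt == max_tries:
            return status, body
        # Simple exponential backoff between attempts.
        time.sleep(delay * 2 ** (attempt - 1))
```

With this loop, a 422 response is returned to the caller immediately, whereas a 500 caused by a transient database error is retried up to `max_tries` times.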
#18

Updated by Tom Clegg 2 months ago

LGTM, thanks.

#19

Updated by Peter Amstutz 2 months ago

  • Status changed from In Progress to Resolved
#20

Updated by Peter Amstutz 2 months ago

  • Release set to 75