Bug #17926

[controller] lib/pq 1.3.0 does not handle stale db connections properly (Aurora RDS)

Added by Ward Vandewege 2 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 07/20/2021
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

Context: Arvados cluster with Aurora RDS as db backend.

Symptom: After the cluster has been idle for a while, a fresh login fails with a "broken pipe" error. The controller logs show:


{"PID":14505,"RequestID":"req-22mvdy7j9r6di9xzn6os","level":"info","msg":"response","remoteAddr":"127.0.0.1:47966","reqBytes":38,"reqForwardedFor":"1.2.3.4","reqHost":"somewhere.over.the.rainbow","reqMethod":"POST","reqPath":"arvados/v1/users/authenticate","reqQuery":"","respBody":"{\"errors\":[\"write tcp 9.1.2.3:57210-\\u003e5.6.7.8:5432: write: broken pipe\"]}\n","respBytes":91,"respStatus":"Internal Server Error","respStatusCode":500,"time":"2021-07-20T15:57:14.887346237Z","timeToStatus":0.177528,"timeTotal":0.177538,"timeWriteBody":0.000018}

Likely cause: a bug in `lib/pq`, as described here: https://blog.bossylobster.com/2020/12/broken-pipe.html

The fix has been merged and is available in version 1.10.0 and up, but we are on version 1.3.0.


Subtasks

Task #17927: review 17962-bump-lib-pq (Resolved, Peter Amstutz)

History

#1 Updated by Ward Vandewege 2 months ago

  • Status changed from New to In Progress

#2 Updated by Ward Vandewege 2 months ago

  • Description updated (diff)

#3 Updated by Peter Amstutz 2 months ago

  • Release set to 41

#4 Updated by Ward Vandewege 2 months ago

  • Description updated (diff)

#6 Updated by Peter Amstutz 2 months ago

Ward Vandewege wrote:

Ready for review at 004f220a006e4e9716ad6f229e5e3721090d44f0 on branch 17962-bump-lib-pq

Tests passed at https://ci.arvados.org/view/Developer/job/developer-run-tests/2598/

LGTM

#7 Updated by Ward Vandewege 2 months ago

Fix is merged (though I typo'd the issue number in the git commits as 17962 instead of 17926...), waiting for confirmation that it fixes the problem.

#8 Updated by Ward Vandewege 2 months ago

  • Status changed from In Progress to Resolved

The fix appears to work: the bug is no longer observed. Resolving this ticket.
