Bug #16217

[arvados-ws] Websocket server stops processing events, but stays connected

Added by Tom Clegg 28 days ago. Updated 1 day ago.

Status: In Progress
Priority: Normal
Assigned To:
Category: API
Target version:
Start date: 03/12/2020
Due date:
% Done: 33%
Estimated time: (Total: 0.00 h)
Story points: -

Description

Sometimes, after successfully processing hundreds or thousands of events, arvados-ws goes into a state where clients don't receive any events. The EventsIn number at /status.json is static, which indicates arvados-ws isn't receiving events from PostgreSQL.

Clients can still connect and stay connected, and the once-per-minute empty "ping" message still works.

Cause is unknown.
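
The symptom described above (an EventsIn value at /status.json that stops changing) could be checked for mechanically while the cause is being investigated. The following is only a rough sketch: it assumes EventsIn appears as a top-level integer field in the JSON body, which may not match the real layout, and the helper names parseEventsIn and stalled are hypothetical. A static counter can also mean the cluster is simply idle, so a check like this can only flag a suspected stall.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// status holds the one field this sketch reads.
// Assumption: /status.json exposes EventsIn as a top-level integer.
type status struct {
	EventsIn uint64 `json:"EventsIn"`
}

// parseEventsIn extracts the EventsIn counter from a status.json body.
func parseEventsIn(body []byte) (uint64, error) {
	var s status
	if err := json.Unmarshal(body, &s); err != nil {
		return 0, err
	}
	return s.EventsIn, nil
}

// stalled reports whether the last n samples are all equal, i.e. the
// counter did not move across n consecutive polls. It returns false
// when fewer than n samples are available or n < 2.
func stalled(samples []uint64, n int) bool {
	if n < 2 || len(samples) < n {
		return false
	}
	recent := samples[len(samples)-n:]
	for _, v := range recent[1:] {
		if v != recent[0] {
			return false
		}
	}
	return true
}

func main() {
	// Made-up response body, for illustration only.
	body := []byte(`{"EventsIn": 1234}`)
	n, err := parseEventsIn(body)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Pretend the same value came back on three consecutive polls.
	samples := []uint64{1200, n, n, n}
	fmt.Println("suspected stall:", stalled(samples, 3))
}
```

In practice the samples would come from polling /status.json at a fixed interval, and the window size would need to be long enough that ordinary quiet periods do not trigger it.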


Subtasks

Task #16230: Review 16217-ws-ping (Resolved, Tom Clegg)

Task #16231: Export event counters as metrics (In Progress, Tom Clegg)

Task #16232: [ops] Add arvados-ws to prometheus configs (New)

Associated revisions

Revision d85da11d
Added by Tom Clegg 21 days ago

Merge branch '16217-ws-ping'

refs #16217

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Peter Amstutz 27 days ago

  • Target version set to 2020-03-25 Sprint

#2 Updated by Tom Clegg 25 days ago

  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress

Not sure whether this is related to the observed failures but it seems worth fixing either way. Arvados-ws does a periodic listener ping, but hasn't been checking the returned error. With this change, if the ping fails, arvados-ws will log the error and exit/restart.

16217-ws-ping @ 9ebf73b1a1229bba507057ed2fb6a39635ce7e24 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1765/

#3 Updated by Lucas Di Pentima 21 days ago

16217-ws-ping LGTM, thanks!

#4 Updated by Peter Amstutz 8 days ago

  • Target version changed from 2020-03-25 Sprint to 2020-04-08 Sprint

#5 Updated by Tom Clegg 1 day ago

Replaces the old status/debug.json stuff with Prometheus metrics. Also refactors services/ws to share service-startup code and to be distributed inside arvados-server, like controller, boot, install, dispatchcloud, etc.

16217-ws-metrics @ 8d7a94c6799f20028725c1cc00614f1f7ae01209 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1797/

16217-ws-metrics @ 8d7a94c6799f20028725c1cc00614f1f7ae01209 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1798/

16217-ws-metrics @ 8d7a94c6799f20028725c1cc00614f1f7ae01209 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1800/
