Bug #16217

[arvados-ws] Websocket server stops processing events, but stays connected

Added by Tom Clegg 9 months ago. Updated about 2 months ago.



Sometimes, after successfully processing hundreds or thousands of events, arvados-ws goes into a state where clients don't receive any events. The EventsIn number at /status.json is static, which indicates arvados-ws isn't receiving events from PostgreSQL.

Clients can still connect and stay connected, and the once-per-minute empty "ping" message still works.

Cause is unknown.


Task #16230: Review 16217-ws-ping (Resolved, Tom Clegg)

Task #16231: Export event counters as metrics (Resolved, Tom Clegg)

Task #16232: [ops] Add arvados-ws to prometheus configs (Resolved)

Task #16309: Review 16217-ws-metrics (Resolved, Tom Clegg)

Associated revisions

Revision d85da11d
Added by Tom Clegg 9 months ago

Merge branch '16217-ws-ping'

refs #16217

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 97e8290d
Added by Tom Clegg 8 months ago

Merge branch '16217-ws-metrics'

refs #16217

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>


#1 Updated by Peter Amstutz 9 months ago

  • Target version set to 2020-03-25 Sprint

#2 Updated by Tom Clegg 9 months ago

  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress

Not sure whether this is related to the observed failures but it seems worth fixing either way. Arvados-ws does a periodic listener ping, but hasn't been checking the returned error. With this change, if the ping fails, arvados-ws will log the error and exit/restart.

16217-ws-ping @ 9ebf73b1a1229bba507057ed2fb6a39635ce7e24 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1765/

#3 Updated by Lucas Di Pentima 9 months ago

16217-ws-ping LGTM, thanks!

#4 Updated by Peter Amstutz 8 months ago

  • Target version changed from 2020-03-25 Sprint to 2020-04-08 Sprint

#5 Updated by Tom Clegg 8 months ago

Replaces the old status/debug.json stuff with Prometheus metrics. Also refactors services/ws to share service-startup code and to be distributed inside arvados-server, like controller, boot, install, dispatchcloud, etc.

16217-ws-metrics @ 8d7a94c6799f20028725c1cc00614f1f7ae01209 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1797/

16217-ws-metrics @ 8d7a94c6799f20028725c1cc00614f1f7ae01209 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1798/

16217-ws-metrics @ 8d7a94c6799f20028725c1cc00614f1f7ae01209 -- https://ci.arvados.org/view/Developer/job/developer-run-tests/1800/

#6 Updated by Lucas Di Pentima 8 months ago

This LGTM, thanks!

#7 Updated by Tom Clegg 8 months ago

  • Status changed from In Progress to Resolved

#8 Updated by Peter Amstutz about 2 months ago

  • Release set to 25
