Bug #9427

closed

[Websockets] event queue backlogged

Added by Peter Amstutz almost 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: -
Target version: 2016-06-22 sprint
Story points: -

Description

The websockets server currently uses the EventMachine library for multiplexing events. This is a single-threaded event queue (despite the fact that we use it with Puma, which is a thread-per-connection web server).

We're seeing severe lag, where new connections respond slowly or not at all. This seems to be correlated with connections that are using the "catch up" functionality to replay past events. Instrumenting the server shows the following:

2016-06-16_20:30:05.76986 #<Faye::WebSocket:0x000000049298d0> sending 28643742
2016-06-16_20:30:05.87824 #<Faye::WebSocket:0x000000049298d0> sent 28643742
2016-06-16_20:30:05.87832 #<Faye::WebSocket:0x000000049298d0> sending 28643743
2016-06-16_20:30:05.99874 #<Faye::WebSocket:0x000000049298d0> sent 28643743
2016-06-16_20:30:05.99882 #<Faye::WebSocket:0x000000049298d0> sending 28643745
2016-06-16_20:30:06.10723 #<Faye::WebSocket:0x000000049298d0> sent 28643745
2016-06-16_20:30:06.10731 #<Faye::WebSocket:0x000000049298d0> sending 28643746
2016-06-16_20:30:06.21399 #<Faye::WebSocket:0x000000049298d0> sent 28643746
2016-06-16_20:30:06.21407 #<Faye::WebSocket:0x000000049298d0> sending 28643749
2016-06-16_20:30:06.32665 #<Faye::WebSocket:0x000000049298d0> sent 28643749
2016-06-16_20:30:06.32675 #<Faye::WebSocket:0x000000049298d0> sending 28643750
2016-06-16_20:30:06.43588 #<Faye::WebSocket:0x000000049298d0> sent 28643750

In this trace, each invocation of websocket.send() takes about 100ms, probably because the client connection is slow or backlogged. Because all events are currently processed in a single thread, the websocket server can't scale past the speed of the slowest client.

This may also be compounded by a related performance problem: every event has to go through a permission check to determine whether the user is allowed to read it, and as the number of groups readable by a user grows, this check becomes expensive.

Proposed solution:

We need a thread per connection for blocking operations (accessing the database, sending events).

It would be nice if we could create an EventMachine "reactor" per thread, but this doesn't appear to be possible.

EventMachine has thread pool functionality via "defer". We may be able to make use of that instead of next_tick.
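
A minimal sketch of the difference, assuming a hypothetical filter_and_send method that does the blocking work (permission check, websocket.send); ws and event stand in for a connection and a queued event:

require 'eventmachine'

EM.run do
  # next_tick runs the block on the single reactor thread, so one slow
  # client stalls every other connection:
  EM.next_tick { filter_and_send(ws, event) }

  # defer hands the block to EventMachine's built-in thread pool instead,
  # leaving the reactor free to service other connections:
  EM.defer { filter_and_send(ws, event) }
end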


Subtasks 1 (0 open, 1 closed)

Task #9432: Review 9427-threaded-websockets (Resolved, Peter Amstutz, 06/17/2016)

Related issues

Related to Arvados - Bug #9388: [Websockets] events are skipped (Resolved, Peter Amstutz, 06/10/2016)
#1

Updated by Peter Amstutz almost 8 years ago

  • Subject changed from [Websockets] event queue gets backlogged to [Websockets] event queue backlogged
  • Description updated (diff)
#2

Updated by Peter Amstutz almost 8 years ago

  • Description updated (diff)
#3

Updated by Peter Amstutz almost 8 years ago

  • Description updated (diff)

Design sketch

Option 1:

We are using Puma, which is a thread-per-connection server. After hijacking the socket and initializing the websocket connection, set up a threadsafe Queue (akin to a Go channel). The EventMachine thread continues to handle receiving messages from the websockets and passes them to the appropriate queue. The thread at the other end of the queue generates the message response or publishes the notification. Having a single outgoing queue per connection means events on a single websocket are trivially serialized. The drawback is that the number of connections is limited by the number of threads.
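
A rough sketch of Option 1, with illustrative names (ConnectionHandler, ws, event); the real implementation would also apply filters and permission checks before sending:

require 'thread'

class ConnectionHandler
  def initialize(ws)
    @ws = ws
    @queue = Queue.new                  # threadsafe, roughly a Go channel
    @sender = Thread.new { sender_loop }
  end

  # Called from the EventMachine receive path; enqueue and return immediately.
  def push(event)
    @queue << event
  end

  private

  # Runs on the dedicated per-connection thread, so a blocking database read
  # or a slow ws.send only delays this connection, not the reactor.
  def sender_loop
    while (event = @queue.pop)
      @ws.send(event.to_json)
    end
  end
end

Because the EventMachine thread only ever calls push, a slow client blocks its own sender thread and nothing else.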

Option 2:

The current design sets up the websocket connection, hands it off to EventMachine, and returns the request thread to Puma. EventMachine supports its own thread pool, so message responses and notification publishing can be dispatched to it. The benefit is the ability to support more connections than active threads. The drawback is more complex concurrency: several outgoing messages on the same connection could be dispatched to multiple threads, which would have to negotiate locking and queuing.
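
A corresponding sketch of Option 2, again with illustrative names; a per-connection mutex is one way the pool threads could negotiate the ordering of writes:

require 'eventmachine'

class Subscription
  def initialize(ws)
    @ws = ws
    @send_lock = Mutex.new
  end

  # Dispatch the event to EventMachine's thread pool; the mutex keeps two
  # pool threads from interleaving sends on the same websocket.
  def notify(event)
    EM.defer do
      @send_lock.synchronize { @ws.send(event.to_json) }
    end
  end
end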

#4

Updated by Tom Clegg almost 8 years ago

Option 1 sounds pretty good.

Without some safeguard, a client that receives too slowly would grow its outbound queue until the server goes OOM. Should we deal with this by disconnecting any client whose server-side queue has exceeded some configurable limit?

#5

Updated by Peter Amstutz almost 8 years ago

  • Assigned To set to Peter Amstutz
  • Target version set to 2016-06-22 sprint
#6

Updated by Tom Clegg almost 8 years ago

Can we also check the queue size when we send to the queue, instead of just when we receive from it? Once the queue size is 1000, there's no point adding anything else to it, because next time the client thread wakes up it will throw everything away and disconnect.

Max queue size should be configurable.

Should probably exit the socket event loop after sending the 500 error. (If I'm following correctly, current code will try to send a 500 error for each queued event until the queue size drops below threshold, which I assume is not what you intended to do after closing the socket...)
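
For illustration only, a push-side check along these lines (the limit, error payload, and names are made up, not the actual branch code):

# Drop the connection as soon as the backlog limit is hit, instead of
# sending one 500 per queued event after the socket is already closed.
def push(event)
  return if @closed
  if @queue.size >= @max_notify_backlog      # configurable limit
    @ws.send({status: 500, message: 'notify backlog too long'}.to_json)
    @ws.close
    @closed = true                           # sender loop checks this and exits
  else
    @queue << event
  end
end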

#7

Updated by Tom Clegg almost 8 years ago

Max connection limit should be configurable.

Is it possible any of this is testable?

#8

Updated by Tom Clegg almost 8 years ago

With the caveat that it's feeling very Friday-5pm-ish right now... other than the above points this LGTM. It will be great to get out from under this "client A has to wait for client B's slow network" nonsense. Thanks.

#9

Updated by Peter Amstutz almost 8 years ago

9427-threaded-websockets @ 27b18bf

  • Adds configuration variables websocket_max_connections, websocket_max_notify_backlog, websocket_max_filters, and websocket_db_pool
  • Adds initializer to adjust database connection pool size when running multithreaded websockets
  • Check backlog size and close backlogged connections when pushing, not popping
  • Increase wait time for tests (there seems to be some delay on the initial websocket subscription, possibly due to creating new database connections).
#10

Updated by Tom Clegg almost 8 years ago

I'm not keen on the amount of reaching into ActiveRecord's guts for the sake of having a different pool size depending on whether websocket server is enabled. If someone really wants different pool sizes running from the same config files, shouldn't they just put this in database.yml?

production:
  adapter: postgresql
  ...
  pool: 128

websocket:
  adapter: postgresql
  ...
  pool: 256

At the very least, the default in application.default.yml should be zero and signify "just use what I've configured in database.yml". But it would be even better if we could avoid the "application.yml overrides database.yml" situation entirely.

#11

Updated by Peter Amstutz almost 8 years ago

Tom Clegg wrote:

I'm not keen on the amount of reaching into ActiveRecord's guts for the sake of having a different pool size depending on whether websocket server is enabled. If someone really wants different pool sizes running from the same config files, shouldn't they just put this in database.yml?

[...]

At the very least, the default in application.default.yml should be zero and signify "just use what I've configured in database.yml". But it would be even better if we could avoid the "application.yml overrides database.yml" situation entirely.

Since "pool" in database.yml apparently does work, I backed this out and added a comment to database.yml.example instead. Now @ fa615bf

#12

Updated by Tom Clegg almost 8 years ago

LGTM, thanks!

#13

Updated by Peter Amstutz almost 8 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:f17a26ca512ae0083ea5ad608ad6cfbb7fd247ee.
