Websocket server

(draft for v1 -- note this is not the v0 API implemented by arvados-ws as of July 2018.)

See also:

Messages

Each message is JSON-encoded as an object with exactly one key. The key indicates the message type, and the value contains the message content.

This allows clients and servers to decode messages efficiently: decode the first token to determine the message type, then (if the message content is relevant) decode the message payload into an appropriate data structure.

good: {"error":{"code":418,"text":"I'm a teapot"}}

bad:  {"errorCode":418,"errorText":"I'm a teapot"}

Clients must ignore any unrecognized keys they encounter in the payload. This allows the server to add features without breaking existing clients.

setAuth

After establishing a connection, and before subscribing to any streams, the client must supply an authorization token.

Successful authorization is acknowledged.

client: {
          "setAuth":{"token":"3kg6k6lzmp9kj5cpkcoxie963cmvjahbt2fod9zru30k1jqdmi"}
        }

server: {
          "auth":{"uuid":"zzzzz-gj3su-077z32aux8dg2s1"}
        }

Unsuccessful authorization results in an error.

client: {
          "setAuth":{
            "token":"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}}

server: {
          "authError":{
            "errorText":"invalid or expired token"}}

subscribe

Subscribe to an event stream.

If the given ETag does not match the current ETag, the server should send an update event right away: this means the client has already missed one or more updates since the version it has cached.

client: {
          "subscribe":{
            "uuid":"zzzzz-4zz18-1g4g0vhpjn9wq7i",
            "etag":"9u32836jpz7i046sd84gu190h"}}

server: {
          "event":{
            "msgID":12345,
            "type":"update",
            "uuid":"zzzzz-4zz18-1g4g0vhpjn9wq7i",
            "etag":"1wfdizt65l5w597jf5lojf8jm"}}

When a client subscribes to a stream X, but is not authorized to read the object with UUID X (or there is no such object), the server sends an error message. This does not terminate the connection, nor does it affect any other streams.

client: {
          "subscribe":{
            "uuid":"zzzzz-tpzed-000000000000000",
            "etag":"x"}}

server: {
          "subscribeError":{
            "uuid":"zzzzz-tpzed-000000000000000",
            "errorText":"forbidden"}}

Container and job logging events

Events API → "Non-state-changing events"

client: {
          "subscribe":{
            "uuid":"zzzzz-dz642-logscontainer03",
            "etag":"2qtm62j6zb3nx5zud8b5v0ayl",
            "select":["logs.event_type","logs.properties.text"]}}

server: {
          "event":{
            "msgID":12346,
            "type":"log",
            "uuid":"zzzzz-dz642-logscontainer03",
            "etag":"2qtm62j6zb3nx5zud8b5v0ayl",
            "log":{
              "event_type":"stderr",
              "properties":{
                "text":"foo\n"}}}}

Update events

server: {
          "event":{
            "msgID":12345,
            "type":"update",
            "uuid":"zzzzz-4zz18-1g4g0vhpjn9wq7i",
            "etag":"1wfdizt65l5w597jf5lojf8jm"}}

Create events

server: {
          "event":{
            "msgID":12345,
            "type":"create",
            "uuid":"zzzzz-4zz18-1g4g0vhpjn9wq7i",
            "etag":"1wfdizt65l5w597jf5lojf8jm"}}

Delete events

server: {
          "event":{
            "msgID":12345,
            "type":"delete",
            "uuid":"zzzzz-4zz18-1g4g0vhpjn9wq7i",
            "etag":"1wfdizt65l5w597jf5lojf8jm"}}

The etag reflects the last state of the object before it was deleted.

TBD: Should the etag be omitted instead?

Note: The logs table (and the old websocket API) use(d) a different event type: "destroy".

Missed events

Zero or more events for a single stream have been skipped:

server: {
          "eventsMissed":{
            "msgID":12347,
            "uuid":"zzzzz-dz642-logscontainer03"}}

Zero or more events on one or more of the subscribed streams have been skipped:

server: {
          "eventsMissed":{
            "msgID":12348}}

Server implementation

Architecture

Go server with a goroutine serving each connection.

One goroutine receives incoming events and assigns msgID numbers.

Each connection has an outgoing event queue. Leave room for ability to resize a connection's outgoing queue dynamically, provided no subscriptions are active: this way privileged clients can request bigger queues.

Common events should be serialized once and distributed to all connections. This avoids serializing each event N times, and allows outgoing queues to share a single message buffer for a given event.

If practical, when a connection's outgoing queue fills up, send a "missed events" signal and discard all buffered events (and, of course, any incoming events that arrive while the buffer is full). After a "missed events" signal the client needs to assume its cache is out of date anyway. Expect a faster recovery from a temporary backlog if, when skipping events, we skip as many as we can.

Logging

Print JSON-formatted log entries on stderr.

Print a log entry when a client connects.

Print a log entry when a client disconnects. Show counters for:
  • Number of streams (UUIDs) added while connection was up
  • Number of streams removed
  • Number of events sent
  • Number of bytes sent
  • Total time spent waiting for Write() to return (or a better way to measure congestion?)

Libraries

Websocket: PostgreSQL:

Problems with old/current implementation

(Lessons to avoid re-learning next time...)

The Rails API server can function as a websocket server. Clients (notably Workbench, arv-mount, arv-ws) use it to listen for events without polling.

Problems with current implementation:
  • Unreliable. See #9427, #8277
  • Resource-heavy (one postgres connection per connected client, uses lots of memory)
  • Logging is not very good
  • Updates look like database records instead of API responses (e.g., computed fields are missing, collection manifest_text has no signatures)
  • Offers an API for catching up on missed events after disconnecting/reconnecting, but this API (let alone the code) isn't enough to offer a "don't miss any events, don't send any events twice" guarantee. See #9388

#8460