Bug #12087

[arvados-ws] Memory leak

Added by Tom Clegg about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
API
Target version:
Start date:
08/08/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

arvados-ws process caught using 40GB rss, 48GB virt, ~75% cpu.

debug.json:

{
  "EventSource": {
    "DBStats": {
      "OpenConnections": 2
    },
    "EventsIn": 3479826,
    "EventsOut": 54709791,
    "Queue": 0,
    "QueueDelay": 0.00097,
    "QueueLimit": 4,
    "Sinks": 44,
    "SinksBlocked": 0
  },
  "HTTP": {
    "ReqsReceived": 30041,
    "ReqsActive": 45
  },
  "Outgoing": {
    "QueueCount": 191,
    "QueueMin": 0,
    "QueueMax": 64,
    "QueueTotal": 9204,
    "QueueDelayMin": 0.000161,
    "QueueDelayMax": 5.133126
  }
}

An arbitrary log entry:

{
  "RequestID": "bevbt12eapbb",
  "elapsed": 872.1006494,
  "level": "info",
  "msg": "disconnect",
  "stats": {
    "QueueDelayNs": 10791625100,
    "WriteDelayNs": 49772000,
    "EventBytes": 904645,
    "EventCount": 1519
  },
  "time": "2017-08-08T15:51:38.274834635Z" 
}

A quick look at strace just shows an awful lot of clock_gettime() between useful-looking reads and writes. Perhaps some goroutines are getting stuck in busy loops?


Subtasks

Task #12088: Review 12087-ws-busy-loopResolvedTom Clegg

Associated revisions

Revision e4a35f39
Added by Tom Clegg about 2 years ago

Merge branch '12087-ws-busy-loop'

closes #12087

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg about 2 years ago

One way for a goroutine to get stuck in a busy loop is to read from a closed channel (which returns immediately) and stay in a loop.

for {
  select {
  case <-foo:
    // do some stuff, but don't set foo to nil or break out of the loop
  default:
    // ...
  }
}

If foo closes, this spins forever.

#2 Updated by Tom Clegg about 2 years ago

SIGQUIT revealed many (thousands?) of goroutines like this

2017-08-08_16:52:46.63107 goroutine 3925946 [sleep]:
2017-08-08_16:52:46.63107 time.Sleep(0x5f5e100)
2017-08-08_16:52:46.63108       /usr/local/go/src/runtime/time.go:59 +0xf9
2017-08-08_16:52:46.63108 main.(*v0subscribe).sendOldEvents(0xc47732e870, 0xc45e0bdf00)
2017-08-08_16:52:46.63108       /tmp/tmp.P695OjRw0T/src/git.curoverse.com/arvados.git/services/ws/session_v0.go:207 +0x5ed
2017-08-08_16:52:46.63108 main.(*v0session).Receive(0xc45e0bdf00, 0xc48568e000, 0x73, 0x200000, 0xc44789a2d0, 0xc47732e820)
2017-08-08_16:52:46.63109       /tmp/tmp.P695OjRw0T/src/git.curoverse.com/arvados.git/services/ws/session_v0.go:87 +0x50f
2017-08-08_16:52:46.63109 main.(*handler).Handle.func2(0x913220, 0xc460831440, 0x913fe0, 0xc428b27a70, 0xc47732e550, 0xc44789a270, 0x9129e0, 0xc45e0bdf00)
2017-08-08_16:52:46.63109       /tmp/tmp.P695OjRw0T/src/git.curoverse.com/arvados.git/services/ws/handler.go:84 +0x33a
2017-08-08_16:52:46.63109 created by main.(*handler).Handle
2017-08-08_16:52:46.63110       /tmp/tmp.P695OjRw0T/src/git.curoverse.com/arvados.git/services/ws/handler.go:91 +0x4aa

I'd say this loop needs another exit condition, in case the client hangs up while sendq is full and the queue stays full forever.

        for len(sess.sendq)*2 > cap(sess.sendq) {
            // Ugly... but if we fill up the whole client
            // queue with a backlog of old events, a
            // single new event will overflow it and
            // terminate the connection, and then the
            // client will probably reconnect and do the
            // same thing all over again.
            time.Sleep(100 * time.Millisecond)
        }

#3 Updated by Tom Clegg about 2 years ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg
  • Target version set to 2017-08-16 sprint

#4 Updated by Peter Amstutz about 2 years ago

LGTM.

#5 Updated by Tom Clegg about 2 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:e4a35f3914481226a2a4ff57618bc486839aa2ea.

#6 Updated by Nico C├ęsar about 2 years ago

changes have been made in puppet for qr1hi su92l d8kp5 also e51c5 but puppet is disabled, until we have a downtime window to run it.

Also available in: Atom PDF