Bug #12087
[arvados-ws] Memory leak (Closed)
Description
arvados-ws process caught using 40 GB RSS, 48 GB virt, ~75% CPU.
debug.json:
{
  "EventSource": {
    "DBStats": {
      "OpenConnections": 2
    },
    "EventsIn": 3479826,
    "EventsOut": 54709791,
    "Queue": 0,
    "QueueDelay": 0.00097,
    "QueueLimit": 4,
    "Sinks": 44,
    "SinksBlocked": 0
  },
  "HTTP": {
    "ReqsReceived": 30041,
    "ReqsActive": 45
  },
  "Outgoing": {
    "QueueCount": 191,
    "QueueMin": 0,
    "QueueMax": 64,
    "QueueTotal": 9204,
    "QueueDelayMin": 0.000161,
    "QueueDelayMax": 5.133126
  }
}
An arbitrary log entry:
{
  "RequestID": "bevbt12eapbb",
  "elapsed": 872.1006494,
  "level": "info",
  "msg": "disconnect",
  "stats": {
    "QueueDelayNs": 10791625100,
    "WriteDelayNs": 49772000,
    "EventBytes": 904645,
    "EventCount": 1519
  },
  "time": "2017-08-08T15:51:38.274834635Z"
}
A quick look at strace just shows an awful lot of clock_gettime() between useful-looking reads and writes. Perhaps some goroutines are getting stuck in busy loops?
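One way to check that hypothesis without killing the process is to dump every goroutine's stack on demand via runtime/pprof. This is a generic sketch, not anything arvados-ws exposes itself (SIGUSR1 is an arbitrary choice here); the SIGQUIT dump used below gets the same information, at the cost of terminating the process.

package main

import (
    "os"
    "os/signal"
    "runtime/pprof"
    "syscall"
)

// dumpGoroutinesOn writes a stack dump of all goroutines to stderr each
// time the given signal arrives, without stopping the process.
func dumpGoroutinesOn(sig os.Signal) {
    c := make(chan os.Signal, 1)
    signal.Notify(c, sig)
    go func() {
        for range c {
            // debug=1 prints one entry per unique stack with a goroutine
            // count, which makes busy-loop hot spots easy to spot.
            pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
        }
    }()
}

func main() {
    dumpGoroutinesOn(syscall.SIGUSR1)
    select {} // stand-in for the real server loop
}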
Updated by Tom Clegg over 7 years ago
One way for a goroutine to get stuck in a busy loop is to read from a closed channel (which returns immediately) and stay in a loop.
for {
    select {
    case <-foo:
        // do some stuff, but don't set foo to nil or break out of the loop
    default:
        // ...
    }
}
Once foo is closed, this spins forever.
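For illustration, here is a minimal runnable sketch (not from the arvados-ws source) of the usual remedies: detect the close with the comma-ok receive and either set the channel variable to nil or leave the loop.

package main

import "fmt"

func main() {
    foo := make(chan int, 2)
    foo <- 1
    foo <- 2
    close(foo)

loop:
    for {
        select {
        case v, ok := <-foo:
            if !ok {
                // foo is closed and drained; from now on this case
                // fires immediately every time, so either set
                // foo = nil or leave the loop.
                break loop
            }
            fmt.Println("got", v)
        default:
            // other periodic work; with foo closed, the select never
            // blocks and the loop spins at 100% CPU
        }
    }
}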
Updated by Tom Clegg over 7 years ago
SIGQUIT revealed many (thousands?) of goroutines like this:
2017-08-08_16:52:46.63107 goroutine 3925946 [sleep]:
2017-08-08_16:52:46.63107 time.Sleep(0x5f5e100)
2017-08-08_16:52:46.63108     /usr/local/go/src/runtime/time.go:59 +0xf9
2017-08-08_16:52:46.63108 main.(*v0subscribe).sendOldEvents(0xc47732e870, 0xc45e0bdf00)
2017-08-08_16:52:46.63108     /tmp/tmp.P695OjRw0T/src/git.curoverse.com/arvados.git/services/ws/session_v0.go:207 +0x5ed
2017-08-08_16:52:46.63108 main.(*v0session).Receive(0xc45e0bdf00, 0xc48568e000, 0x73, 0x200000, 0xc44789a2d0, 0xc47732e820)
2017-08-08_16:52:46.63109     /tmp/tmp.P695OjRw0T/src/git.curoverse.com/arvados.git/services/ws/session_v0.go:87 +0x50f
2017-08-08_16:52:46.63109 main.(*handler).Handle.func2(0x913220, 0xc460831440, 0x913fe0, 0xc428b27a70, 0xc47732e550, 0xc44789a270, 0x9129e0, 0xc45e0bdf00)
2017-08-08_16:52:46.63109     /tmp/tmp.P695OjRw0T/src/git.curoverse.com/arvados.git/services/ws/handler.go:84 +0x33a
2017-08-08_16:52:46.63109 created by main.(*handler).Handle
2017-08-08_16:52:46.63110     /tmp/tmp.P695OjRw0T/src/git.curoverse.com/arvados.git/services/ws/handler.go:91 +0x4aa
I'd say this loop needs another exit condition, in case the client hangs up while sendq is full and the queue stays full forever.
for len(sess.sendq)*2 > cap(sess.sendq) {
    // Ugly... but if we fill up the whole client
    // queue with a backlog of old events, a
    // single new event will overflow it and
    // terminate the connection, and then the
    // client will probably reconnect and do the
    // same thing all over again.
    time.Sleep(100 * time.Millisecond)
}
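One possible shape for that extra exit condition, sketched as a standalone helper: the done channel stands in for whatever disconnect signal the session actually has, so this is an assumption for illustration, not the code on the fix branch below.

package main

import (
    "fmt"
    "time"
)

// waitForDrain blocks while the send queue is more than half full, but
// gives up as soon as done is closed (e.g. the client hung up), so a
// goroutine can't sleep here forever on behalf of a dead connection.
func waitForDrain(sendq chan interface{}, done <-chan struct{}) bool {
    for len(sendq)*2 > cap(sendq) {
        select {
        case <-done:
            return false // client gone; the queue will never drain
        case <-time.After(100 * time.Millisecond):
            // poll the queue length again
        }
    }
    return true
}

func main() {
    sendq := make(chan interface{}, 4)
    sendq <- "event" // one queued event: less than half full
    done := make(chan struct{})
    fmt.Println(waitForDrain(sendq, done)) // true: nothing to wait for

    for len(sendq) < cap(sendq) {
        sendq <- "event" // fill the queue completely
    }
    close(done) // simulate the client hanging up
    fmt.Println(waitForDrain(sendq, done)) // false: gives up immediately
}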
Updated by Tom Clegg over 7 years ago
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
- Target version set to 2017-08-16 sprint
12087-ws-busy-loop @ b61a80b93308582b91e89561dc373a1ea33658a9
Updated by Tom Clegg over 7 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:e4a35f3914481226a2a4ff57618bc486839aa2ea.
Updated by Nico César over 7 years ago
Changes have been made in Puppet for qr1hi, su92l, d8kp5, and also e51c5, but Puppet is disabled until we have a downtime window to run it.