Bug #5748

[Keep] sometimes, Keep sucks up 100% cpu and becomes really slow after a while

Added by Ward Vandewege over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Keep
Target version:
Start date:
05/06/2015
Due date:
% Done:

100%

Estimated time:
(Total: 4.00 h)
Story points:
-

Description

We have some keepstore servers that have only read-only (r/o) disks. After running for a few hours, the keepstore process becomes bloated (high RAM) and pegged at (almost) 100% CPU. It still responds to requests, but very slowly. A slow memory leak?

root     24414 60.4 59.0 2491440 2243984 ?     Sl   14:21 263:13      \_ keepstore -enforce-permissions=false -permission-key-file=/etc/keepstore/permission.key -data-manager-token-file=/etc/keepstore/datamanager.key -listen=:25107 -serialize=true -volumes=/...

All CPU time is %user, not iowait:

# iostat -x
Linux 3.16-0.bpo.2-amd64 (keep5.su92l.arvadosapi.com)     04/15/2015     _x86_64_    (1 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.24    0.00    1.13    0.21    0.00   92.42

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.30    0.02    2.09     0.54    26.92    26.07     0.04   21.36    8.53   21.49   1.72   0.36
sdb               0.00     0.02    0.06    1.48     3.94   186.27   247.27     0.17  108.47    6.69  112.33   2.07   0.32
sdc               0.00     0.02    0.05    1.49     3.18   186.98   246.86     0.17  109.06    7.23  112.59   2.08   0.32
dm-0              0.00     0.00    0.06    1.50     3.94   186.27   244.24     0.25  157.84    8.05  163.45   2.09   0.32
dm-1              0.00     0.00    0.05    1.51     3.18   186.98   243.76     0.29  184.28    8.66  190.28   2.09   0.33
sdd               0.00     0.03    0.01    2.01     0.05   255.84   253.40     0.07   35.18   14.01   35.27   2.08   0.42
dm-2              0.00     0.00    0.01    2.04     0.05   255.84   249.51     0.26  129.09   29.02  129.51   2.06   0.42

Version:

# dpkg -l |grep keepstore
ii  keepstore                               0.1.20150413210544.0f119f7    amd64        Keepstore is the Keep storage daemon, accessible to clients on the LAN


Subtasks

Task #5933: Configurable limit on block buffers allocated at a given time (Resolved, Tom Clegg)

Task #5925: Review 5748-keepstore-leak (Resolved, Brett Smith)

Task #5950: review 5748-max-buffers (Resolved, Peter Amstutz)

Task #6031: Review 5748-max-buffers-leak (Resolved)


Related issues

Blocks Arvados - Bug #5662: [FUSE] cd/ls sometimes takes too long on su92l (New, 04/03/2015)

Blocks Arvados - Bug #5713: [FUSE] File access sometimes takes too long on su92l (New, 04/13/2015)

Associated revisions

Revision e7a2db3f
Added by Tom Clegg over 6 years ago

Merge branch '5748-keepstore-leak' refs #5748

Revision 7ea61e9c
Added by Tom Clegg over 6 years ago

Merge branch '5748-max-buffers' refs #5748

Revision a709514c
Added by Tom Clegg over 6 years ago

Merge branch '5748-max-buffers-leak-TC' refs #5748

Revision c6dd0227 (diff)
Added by Tom Clegg over 6 years ago

5748: Remove runtime.GC obsoleted by bufferpool. refs #5748

Revision 5151ef79 (diff)
Added by Tom Clegg over 6 years ago

5748: Remove unused import. refs #5748

History

#1 Updated by Ward Vandewege over 6 years ago

  • Description updated (diff)

#2 Updated by Ward Vandewege over 6 years ago

  • Description updated (diff)

#3 Updated by Ward Vandewege over 6 years ago

  • Subject changed from [Keep] in a scenario where Keep can't write to its disks, it sucks up 100% cpu and becomes really slow after a while to [Keep] sometimes, Keep sucks up 100% cpu and becomes really slow after a while

#4 Updated by Ward Vandewege over 6 years ago

This doesn't appear to be related to r/o disks; I have seen it happen on tb05z as well, where all disks are writable.

While it was happening there, datamanager was trying to do its thing, failing, and restarting, over and over (due to #5834).

#5 Updated by Tom Clegg over 6 years ago

  • Target version changed from Bug Triage to 2015-05-20 sprint

#6 Updated by Tom Clegg over 6 years ago

  • Category set to Keep
  • Assigned To set to Tom Clegg
  • Priority changed from Normal to High

#7 Updated by Tom Clegg over 6 years ago

  • Status changed from New to In Progress

#8 Updated by Tom Clegg over 6 years ago

a memory use bug

Currently, keepstore will allocate as many buffers as it has clients. There is no limit on the number of concurrent clients.
  • During a PUT request, the buffer grows (up to 64 mebibytes) while receiving data from the client, and of course has to last long enough to write the block to disk. When -serialize is enabled, this interval includes the "wait for serialize mutex" phase, which can last many seconds.
  • During a successful GET request, the buffer grows (up to 64 mebibytes) while reading data from disk and lasts long enough to verify the checksum and send the data to the client. Sending the data over the network can take many seconds (e.g., network congestion, slow client, slow network on the client side of a keepproxy).

If keepstore gets busy enough to exceed the available disk bandwidth, the PUT requests waiting for the serialize mutex can exhaust available RAM and the kernel will start swapping some buffers out to disk. At this point, each additional request will make all queued requests slower and less likely to finish at all.

Additionally, if a client times out and disconnects while its request is waiting in the queue, keepstore won't notice. Abandoned PUT requests waste memory just by staying in the queue and (assuming the worst) get swapped in and written to a volume before the dead connection is noticed. Abandoned GET requests waste disk activity and, for a short time (between the read and the first Write call, which presumably fails), memory.

proposed improvement

Rather than allocating them on the fly, use a pool [1] of data buffers. The total number of buffers is configured with a command line flag. Whenever a handler needs a non-trivial buffer to store a data block -- for either PUT or GET -- it gets one from the pool. If the pool is empty, the request waits in a queue to get one.

This should be a reasonable way to limit keepstore's memory without unduly restricting operations that don't need much of it (like index, status, and GET 404).

[1] Probably based on http://golang.org/pkg/sync/#Pool -- maybe a WaitingPool, wrapping Pool with Get() and Put() methods that use a sync.Cond to ensure the size never goes above MaxSize.
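The WaitingPool idea from the footnote above can be sketched as follows. This is a hypothetical illustration under the footnote's assumptions, not the actual keepstore implementation; the names NewWaitingPool and inUse, and the constructor signature, are invented for the sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// WaitingPool wraps sync.Pool so that at most MaxSize buffers are ever
// checked out at once. Get blocks until a buffer is available, which is
// how requests "wait in a queue" when the pool is empty.
type WaitingPool struct {
	MaxSize int
	pool    sync.Pool
	cond    *sync.Cond
	inUse   int // buffers currently checked out
}

func NewWaitingPool(maxSize, bufSize int) *WaitingPool {
	p := &WaitingPool{MaxSize: maxSize}
	p.pool.New = func() interface{} { return make([]byte, bufSize) }
	p.cond = sync.NewCond(&sync.Mutex{})
	return p
}

// Get returns a buffer, waiting while MaxSize buffers are already in use.
func (p *WaitingPool) Get() []byte {
	p.cond.L.Lock()
	for p.inUse >= p.MaxSize {
		p.cond.Wait() // a PUT/GET handler queues here when the pool is empty
	}
	p.inUse++
	p.cond.L.Unlock()
	return p.pool.Get().([]byte)
}

// Put returns a buffer to the pool and wakes one waiting handler.
func (p *WaitingPool) Put(buf []byte) {
	p.pool.Put(buf)
	p.cond.L.Lock()
	p.inUse--
	p.cond.L.Unlock()
	p.cond.Signal()
}

func main() {
	// Two 64 MiB buffers total; a third Get would block until a Put.
	p := NewWaitingPool(2, 64<<20)
	a, b := p.Get(), p.Get()
	fmt.Println(len(a), len(b))
	p.Put(a)
	p.Put(b)
}
```

Because memory is the scarce resource, the cap is on buffers in circulation rather than on open connections, so cheap requests (index, status, GET 404) never wait.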

#9 Updated by Brett Smith over 6 years ago

Reviewing 4d53c7e. This is good; just a couple of cleanliness things.

  • There are a couple of references to the old Index name still lurking around: in the last sentence of the IndexTo doc comment, and in gotCall() in the MockVolume implementation.
  • In IndexTo, the comment "Print filenames beginning with prefix" seems misplaced right above a condition that skips this loop. Would it improve readability to merge that condition into the previous one? Or if not, move the comment down above the Fprintf?

Assuming those are all patched straightforwardly, I'm happy to see this merged. Thanks.

#10 Updated by Peter Amstutz over 6 years ago

Reviewing 5748-max-buffers 978afd4

  • The default maxBuffers = 128 seems excessive (that's 8 GiB). With serialize=true, it should probably be a small multiple of the number of volumes (between 2 and 4 buffers per volume), since we can't do more concurrent work than we have volumes, and the goal here is to prevent keepstore from overcommitting on the number of requests it can handle at once (since it Gets the block from the buffer pool before taking the volume serialize lock).
  • Suggest moving bufferpool into the Go SDK, it might be useful for other services such as keepproxy.

#11 Updated by Tom Clegg over 6 years ago

Peter Amstutz wrote:

Reviewing 5748-max-buffers 978afd4

  • The default maxBuffers = 128 seems excessive (that's 8 GiB). With serialize=true, it should probably be a small multiple of the number of volumes (between 2 and 4 buffers per volume), since we can't do more concurrent work than we have volumes, and the goal here is to prevent keepstore from overcommitting on the number of requests it can handle at once (since it Gets the block from the buffer pool before taking the volume serialize lock).

Not exactly. The goal here is to limit memory use, and memory (unlike the serialize lock) stays occupied longer when clients send/receive slowly. If you have 8 GiB of RAM, and you are waiting for 100 clients to receive the response data for their GET requests, you've got 6.4GiB of block data stored in buffers but the serialize locks are all free. Similarly, while you are waiting to receive PUT body content from 100 clients, you need buffers but you don't need any serialize locks yet. Essentially the minimum buffers:spindles ratio corresponds to the diskspeed:networkspeed ratio, and (especially since keepproxy pipelines reads instead of buffering them) that ratio can easily be much higher than 2 or 4. (If we could assume/detect that the network bottleneck is the keepstore node itself, it would be reasonable to stall new clients to help older clients finish, but we can't/don't.)
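To put rough numbers on that diskspeed:networkspeed argument (the figures below are hypothetical, not measurements from this ticket, and buffersNeeded is a name invented for the sketch):

```go
package main

import "fmt"

// buffersNeeded estimates how many buffers stay occupied when the disks
// can deliver diskMBps in aggregate but each client drains its buffer at
// only clientMBps: each buffer is held for the whole client transfer.
func buffersNeeded(diskMBps, clientMBps float64) float64 {
	return diskMBps / clientMBps
}

func main() {
	// Hypothetical figures: a node whose disks read 400 MB/s in
	// aggregate, serving remote clients that each receive 10 MB/s.
	fmt.Println(buffersNeeded(400, 10)) // 40: far above 2-4 per volume
}
```

With slow clients holding buffers through the whole transfer, the useful buffer count tracks the bandwidth ratio rather than the spindle count, which is why a per-volume multiple of 2 to 4 can starve the disks.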

I thought "total system RAM ÷ block size" would be a good default, and a way to specify "50%" like sort(1) (perhaps based on https://godoc.org/github.com/cloudfoundry/gosigar#Mem.Get) would probably be handy, but rather than overthink/predict this I figured we should start with a simple config knob and a probably-not-too-low default, and get operational feedback.

I figured it would be better to err on the "too big" side because the sysadmin story "keepstore is using too much ram" seems easier to understand than "clients are starved but keepstore looks idle". (Hm, perhaps the "waiting..." message should recommend increasing -max-buffers?)

  • Suggest moving bufferpool into the Go SDK, it might be useful for other services such as keepproxy.

True. I thought about this too, but figured we can do those things together (move it to SDK and adopt it in keepproxy) / we might confirm how well it works before we arrange too much more code around it...

#12 Updated by Peter Amstutz over 6 years ago

Tom Clegg wrote:

I figured it would be better to err on the "too big" side because the sysadmin story "keepstore is using too much ram" seems easier to understand than "clients are starved but keepstore looks idle". (Hm, perhaps the "waiting..." message should recommend increasing -max-buffers?)

That's true; it's easy enough to configure the usage. The important thing will be to communicate to our admins that setting -max-buffers is not really optional, if keepstore isn't going to be smart about adjusting its default buffer count to the RAM size of the system.

  • Suggest moving bufferpool into the Go SDK, it might be useful for other services such as keepproxy.

True. I thought about this too, but figured we can do those things together (move it to SDK and adopt it in keepproxy) / we might confirm how well it works before we arrange too much more code around it...

Ok, we should file a separate story for that, I think there have been some issues with memory usage of Keepproxy as well.

#13 Updated by Tom Clegg over 6 years ago

Peter Amstutz wrote:

That's true; it's easy enough to configure the usage. The important thing will be to communicate to our admins that setting -max-buffers is not really optional, if keepstore isn't going to be smart about adjusting its default buffer count to the RAM size of the system.

True. Added a note to http://doc.arvados.org/install/install-keepstore.html. (Probably a good idea to point this out even if/when we do default to total system RAM, in case the node is expected to have memory available for anything else.)

Ok, we should file a separate story for that, I think there have been some issues with memory usage of Keepproxy as well.

Added #5962.

#14 Updated by Tom Clegg over 6 years ago

  • Status changed from In Progress to Feedback

#15 Updated by Peter Amstutz over 6 years ago

  • Status changed from Feedback to In Progress

Aside from the above comments, LGTM; please merge.

#16 Updated by Tom Clegg over 6 years ago

  • Status changed from In Progress to Resolved
