Bug #7329
closed[Keep] Keepstore wastes lots of cycles handling a request for the empty block
Description
On many Keep servers on qr1hi right now, we see the keepstore process using all of 1 CPU, almost evenly split between system and user calls. All of these keepstore processes have the empty block file open, and are keeping it open. All of the CPU thrashing is apparently degrading performance for subsequent requests; running arv-gets are very slow, until you restart the keepstores, and then they run fine again.
Three Keep servers do not have the empty block (5-7). They are performing fine.
This looks like a regression introduced to Keep between 5dd20e4 and 9a34f5e.
The branch that fixes this is expected to include at least one failing test to tickle the bug. The test might fail by causing the tested code to enter an infinite loop.
Updated by Brett Smith over 9 years ago
- Target version set to Arvados Future Sprints
Updated by Nico César over 9 years ago
- Target version deleted (
Arvados Future Sprints)
just an update:
keep6.qr1hi has (old version)
keep6.qr1hi:/home/nico# apt-cache policy keepstore|grep Installed Installed: 0.1.20150811175109.5dd20e4
(puppet agent didn't run because /var/lib/puppet/state/agent_catalog_run.lock exists)
but /data/qr1hi-keep-2/keep/d41/d41d8cd98f00b204e9800998ecf8427e is there
That's not wedged
Updated by Nico César over 9 years ago
- Target version set to Arvados Future Sprints
Updated by Nico César over 9 years ago
did a manual downgrade and restarted the service until we have a newer version without the problem
apt-get install keepstore=0.1.20150811175109.5dd20e4 sv restart keepstore
Updated by Ward Vandewege over 9 years ago
Yeah... that's a fairly sizable diff:
$ git diff 5dd20e4...9a34f5e services/keepstore |wc -l 1596
Updated by Brett Smith over 9 years ago
- Target version deleted (
Arvados Future Sprints)
Ward Vandewege wrote:
Yeah... that's a fairly sizable diff:
A lot of them are comments or test updates, though. Here are the only commits that actually touch code in that range:
Updated by Radhika Chippada over 9 years ago
- Status changed from New to In Progress
Updated by Radhika Chippada over 9 years ago
commit c447d4a79f653bbf2c172a0a715d30db896a4a32
It seems like the issue is happening in UnixVolume -> Compare method. When an empty block file is read, the EOF is not being returned. Due to this the Compare method is falling in an infinite loop.
This seems like a go issue. However, I did not find out any information as to why EOF is being returned. I noticed that EOF is returned for all other cases involving non-empty blocks.
For the time being, I put in a quick fix with an exclusive else block to handle the case where the empty block is being compared. This helped me continue writing tests (which block without the code update if you would like to see the error).
Also, wrote several tests. Added empty block tests for volume_unix_test.go and keepstore_test.go. Also, added a brand new integration test for keepstore using unix volumes since we do not have any such tests. This should greatly help writing more and more integration test with unix volumes, rather than using mock volumes.
Also made a couple gofmt and golint updates in the keepstore implementation.
Updated by Tom Clegg about 9 years ago
I think the issue here is that rdr.Read(buf)
doesn't return EOF if len(buf)==0
-- which seems reasonable (as in "our bug, not Go's bug"), in that you shouldn't expect to reach EOF if you don't advance your position in the file.
I've committed an alternate fix in 9bd1f60: never use a non-empty buffer, regardless of file size. This way we avoid the infinite loop any time the file on disk is empty -- even if expect
is not empty. (Also, this way, the rest of the loop logic just works as is: we can still handle "file content doesn't match expected content" for zero-length files without special cases.)
TestKeepStoreGetEmptyBlockNotExists
seems to require that GetBlock return an error for "get empty block" when nobody has written the empty block yet. Is there some reason we want to guarantee that behavior? It seems to me that "return a zero-length block" should also be considered correct. Stepping back, we're using mock volumes in these tests, so we're testing that we don't incorrectly handle a special case that we don't actually handle as a special case at all.
keepstore_integration_unix_volume_test.go
seems to have a lot of code duplication, and no integration tests. AFAICT the bug at hand is "if file on disk is empty, everything goes wrong", and "file on disk is empty" sounds like the sort of failure mode we can expect other volume types to encounter as well. I've added a bunch of "empty block" tests to volume_generic_test.go
in 67c6f66. With that, are the new tests in keepstore_integration_unix_volume_test.go
still relevant? The new tests in volume_unix_test.go
also seem partly (perhaps even wholly?) redundant. If they're testing certain code paths or behaviors that can't be covered by the generic tests, we should add comments to make it obvious what those are...
Updated by Radhika Chippada about 9 years ago
Tom, thanks for the alternate fix.
Regarding your comment about tests being redundant: per our discussion, I added functional generic tests that can be tested with any volume implementation, that use PutBlock and GetBlock for testing. With that in place, I removed the redundant tests as well as the keepstore_integration_unix_volume_test.go file.
Thanks.
Updated by Tom Clegg about 9 years ago
"A TestableVolumeManagerFactory creates a volume manager with one or more TestableVolumes." The tests seem to expect it to return exactly two testable volumes. Perhaps it should say "at least two" and the "defer Teardown" statements should be a loop:
- defer testableVolumes[0].Teardown() - defer testableVolumes[1].Teardown() + for _, v := range testableVolumes { + defer v.Teardown() + }
I think it would be less error-prone to have the Factory return the VolumeManager as well as the slice of testable volumes, and assign the VolumeManager to the global KeepVM in testGetBlock etc instead of in the Factory. (It's unfortunate that KeepVM is a global in the first place, but at least we can keep the "change global to X for the duration of a test" stuff inside the relevant test function.)
(As discussed in chat) rename volume_keepstore_generic_test.go to handler_with_generic_volume_test.go
Updated by Radhika Chippada about 9 years ago
Addressed all those comments in commit fdbf2ead3452bf7c8ba8f3f275cf486339b5406e
Updated by Radhika Chippada about 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:33596fce13c3bd5c368f5efeb89d625f684a622d.