Story #6260

[Keep] Integration test between data manager, API and keepstore.

Added by Peter Amstutz over 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
Keep
Target version:
Start date:
08/19/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0

Description

Write one or more integration tests to verify that Data Manager's existing delete functionality (i.e., deletes blocks that are unreferenced; does not do anything about overreplicated blocks) works as desired:

  • Verify that blocks not referenced in a collection are deleted from keepstore
  • Verify that all blocks referenced from collections, and all blocks newer than the block signature TTL, are never deleted from keepstore
Minimal test, covering a miniature version of normal operation:
  1. bring up api and keepstore services (just like we do already in keepproxy_test.go)
  2. store some collections (with non-zero data)
  3. store some data blocks without referencing them in any collection
  4. back-date the block Mtimes so that keepstore will consider unreferenced data old enough to delete
  5. write some "transient" blocks as well, this time without back-dating their Mtimes
  6. get block index from all keepstores
  7. run data manager in "single run" mode
  8. wait for all keepstores to finish working their trash and pull lists (i.e., /status.json reports status["PullQueue"]["Queued"]==0 && status["PullQueue"]["InProgress"]==0 && ...)
  9. get block index from all keepstores, make sure nothing has been deleted except the blocks that are back-dated and unreferenced
  10. make API calls to delete some of the collections
  11. reduce replication on some of the collections
  12. run data manager again in "single run" mode
  13. wait for all keepstores to finish working their trash and pull lists
  14. get block index from all keepstores, make sure:
    • all blocks appearing in non-deleted collections were not deleted
    • all non-recent blocks appearing only in deleted collections were deleted
Along the way, the test suite must confirm that the test data includes
  • some blocks that appear in at least one "non-deleted" and at least one "deleted" collection
  • some blocks that appear in at least one "deleted" collection and no "non-deleted" collections, and are recent (i.e., written in the "transient" step, and therefore are not garbage)
  • some blocks that never appeared in any collections, but are recent
  • some blocks that never appeared in any collections, and are not recent
  • some blocks that do appear in collections, and are not recent (i.e., being referenced by a collection is the only reason they're not garbage)
  • of course, some blocks that actually get garbage-collected (i.e., the "garbage" set must not be empty!)
Minimal set of error cases to test:
  • Network error when getting collection list from API server. Must stop without deleting anything.
  • Configure with a token that is valid, but does not belong to an admin user (so "list collections" API will succeed, but not all collections will be returned). Must stop without deleting anything.

Ideally, the assessment of whether a block has been "deleted" should compare desired to actual replication level -- this way, the test won't start failing when data manager starts deleting some copies of over-replicated non-garbage blocks. But if this part threatens to be non-trivial, we should defer: better to get the issue at hand done, and adjust when needed.

When this is done and we are satisfied it's effective, keepstore will no longer need to force never_delete=true. See #6221


Subtasks

Task #7051: Review 6260-test-datamanager branches in arvados and arvados-dev repositories.ResolvedTom Clegg


Related issues

Related to Arvados - Story #3408: [Keep] Implement Production Data ManagerResolved07/29/2014

Related to Arvados - Bug #7252: [SDKs] Return errors instead of calling log.Fatal in code that needs to be testedResolved09/10/2015

Related to Arvados - Bug #7253: [Keep] [Data Manager] [SDKs] Test behavior in the face of API and manifest errorsResolved09/10/2015

Related to Arvados - Bug #7255: [Data Manager] Test some non-trivial manifests in integration testResolved09/10/2015

Associated revisions

Revision 5dd20e4a
Added by Tom Clegg over 4 years ago

Merge branch '6260-work-queue-status' refs #6260

Revision c43d72c6
Added by Radhika Chippada about 4 years ago

refs #6260
Merge branch '6260-test-datamanager'

Revision c43d72c6
Added by Radhika Chippada about 4 years ago

refs #6260
Merge branch '6260-test-datamanager'

Revision 07339690
Added by Radhika Chippada about 4 years ago

closes #6260
Merge branch '6260-test-datamanager'

History

#1 Updated by Peter Amstutz over 4 years ago

  • Category set to Keep

#2 Updated by Brett Smith over 4 years ago

  • Description updated (diff)
  • Category deleted (Keep)
  • Target version deleted (Arvados Future Sprints)
  • Story points deleted (2.0)

#3 Updated by Brett Smith over 4 years ago

  • Project changed from Arvados Private to Arvados
  • Parent task set to #6221

#4 Updated by Tom Clegg over 4 years ago

  • Description updated (diff)

#5 Updated by Tom Clegg over 4 years ago

  • Description updated (diff)

#6 Updated by Tom Clegg over 4 years ago

  • Parent task deleted (#6221)
  • Story points set to 1.0

#7 Updated by Tom Clegg over 4 years ago

  • Category set to Keep
  • Target version set to Arvados Future Sprints

#8 Updated by Brett Smith over 4 years ago

  • Tracker changed from Task to Story
  • Target version changed from Arvados Future Sprints to 2015-08-19 sprint

#9 Updated by Tom Clegg over 4 years ago

  • Tracker changed from Story to Task
  • Description updated (diff)
  • Target version changed from 2015-08-19 sprint to Arvados Future Sprints

#10 Updated by Brett Smith over 4 years ago

  • Tracker changed from Task to Story
  • Description updated (diff)
  • Target version changed from Arvados Future Sprints to 2015-08-19 sprint
  • Story points changed from 1.0 to 2.0

Two points per discussion.

#11 Updated by Tom Clegg over 4 years ago

  • Description updated (diff)

#12 Updated by Tom Clegg over 4 years ago

6260-work-queue-status refactors WorkQueue a bit so current pull and trash queue status can be retrieved
  • by clients via /status.json
  • by keepstore test cases via pullq.Status() and trashq.Status()

This should make it possible to write reliable integration tests that run as fast as possible.

#13 Updated by Radhika Chippada over 4 years ago

Reviewed branch 6260-work-queue-status at d89b7ae1f6fa35dd3627ead14c855751f1de2193

  • Given the amount of changes in this branch, I may not have gone through each line as thoroughly as I would have wanted. Overall, changes made sense and all tests passed.
  • I had difficulty liking the name “Gate” in volume_test. Can we call it “Sender” or something a bit more descriptive of what it actually does?
  • “ReportDone” => Did you mean to use the word “Report” as a verb here? Wondering if we can call it “DoneStatus”?
  • It seems a bit weird that status_test.go has only a func that is used by other tests to get the status of an item. Would it make sense to add at least one test in this file? Something like get status would be empty for unsupported names or when queues aren't initialized or something like that?

These are just nitpicking. LGTM either way.

#14 Updated by Tom Clegg over 4 years ago

  • Description updated (diff)

#15 Updated by Brett Smith over 4 years ago

  • Target version changed from 2015-08-19 sprint to 2015-09-02 sprint

Radhika is working on this, but started late, so we knew we wouldn't finish it this sprint. Moving it to the next, where we do intend to finish it.

#16 Updated by Radhika Chippada over 4 years ago

  • Assigned To set to Radhika Chippada

#17 Updated by Radhika Chippada over 4 years ago

  • Status changed from New to In Progress

#18 Updated by Tom Clegg over 4 years ago

  • Description updated (diff)

#19 Updated by Radhika Chippada over 4 years ago

Still a work in progress. Some notes (for manual testing during development):

  • Use the two branches: arvados -> 6260-test-datamanager and arvados-dev -> 6260-test-datamanager
  • To run the datamanager tests in the branch 6260-test-datamanager in arvados-dev
    cd ~/arvados-dev/jenkins
    ./run-tests.sh WORKSPACE=~/arvados --leave-temp --only services/datamanager
    

#20 Updated by Brett Smith over 4 years ago

  • Target version changed from 2015-09-02 sprint to 2015-09-16 sprint

#21 Updated by Tom Clegg over 4 years ago

  • Description updated (diff)

#22 Updated by Tom Clegg over 4 years ago

  • Description updated (diff)

#23 Updated by Tom Clegg over 4 years ago

  • Description updated (diff)

#24 Updated by Radhika Chippada over 4 years ago

commit b9ad383c4e27d4e3c1945e14ba51fffdd61fdb36 in branch 6260-test-datamanager

  • The "minimal" test listed in description is added with all listed criteria.
    • One note about, "Ideally, the assessment of whether a block has been "deleted" should compare desired to actual replication level": This was not done. The reason being, when I get a collection (created in the test using arv-put), the "replication_desired" and "replication_confirmed" values are nil and hence makes it not possible for me to use this info to verify the block indexes. Hence, sticking with the simpler solution for this (that is, assume that the default replication level is 2).
  • Regarding the list of "error tests" added yesterday as a result of discussions for the above test. Bottom line is datamanager code does not raise errors and handle errors; instead it is doing log.Fatalf for all errors. This is hence making it impossible to write tests because on the first failing test, the test run exits and no other tests run. Also, no way to "catch" an exit code in the test. Mocking it is not an option because if I were to mock the logutil.FatalWithMessage, the rest of the datamanager code will continue to execute, resulting in behavior that is not correct representation of datamanager. The only alternative I see is to make major cleanup of code to raise errors than fatalf on errors (which is out of scope for this ticket).
    • "Network error when getting collection list from API server" --- (In my test that I did not commit) I mocked the collection.GetColletions to mimic an API error and it does resulting in datamanager exit (which is what we want)
    • "Configure with a token that is valid, but does not belong to an admin user ... must stop without deleting anything" --- I set active or spectator user token a datamanager token and ran the datamanager tests and they pass (oops). Hence I updated the datamanager code to verify that the given datamanager token belongs to an admin user. (Unfortunately, I could not commit a test because of datamanager fatalf on any errors though.)
      • One question: In keepstore => handlers.go, do we want to also improve the IsDataManagerToken func to verify that the token is admin token?

#25 Updated by Tom Clegg over 4 years ago

Regarding the "impossible to test something that calls Fatal() instead of returning an error" problem: I'd say the logging code does need some major cleanup, but in the mantime I wonder how many of the specific issues here could be fixed just by doing "return err" instead of log.Fatalf(). It looks like "singlerun" has a couple of examples at the top where returning the error would work just as well.

In SetupDataManagerTest, instead of t.Fatal("Error discovering keep services"), how about t.Fatal(err) to show what the error was? (Fatal() will give a line# so it should already be obvious that the error came from DiscoverKeepServers().) Same comment for MakeArvadosClient() above it.

Using oldBlockLocators[0] as the "old, but used in a collection" example seems to detract from the (otherwise helpful) system of variable names and block contents. Could we use two sets, like oldUnusedBlocks and oldValuableBlocks?

There should be another call to VerifyBlocks() right after the first SingleRun + QueuesFinish, to ensure nothing got deleted (because nothing is backdated or deleted).

Without going to the trouble of accommodating default replication levels other than 2, it seems VerifyBlocks() could trivially count the number of copies stored. Currently, anything listed in expected is expected to appear at least twice. Even better, if GetBlockIndexes() returned a [][]string (a []string for each keepstore server), VerifyBlocks() can confirm that each expected block is stored on two different servers (not just two different disks).

The "reducing replication" test should not require that actual replication is higher than desired -- only that it's not lower. I.e., it should confirm minimum 1. (The current code looks like it requires replication == #servers, which will fail inappropriately when we run this test with more than 2 Keep servers (which, incidentally, would probably be a good idea) or Data Manager starts reducing replication.)

Does WaitUntilQueuesFinishWork typically take ~1 second? I'm thinking we could use a shorter interval here (say 100ms), what with the block counts/sizes being so tiny.

GetCollectionManifest looks like it returns the hash+size parts of the first locator referenced in the given collection's manifest. If this is correct, it should be called something like FirstLocatorFromCollection(). But what concerns me more is that it seems to mean we're only checking GC correctness for the first block in each manifest, which (although it doesn't contradict the spec here) seems like a pretty big testing gap.

I looked into the manifest-parsing code and it seems to hinge on a locator-matching regular expression in the blockdigest package (which is different from the private one in the keepclient package). The main problem here is that, with the existing code, if the blockdigest matcher ever rejects a locator that was passed by the Ruby code used in the API server, Data Manager will skip it, and that block and the rest of the blocks in the stream will be considered garbage. Given the dire consequences of ignoring such parsing errors in Data Manager, I think we need to
  1. look at every token during parseManifestStream, doing at least a superficial check to make sure it's the expected type: e.g., in each line, the first token must start with ".", everything after the last block locator must start with /\d+:\d+:/.
  2. add some way for parseManifestStream to propagate a parse error to the caller, so (in particular) Data Manager can detect that it has gone out of sync with the API server wrt valid manifest format, and abandon ship. (In this situation, it is not safe to delete anything.)
    (The above fixes seem to belong in a separate branch, possibly a separate story?)

In GetCollectionManifest, is there some reason to use exec.Command("arv-get", uuid) etc., instead of GetCollection()["manifest_text"].(string)?

The error check "Locator of the collection with the same data as old block is different %s" is a bit confusing. Shouldn't that trigger on !exists rather than exists? (I think it would also be good to include oldBlockLocators in the error message via "%+v" to help troubleshooting.)

At "Run data manager again" there is a sleep(1s). Is this necessary? (I wouldn't have expected a race condition here between the "update collection" API response and the beginning of the next DataManagerSingleRun() -- and if there is a race, do we know 1 second is enough?)

golint and gofmt have some style/formatting recommendations.

Now that there are additional bugfixes, I wonder if we should adjust the "if never_delete != true" condition in keepstore to say something along the lines of "if !never_delete && !running_in_testing_environment", instead of deleting it outright? That might avoid a "can't merge this branch until we fix other things" problem.

#26 Updated by Tom Clegg over 4 years ago

On the arvados-dev branch:
  • do_test "$g" go is the same for both sides of the condition, so it might as well just go after the fi
  • Should append to user-provided testargs instead of replacing.

#27 Updated by Radhika Chippada over 4 years ago

Regarding the "impossible to test something that calls Fatal() instead of returning an error" problem: I'd say the logging code does need some major cleanup, but in the meantime I wonder how many of the specific issues here could be fixed just by doing "return err" instead of log.Fatalf(). It looks like "singlerun" has a couple of examples at the top where returning the error would work just as well.

Updated datamanager.go to raise errors than Fatalf in most cases and added tests. To be able to write additional tests, we would need to enhance some of the internals such as the logging.

In SetupDataManagerTest, instead of t.Fatal("Error discovering keep services"), how about t.Fatal(err) to show what the error was? (Fatal() will give a line# so it should already be obvious that the error came from DiscoverKeepServers().) Same comment for MakeArvadosClient() above it.

Done

Using oldBlockLocators0 as the "old, but used in a collection" example seems to detract from the (otherwise helpful) system of variable names and block contents. Could we use two sets, like oldUnusedBlocks and oldValuableBlocks?

Done

There should be another call to VerifyBlocks() right after the first SingleRun + QueuesFinish, to ensure nothing got deleted (because nothing is backdated or deleted).

Done

Without going to the trouble of accommodating default replication levels other than 2, it seems VerifyBlocks() could trivially count the number of copies stored. Currently, anything listed in expected is expected to appear at least twice. Even better, if GetBlockIndexes() returned a [][]string (a []string for each keepstore server), VerifyBlocks() can confirm that each expected block is stored on two different servers (not just two different disks).

Done

The "reducing replication" test should not require that actual replication is higher than desired -- only that it's not lower. I.e., it should confirm minimum 1. (The current code looks like it requires replication == #servers, which will fail inappropriately when we run this test with more than 2 Keep servers (which, incidentally, would probably be a good idea) or Data Manager starts reducing replication.)

Done. (Verifies that there are at least 2 replicas among all the server, without assuming that there are only two servers)

Does WaitUntilQueuesFinishWork typically take ~1 second? I'm thinking we could use a shorter interval here (say 100ms), what with the block counts/sizes being so tiny.

Done

GetCollectionManifest looks like it returns the hash+size parts of the first locator referenced in the given collection's manifest. If this is correct, it should be called something like FirstLocatorFromCollection(). But what concerns me more is that it seems to mean we're only checking GC correctness for the first block in each manifest, which (although it doesn't contradict the spec here) seems like a pretty big testing gap.

Renamed as getFirstLocatorFromCollection. Since the goal of this test is to get the first locator from the manifest, because the data is very small and hence only one locator, I think this simplification is acceptable. I think we can do more when we address the next bullet point.

I looked into the manifest-parsing code and it seems to hinge on a locator-matching regular expression in the blockdigest package (which is different from the private one in the keepclient package). The main problem here is that, with the existing code, if the blockdigest matcher ever rejects a locator that was passed by the Ruby code used in the API server, Data Manager will skip it, and that block and the rest of the blocks in the stream will be considered garbage. Given the dire consequences of ignoring such parsing errors in Data Manager, I think we need to ...

I agree. Let’s please do this as a separate branch or ticket.

In GetCollectionManifest, is there some reason to use exec.Command("arv-get", uuid) etc., instead of GetCollection()["manifest_text"].(string)?

I did this to allow testing one more exposed api / command, rather than going the easy route of using the GetCollection that I already had :). Updated to use GetCollection now.

The error check "Locator of the collection with the same data as old block is different %s" is a bit confusing. Shouldn't that trigger on !exists rather than exists? (I think it would also be good to include oldBlockLocators in the error message via "%+v" to help troubleshooting.)

With used and unused old blocks, this is no longer there and should be clearer now.

At "Run data manager again" there is a sleep(1s). Is this necessary? (I wouldn't have expected a race condition here between the "update collection" API response and the beginning of the next DataManagerSingleRun() -- and if there is a race, do we know 1 second is enough?)

Yes, removed it.

golint and gofmt have some style/formatting recommendations.

Ran gofmt

Now that there are additional bugfixes, I wonder if we should adjust the "if never_delete != true" condition in keepstore to say something along the lines of "if !never_delete && !running_in_testing_environment", instead of deleting it outright? That might avoid a "can't merge this branch until we fix other things" problem.

Added back the never_delete block. The condition “running_in_testing_environment” is simulated by checking that the data_manager_token belongs to the admin user fixture, aka test environment, per our conversation.

arvados-dev branch

Made those changes and pushed to that branch as well.

Thanks.

#28 Updated by Radhika Chippada over 4 years ago

Reviewed Tom's updates to the 6260-test-datamanager branch to, among other things, to ensure ARVADOS_API_TOKEN is used rather than requiring another data_manager_token in datamanager implementation.

Everything looks good, except one question. Now that the data_manager_token file is no longer needed, I think we no longer need the following code block that I added in run_test_server.py:

    if os.path.exists(os.path.join(TEST_TMPDIR, "keep.data-manager-token-file")):
        os.remove(os.path.join(TEST_TMPDIR, "keep.data-manager-token-file"))

I removed it and pushed the update to the branch.

#29 Updated by Tom Clegg over 4 years ago

Radhika Chippada wrote:

Regarding the "impossible to test something that calls Fatal() instead of returning an error" problem: I'd say the logging code does need some major cleanup, but in the meantime I wonder how many of the specific issues here could be fixed just by doing "return err" instead of log.Fatalf(). It looks like "singlerun" has a couple of examples at the top where returning the error would work just as well.

Updated datamanager.go to raise errors than Fatalf in most cases and added tests. To be able to write additional tests, we would need to enhance some of the internals such as the logging.

Added #7252

The "reducing replication" test should not require that actual replication is higher than desired -- only that it's not lower. I.e., it should confirm minimum 1. (The current code looks like it requires replication = #servers, which will fail inappropriately when we run this test with more than 2 Keep servers (which, incidentally, would probably be a good idea) or Data Manager starts reducing replication.)

Done. (Verifies that there are at least 2 replicas among all the server, without assuming that there are only two servers)

At 902c331 it looks like it still checks for replication = len(keepServers). I've pushed b2614cf which adds a minReplication argument to verifyBlocks() and uses verifyBlocks(..., 1) to check the reduced-replication case.

GetCollectionManifest looks like it returns the hash+size parts of the first locator referenced in the given collection's manifest. If this is correct, it should be called something like FirstLocatorFromCollection(). But what concerns me more is that it seems to mean we're only checking GC correctness for the first block in each manifest, which (although it doesn't contradict the spec here) seems like a pretty big testing gap.

Renamed as getFirstLocatorFromCollection. Since the goal of this test is to get the first locator from the manifest, because the data is very small and hence only one locator, I think this simplification is acceptable. I think we can do more when we address the next bullet point.

Added #7255 to test manifests with more than one block.

I looked into the manifest-parsing code and it seems to hinge on a locator-matching regular expression in the blockdigest package (which is different from the private one in the keepclient package). The main problem here is that, with the existing code, if the blockdigest matcher ever rejects a locator that was passed by the Ruby code used in the API server, Data Manager will skip it, and that block and the rest of the blocks in the stream will be considered garbage. Given the dire consequences of ignoring such parsing errors in Data Manager, I think we need to ...

I agree. Let’s please do this as a separate branch or ticket.

Added #7253 to make manifest parsing less cavalier.

At "Run data manager again" there is a sleep(1s). Is this necessary? (I wouldn't have expected a race condition here between the "update collection" API response and the beginning of the next DataManagerSingleRun() -- and if there is a race, do we know 1 second is enough?)

Yes, removed it.

Seemed to still be there, but reduced to 100ms, so I removed it (as well as the one in the SingleRunRepeatedly test), seemingly with no ill effects. (This is OK, right?)

golint and gofmt have some style/formatting recommendations.

Ran gofmt

Please check golint too. :) e.g., "don't use ALL_CAPS in Go names; use CamelCase" and "var toBeDeletedCollectionUuid should be toBeDeletedCollectionUUID"

The rest of the changes LGTM, thanks!

#30 Updated by Tom Clegg over 4 years ago

Radhika Chippada wrote:

Everything looks good, except one question. Now that the data_manager_token file is no longer needed, I think we no longer need the following code block that I added in run_test_server.py:

[...]

I removed it and pushed the update to the branch.

I think the commit is good (don't delete the temp file when stopping keep) ... although the above note and the commit log message ("file is no longer needed") aren't quite correct. Data manager doesn't need to read that file, but keepstore still does!

#31 Updated by Radhika Chippada about 4 years ago

Clarified data_manager_token_file issue in the commit log. Made most (a whole lot) of golint suggested updates.

Also, one test in keepproxy, TestPostWithoutHash, needed to be updated. The response after post also include the auth hint, not just locator+size hint.

#32 Updated by Radhika Chippada about 4 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:073396900805175acfc32f974abfea91c456e986.

Also available in: Atom PDF