Idea #8724
closed[Keep] Block validation script
Description
Write a script to verify that the Keep services on a cluster have a good copy of a given list of blocks. The use case is that you generate a list of existing md5sums before you do a keep-rsync, and run this script after the keep-rsync is done to verify that everything was copied over correctly.
The script takes as input:
- a file with md5sums to check, one per line
- a file that contains the Keep signing key
- API host+token (read from a specified file, as keep-rsync does)
- allow the admin to specify the keep services using a json (as keep-rsync does)
- an optional prefix argument
For each block named in the list of md5sums, it generates its own signature for the block, and sends HEAD requests to appropriate (e.g., non-proxy) Keep services in rendezvous hash order. If no Keep service returns 200 OK for a given block, the script reports that as an error, including the hash of the missing block. It should do this for every missing block (i.e., don't abort at the first missing block). Exit nonzero if any blocks were not found.
If a prefix argument is provided, the script only checks blocks that begin with that prefix, a la keep-rsync.
The tool is called keep-block-check. It lives under tools/
. For the switches that this tool accepts that keep-rsync also accepts (at least the signing key file and the prefix), this tool should use the same name for the switch that keep-rsync does.
Test that:
- It exits zero and writes nothing to stdout when it finds all the blocks.
- When the input includes a block that cannot be found, the tool exits nonzero and writes an error that mentions that the block cannot be found. Test this for cases where Keepstore returns:
- 401
- 404
- 500
Updated by Brett Smith almost 9 years ago
- Subject changed from validate that the copy c97qk and wx7k5 actually is valid to [Keep] Block validation script
- Target version changed from Kanban to Arvados Future Sprints
- Tracker changed from Bug to Idea
- Project changed from 40 to Arvados
Updated by Brett Smith almost 9 years ago
- Description updated (diff)
- Story points changed from 1.0 to 2.0
Updated by Brett Smith almost 9 years ago
- Target version changed from Arvados Future Sprints to 2016-04-13 sprint
Updated by Radhika Chippada over 8 years ago
- Status changed from New to In Progress
Updated by Tom Clegg over 8 years ago
Too bad about all the copy/pasted code between keep-rsync and here. I suppose this is mainly a sign our SDK needs improvement...
The prefix
string should probably be passed to getBlockLocators() instead of performKeepBlockCheck(). That way, we can skip the irrelevant block IDs earlier, and save some time & memory. Also, we won't need to add up totalBlocks, because we'll have len(blockLocators)
.
The way we build locatorMap
and locators
has a side effect of rearranging the given block list into an unpredictable order, which seems unnecessary and confusing/annoying (e.g., prevents user from resuming an interrupted run). How about something like this:
done := make(map[string]bool)
for _, line := range strings.Split(string(content), "\n") {
line = strings.TrimSpace(line)
if !strings.HasPrefix(line, prefix) || done[line] {
continue
}
locators = append(locators, line)
done[line] = true
}
The error "API config file not specified" should say "Client config file"
Instead of "Error getting head info for block" how about "Error verifying block %v: %v".
Instead of "Head information not found..." how about- "Totals: %d attempts, %d successes, %d errors".
- Log the totals regardless of whether there were any failures (currently there is no output at all if everything succeeded)
- Then return a non-nil error if there were any failures
readConfigFromFile()
crashes if the config file has a line with no "=":
$ ./keep-block-check -config <(echo foo) panic: runtime error: index out of range goroutine 1 [running]: panic(0x79f2a0, 0xc820010080) /usr/local/go/src/runtime/panic.go:464 +0x3e6 main.readConfigFromFile(0x7fff08e34499, 0xa, 0x0, 0x0, 0x0, 0x0, 0x490000, 0x0, 0x0, 0x0, ...) /tmp/arvados-build/GOPATH/src/git.curoverse.com/arvados.git/tools/keep-block-check/keep-block-check.go:114 +0x671
This wasn't part of the spec but I think some sort of progress indicator would be really useful. Perhaps log each block ("Checking block %d of %d: %v") if a "-v" flag is given?
Instead of writing logs to a temp file and then deleting it, how about writing them to a bytes.Buffer, and then sending the buffer contents to c.Log() at the end of the test? That way, it will still be silent by default, but if a test fails (or you use "-v") then the logs (which might be useful) get displayed.
Seems like the globals in keep-block-check_test.go could be in ServerRequiredSuite or DoMainTestSuite instead.
Tests would be tidier if doMain()
accepted an args []string
argument: main()
would pass os.Args[1:]
, and tests would just say doMain([]string{"whatever"})
instead of modifying and restoring the os.Args
global. This should also avoid confusion between the arguments given to "go test" and the arguments a test case is trying to simulate.
The testKeepServicesJSON
string should be reformatted with backticks, like
var testKeepServicesJSON = `{ "kind":"arvados#keepServiceList", "etag":"...", ... }`
checkErrorLog()
seems to do completely different things depending on whether blocks
is empty. How about separate functions like checkNoErrorsLogged(c)
and checkErrorsLogged(c, blocks, prefix, suffix)
?
The "zfhgfenh...."
string should move (from keep-rsync.go and keep-block-check.go) to source:sdk/go/arvadostest/fixtures.go as a BlobSigningKey const.
Updated by Radhika Chippada over 8 years ago
- The prefix string should probably be passed to getBlockLocators() instead of performKeepBlockCheck().
Done
- The way we build locatorMap and locators has a side effect of rearranging the given block list into an unpredictable order
Done
- The error "API config file not specified" should say "Client config file"
Done
- Instead of "Error getting head info for block" how about "Error verifying block %v: %v".
Done
- Instead of "Head information not found..." how about
Done
- readConfigFromFile() crashes if the config file has a line with no "=":
Addressed this
- This wasn't part of the spec but I think some sort of progress indicator would be really useful. Perhaps log each block ("Checking block %d of %d: %v") if a "-v" flag is given?
Done
- Instead of writing logs to a temp file and then deleting it, how about writing them to a bytes.Buffer ...
Done
- Seems like the globals in keep-block-check_test.go could be in ServerRequiredSuite or DoMainTestSuite instead.
Cleaned up and got rid off several of them.
- Tests would be tidier if doMain() accepted an args []string argument: main() would pass os.Args[1:] ...
Done
- The testKeepServicesJSON string should be reformatted with backticks, like
Done
- checkErrorLog() seems to do completely different things depending on whether blocks is empty. How about separate functions like checkNoErrorsLogged(c) and checkErrorsLogged(c, blocks, prefix, suffix)?
Done
- The "zfhgfenh...." string should move (from keep-rsync.go and keep-block-check.go) to source:sdk/go/arvadostest/fixtures.go as a BlobSigningKey const.
Done
Updated by Tom Clegg over 8 years ago
Thanks - LGTM @ f735a9c
With 4xphq-datamanager.conf like this...
ARVADOS_API_HOST=4xphq.arvadosapi.com ARVADOS_API_TOKEN={{data manager token here}} ARVADOS_BLOB_SIGNING_KEY={{blob signing key here}}
Test run looks like this...
tomclegg@shell.4xphq:~$ ./keep-block-check \ -config 4xphq-datamanager \ -block-hash-file <((. ~/.config/arvados/4xphq-datamanager.conf; curl -s -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" "http://keep0.4xphq.arvadosapi.com:25107/index/bb") | cut -d\ -f1) \ -keep-services-json "$(arv keep_service list)" \ -v 2016/04/12 17:32:37 Verifying block 1 of 92: bb02959e3b87aa052e39751729a8af44+29474304 2016/04/12 17:32:39 Verifying block 2 of 92: bb029c0878afd62803daa804a1a8bef3+33 2016/04/12 17:32:39 Verifying block 3 of 92: bb061f8d9d13594a8f49309e56016a63+67108864 ... 2016/04/12 17:34:27 Verifying block 90 of 92: bbf4fb94f10b1ce9bb7f132bc7f92e4c+11693 2016/04/12 17:34:27 Verifying block 91 of 92: bbf7a4e82e27eacfc28212e3d4c3e546+67108864 2016/04/12 17:34:36 Verifying block 92 of 92: bbf80bb9c0b98189d348759d8269d673+19004 2016/04/12 17:34:36 Verify block totals: 92 attempts, 92 successes, 0 errors
(For some reason I couldn't manage to get an index from 4xphq's keepproxy to do a more realistic test.)
Updated by Radhika Chippada over 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset 5026c691c3b2b402243acd4c2dd936aa7976ba2b .