Story #8724

[Keep] Block validation script

Added by Nico C├ęsar about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
-
Target version:
Start date:
03/16/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
2.0

Description

Write a script to verify that the Keep services on a cluster have a good copy of a given list of blocks. The use case is that you generate a list of existing md5sums before you do a keep-rsync, and run this script after the keep-rsync is done to verify that everything was copied over correctly.

The script takes as input:

  • a file with md5sums to check, one per line
  • a file that contains the Keep signing key
  • API host+token (read from a specified file, as keep-rsync does)
  • allow the admin to specify the keep services using a json (as keep-rsync does)
  • an optional prefix argument

For each block named in the list of md5sums, it generates its own signature for the block, and sends HEAD requests to appropriate (e.g., non-proxy) Keep services in rendezvous hash order. If no Keep service returns 200 OK for a given block, the script reports that as an error, including the hash of the missing block. It should do this for every missing block (i.e., don't abort at the first missing block). Exit nonzero if any blocks were not found.

If a prefix argument is provided, the script only checks blocks that begin with that prefix, a la keep-rsync.

The tool is called keep-block-check. It lives under tools/. For the switches that this tool accepts that keep-rsync also accepts (at least the signing key file and the prefix), this tool should use the same name for the switch that keep-rsync does.

Test that:

  • It exits zero and writes nothing to stdout when it finds all the blocks.
  • When the input includes a block that cannot be found, the tool exits nonzero and writes an error that mentions that the block cannot be found. Test this for cases where Keepstore returns:
    • 401
    • 404
    • 500

Subtasks

Task #8854: Review branch: 8724-keep-block-check-scriptResolvedRadhika Chippada


Related issues

Related to Arvados - Bug #8878: Keep: sudden appearance of "missing" blocksClosed04/04/2016

Blocked by Arvados - Bug #8702: [Keep] HEAD request should do a checksum of the blob without sending itResolved03/15/2016

Associated revisions

Revision 5026c691
Added by Radhika Chippada about 5 years ago

closes #8724
Merge branch '8724-keep-block-check-script'

History

#1 Updated by Brett Smith about 5 years ago

  • Target version set to Kanban

#2 Updated by Brett Smith about 5 years ago

  • Description updated (diff)

#3 Updated by Brett Smith about 5 years ago

  • Subject changed from validate that the copy c97qk and wx7k5 actually is valid to [Keep] Block validation script
  • Target version changed from Kanban to Arvados Future Sprints
  • Tracker changed from Bug to Story
  • Project changed from OPS to Arvados

#4 Updated by Brett Smith about 5 years ago

  • Description updated (diff)
  • Story points changed from 1.0 to 2.0

#5 Updated by Brett Smith about 5 years ago

  • Target version changed from Arvados Future Sprints to 2016-04-13 sprint

#6 Updated by Brett Smith about 5 years ago

  • Assigned To set to Radhika Chippada

#7 Updated by Brett Smith about 5 years ago

  • Description updated (diff)

#8 Updated by Radhika Chippada about 5 years ago

  • Status changed from New to In Progress

#9 Updated by Radhika Chippada about 5 years ago

  • Description updated (diff)

#10 Updated by Tom Clegg about 5 years ago

Too bad about all the copy/pasted code between keep-rsync and here. I suppose this is mainly a sign our SDK needs improvement...

The prefix string should probably be passed to getBlockLocators() instead of performKeepBlockCheck(). That way, we can skip the irrelevant block IDs earlier, and save some time & memory. Also, we won't need to add up totalBlocks, because we'll have len(blockLocators).

The way we build locatorMap and locators has a side effect of rearranging the given block list into an unpredictable order, which seems unnecessary and confusing/annoying (e.g., prevents user from resuming an interrupted run). How about something like this:

done := make(map[string]bool)
for _, line := range strings.Split(string(content), "\n") {
        line = strings.TrimSpace(line)
        if !strings.HasPrefix(line, prefix) || done[line] {
                continue
        }
        locators = append(locators, line)
        done[line] = true
}

The error "API config file not specified" should say "Client config file"

Instead of "Error getting head info for block" how about "Error verifying block %v: %v".

Instead of "Head information not found..." how about
  • "Totals: %d attempts, %d successes, %d errors".
  • Log the totals regardless of whether there were any failures (currently there is no output at all if everything succeeded)
  • Then return a non-nil error if there were any failures

readConfigFromFile() crashes if the config file has a line with no "=":

$ ./keep-block-check -config <(echo foo)
panic: runtime error: index out of range

goroutine 1 [running]:
panic(0x79f2a0, 0xc820010080)
    /usr/local/go/src/runtime/panic.go:464 +0x3e6
main.readConfigFromFile(0x7fff08e34499, 0xa, 0x0, 0x0, 0x0, 0x0, 0x490000, 0x0, 0x0, 0x0, ...)
    /tmp/arvados-build/GOPATH/src/git.curoverse.com/arvados.git/tools/keep-block-check/keep-block-check.go:114 +0x671

This wasn't part of the spec but I think some sort of progress indicator would be really useful. Perhaps log each block ("Checking block %d of %d: %v") if a "-v" flag is given?

Instead of writing logs to a temp file and then deleting it, how about writing them to a bytes.Buffer, and then sending the buffer contents to c.Log() at the end of the test? That way, it will still be silent by default, but if a test fails (or you use "-v") then the logs (which might be useful) get displayed.

Seems like the globals in keep-block-check_test.go could be in ServerRequiredSuite or DoMainTestSuite instead.

Tests would be tidier if doMain() accepted an args []string argument: main() would pass os.Args[1:], and tests would just say doMain([]string{"whatever"}) instead of modifying and restoring the os.Args global. This should also avoid confusion between the arguments given to "go test" and the arguments a test case is trying to simulate.

The testKeepServicesJSON string should be reformatted with backticks, like

var testKeepServicesJSON = `{
 "kind":"arvados#keepServiceList",
 "etag":"...",
 ...
}`

checkErrorLog() seems to do completely different things depending on whether blocks is empty. How about separate functions like checkNoErrorsLogged(c) and checkErrorsLogged(c, blocks, prefix, suffix)?

The "zfhgfenh...." string should move (from keep-rsync.go and keep-block-check.go) to source:sdk/go/arvadostest/fixtures.go as a BlobSigningKey const.

#11 Updated by Radhika Chippada about 5 years ago

  • The prefix string should probably be passed to getBlockLocators() instead of performKeepBlockCheck().

Done

  • The way we build locatorMap and locators has a side effect of rearranging the given block list into an unpredictable order

Done

  • The error "API config file not specified" should say "Client config file"

Done

  • Instead of "Error getting head info for block" how about "Error verifying block %v: %v".

Done

  • Instead of "Head information not found..." how about

Done

  • readConfigFromFile() crashes if the config file has a line with no "=":

Addressed this

  • This wasn't part of the spec but I think some sort of progress indicator would be really useful. Perhaps log each block ("Checking block %d of %d: %v") if a "-v" flag is given?

Done

  • Instead of writing logs to a temp file and then deleting it, how about writing them to a bytes.Buffer ...

Done

  • Seems like the globals in keep-block-check_test.go could be in ServerRequiredSuite or DoMainTestSuite instead.

Cleaned up and got rid off several of them.

  • Tests would be tidier if doMain() accepted an args []string argument: main() would pass os.Args[1:] ...

Done

  • The testKeepServicesJSON string should be reformatted with backticks, like

Done

  • checkErrorLog() seems to do completely different things depending on whether blocks is empty. How about separate functions like checkNoErrorsLogged(c) and checkErrorsLogged(c, blocks, prefix, suffix)?

Done

Done

#12 Updated by Tom Clegg about 5 years ago

Thanks - LGTM @ f735a9c

With 4xphq-datamanager.conf like this...

ARVADOS_API_HOST=4xphq.arvadosapi.com
ARVADOS_API_TOKEN={{data manager token here}}
ARVADOS_BLOB_SIGNING_KEY={{blob signing key here}}

Test run looks like this...

tomclegg@shell.4xphq:~$ ./keep-block-check \
 -config 4xphq-datamanager \
 -block-hash-file <((. ~/.config/arvados/4xphq-datamanager.conf; curl -s -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" "http://keep0.4xphq.arvadosapi.com:25107/index/bb") | cut -d\  -f1) \
 -keep-services-json "$(arv keep_service list)" \
 -v
2016/04/12 17:32:37 Verifying block 1 of 92: bb02959e3b87aa052e39751729a8af44+29474304
2016/04/12 17:32:39 Verifying block 2 of 92: bb029c0878afd62803daa804a1a8bef3+33
2016/04/12 17:32:39 Verifying block 3 of 92: bb061f8d9d13594a8f49309e56016a63+67108864
...
2016/04/12 17:34:27 Verifying block 90 of 92: bbf4fb94f10b1ce9bb7f132bc7f92e4c+11693
2016/04/12 17:34:27 Verifying block 91 of 92: bbf7a4e82e27eacfc28212e3d4c3e546+67108864
2016/04/12 17:34:36 Verifying block 92 of 92: bbf80bb9c0b98189d348759d8269d673+19004
2016/04/12 17:34:36 Verify block totals: 92 attempts, 92 successes, 0 errors

(For some reason I couldn't manage to get an index from 4xphq's keepproxy to do a more realistic test.)

#13 Updated by Radhika Chippada about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Also available in: Atom PDF