Bug #7167 (closed): [Deployment] Write an efficient Keep migration script
Added by Brett Smith over 9 years ago. Updated over 9 years ago.
Description
This is a script aimed at system administrators who are migrating a cluster from one installation to another. It copies Keep data from the old to the new cluster, in a way that's efficient both for the migration itself, and for accessing data on the destination cluster (in other words, blocks live on services early in their rendezvous hash order).
Functional requirements
- The script dynamically finds all blocks available on the source cluster. This can only be done by getting the "index" from each keepstore on the source side.
- Get each block from the source cluster exactly once, and write it to the destination cluster, using standard Keep APIs and algorithms (e.g., rendezvous hashing, checksum validation). This can be done with the existing Keep SDKs.
- Include a checkpointing mechanism so that if the process is interrupted, it has a record of what blocks have already been copied and doesn't re-send them. In the implementation below, the keep block index on the destination side serves as the checkpoint mechanism.
- When writing a block on the destination side, use the destination cluster's default replication level, as given in the discovery document.
- The "destination cluster" may just be a series of Keepstores that are being prepped to replace an existing cluster. It must be possible for the administrator to get data copied to that destination without an API server in front of them.
- (Eventually) Determine the desired replication level for each block by reading all collection records from the source cluster, and write to the destination cluster based on that information. (Until then, keep-rsync will use the destination cluster's default replication level, leaving further adjustments to the destination cluster's Data Manager after the database has been migrated.)
- Verify integrity of blocks that (according to the checkpoint/index data on the destination side) already exist on the destination side. For now, we assume that some other mechanism is responsible for ensuring corrupt blocks aren't listed in keepstore index responses.
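For illustration, here is a rough sketch of how the "destination index as checkpoint" idea above could work in Go, using the keepclient SDK calls that appear later in this ticket (LocalRoots, GetIndex). The kcSrc/kcDst variables, the helper name, and the import set (bufio, strings, plus the Go keepclient SDK) are assumptions for the sketch, not the actual keep-rsync code:

// locatorsToCopy returns the locators present in the source index but missing
// from the destination index (the destination index acts as the checkpoint).
func locatorsToCopy(kcSrc, kcDst *keepclient.KeepClient, prefix string) (map[string]bool, error) {
	index := func(kc *keepclient.KeepClient) (map[string]bool, error) {
		locators := map[string]bool{}
		for uuid := range kc.LocalRoots() {
			reader, err := kc.GetIndex(uuid, prefix)
			if err != nil {
				return nil, err
			}
			scanner := bufio.NewScanner(reader)
			for scanner.Scan() {
				// Each index line is "locator mtime"; keep only the locator.
				locators[strings.Split(scanner.Text(), " ")[0]] = true
			}
		}
		return locators, nil
	}
	srcLocators, err := index(kcSrc)
	if err != nil {
		return nil, err
	}
	dstLocators, err := index(kcDst)
	if err != nil {
		return nil, err
	}
	toCopy := map[string]bool{}
	for locator := range srcLocators {
		if !dstLocators[locator] {
			toCopy[locator] = true
		}
	}
	return toCopy, nil
}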
Implementation
keep-rsync will be written in Go. Source code will live in source:services/keep-rsync. Debian/RedHat packages, and the binaries they install, will be called keep-rsync.
- Accepts -src and -dst arguments and reads settings/conf files just like arv-copy.
- Reads ARVADOS_BLOB_SIGNING_KEY from the settings files in addition to the usual *_HOST, *_HOST_INSECURE, and *_TOKEN entries. The ARVADOS_API_TOKEN entry in each settings file must be the "data manager token" recognized by the relevant Keep servers.
- Accepts optional -dst-keep-services-json (and -src-keep-services-json for good measure) arguments, giving files whose contents look just like the output of "arv --json keep_services accessible". This will allow the user to control the dst/src Keep services in situations where the relevant API service isn't working/reachable/configured. If not given, let keepclient discover keep services as usual.
- Accepts a -replication argument (defaults to whatever is advertised in the "destination" discovery doc).
- Accepts a -prefix argument that passes through to index requests on both sides. This makes it possible to divide the work into (e.g.) 16 asynchronous jobs, one for each hex digit.
- Gets indexes from the source and destination keepstores.
- Gets data from source keepstores/keepproxy, stores in destination using configured replication level.
- Uses regular SDK functions to get and put blocks.
- Displays progress.
- "getting indexes: 10... 9... [...]" (count down number of indexes todo)
- "copying data block 1 of 1234 (0% done, ETA 2m3s): acbd18db4cc2f85cedef654fccc4a4d8+3"
Usage example
How to use in a migration:
- Turn off data manager on destination cluster.
- Run keep-rsync.
- Disable access to source cluster.
- Dump database and restore to destination cluster.
- Run keep-rsync again.
Updated by Brett Smith over 9 years ago
Notes from Ward: Optimize for the following, in order:
- Development time
- Egress bandwidth (this is what we pay the most for)
- Reliability
- Scalability (e.g., you can run more than one at a time to decrease total runtime)
Updated by Brett Smith over 9 years ago
- Target version changed from Arvados Future Sprints to 2015-09-30 sprint
Updated by Peter Amstutz over 9 years ago
What's the rationale for having this "pull" from the destination instead of "push" from the source? The issue of fetching the index and the potential security question of having a copy of the data manager key and blob signing key offsite both go away if the data is "pushed" from the source cluster to the target cluster.
Updated by Tom Clegg over 9 years ago
- Description updated (diff)
- Story points set to 5.0
Updated by Tom Clegg over 9 years ago
Peter Amstutz wrote:
What's the rationale for having this "pull" from the destination instead of "push" from the source? The issue of fetching the index and the potential security question of having a copy of the data manager key and blob signing key offsite both go away if the data is "pushed" from the source cluster to the target cluster.
As specified this can be used as either a "push" or a "pull": functionally, it doesn't matter whether you run it on the source cluster, the destination cluster, or somewhere else entirely, as long as you can access Keep (either keepproxy or the individual keepstore nodes) at both ends, and your bandwidth to both ends is compatible with your patience/deadlines.
If the destination index can't be fetched, keep-rsync would have to maintain its own independent "checkpoint state" information. A checkpoint state other than the actual destination index would not be nearly as convincing as an assurance that all blocks have been copied. For example, it would be left as a manual exercise to determine whether Data Manager on the destination side had garbage-collected some of the data that (according to the checkpoint) had already been copied.
It isn't the scenario at hand, but yes, it is possible to imagine a usage scenario where source and destination clusters don't trust one another enough for either one to see the other's data manager key, but do want to copy data wholesale. To support cases like this, we could do a couple of things:
- split the "data manager key" in two: one key for getting index, one for deleting blocks (this might be a good idea anyway). It would border on silly for the source cluster to refuse to give out its "get index" key even though it's agreeing to send all of its data, given how easy it is to change data manager keys after the copy finishes.
- add a "maintain own destination index" feature -- essentially a checkpoint file that can be populated from a keepstore index and updated on the fly. With the right set of features, this would be very flexible: the sysadmin could decide when to copy the destination index out-of-band, when to have keep-rsync update it by adding blocks as they're copied, and when to have keep-rsync update it by using the keepstore "index" API on the destination side.
Updated by Brett Smith over 9 years ago
- Target version changed from 2015-09-30 sprint to Arvados Future Sprints
Updated by Brett Smith over 9 years ago
- Target version changed from Arvados Future Sprints to 2015-10-14 sprint
Updated by Radhika Chippada over 9 years ago
- Assigned To set to Radhika Chippada
Updated by Radhika Chippada over 9 years ago
- Status changed from New to In Progress
Updated by Radhika Chippada over 9 years ago
commit f3b5ffc2a4409d4c1b676e40ba4e582ca2beeb61
Enhanced our testing environment to allow creating a second set of keepservers to facilitate keep-rsync testing.
Updated by Radhika Chippada over 9 years ago
Review comments for 7167-blob-sign-sdk branch:
- keepclient/perms.go
- golint suggests adding comments for ErrSignatureExpired etc (all 4 of the errors defined)
- Comment on VerifySignature says “ExpiredError” and “PermissionError”. It would be helpful to either use the same names as the defined errors or something like “signature expired error” (plain text rather than a nonexistent error definition name) …
- Would you hate too much the names expiryHex, expiryTime (instead of expHex, expTime) and signatureHex (in place of sigHex)?
- keepclient/perms_test.go is failing due to error definition renames; "undefined: PermissionError" and "undefined: ExpiredError"
- keepstore/permission_tests.go: since there are no other tests “without” a permission signature, I don’t see why “TestSignLocatorUsesPermissionSecret” can’t simply be “TestSignLocator” (same with the other test). If you prefer to use such a name, can we call it “TestSignLocator’With/Using’PermissionSecret”?
- While at it, maybe you can address the golint complaints you newly introduced in keepstore/azure_blob_volume.go?
Updated by Tom Clegg over 9 years ago
Radhika Chippada wrote:
Review comments for 7167-blob-sign-sdk branch:
- keepclient/perms.go
- golint suggests adding comments for ErrSignatureExpired etc (all 4 of the errors defined)
Fixed (and merged "Malformed" into "Invalid")
- Comment on VerifySignature says “ExpiredError” and “PermissionError”. It would be helpful to either use the same names as the defined errors or something like “signature expired error” (plain text rather than a nonexistent error definition name) …
Fixed
- Would you hate too much the names expiryHex, expiryTime (instead of expHex, expTime) and signatureHex (in place of sigHex)?
Fixed
- keepclient/perms_test.go is failing due to error definition renames; "undefined: PermissionError" and "undefined: ExpiredError"
Fixed (whoops, only ran keepstore tests, not sdk tests!)
- keepstore/permission_tests.go: since there are no other tests “without” a permission signature, I don’t see why “TestSignLocatorUsesPermissionSecret” can’t simply be “TestSignLocator” (same with the other test). If you prefer to use such a name, can we call it “TestSignLocator’With/Using’PermissionSecret”?
Fixed. (I was trying to say the only thing tested here that isn't already covered in the SDK tests is that PermissionSecret is actually being used.)
- While at it, maybe you can address the golint complaints you newly introduced in keepstore/azure_blob_volume.go?
Added this to #7159, since it's still open.
Updated by Radhika Chippada over 9 years ago
Branch 7167-blob-sign-sdk lgtm. Thanks.
Updated by Tom Clegg over 9 years ago
SDK
This "config struct" idea seemed good when we talked about it before, but now that I'm looking at it, it seems redundant. Is there an advantage to this:client := arvadosclient.New(arvadosclient.APIConfig{ APIToken: "foo", APIHost: "bar", })
- ...over this:
client := arvadosclient.ArvadosClient{ APIToken: "foo", APIHost: "bar", }
- IOW, it seems like the only real purpose of New() is to initialize the Client field. The Go stdlibs seems to use "leave Client nil to use the default", so that would be ideal. For this case I don't think we actually need to deal with that, though: we can move "readConfigFromFile" to
arvadosclient.NewFromFile(filename)
and initialize Client there. I suppose this means adding a BlobSigningKey field to the ArvadosClient struct, which seems like a reasonable change even though most clients will just leave it empty / not use it. (Conveniently, this should fix the bug that keep-rsync uses the same blobSigningKey global for both src and dst.)
Config file reader should use strings.SplitN(line, "=", 2), and use TrimSpace() on kv[0] and kv[1].
Config file reader should accept "foo" (a config file containing no "/") to mean "read config from $HOME/.config/arvados/foo.conf", like arv-copy does.
"initializeKeepRsync" looks like it should be called "setupKeepClients", and it looks like we could fix the use of globals something like this:- setupKeepClients returns two *keepclient.KeepClient and an error
- instead of tracking an arvSrc variable separately from kcSrc, just use kcSrc.Arvados when you need it
run_test_server
--keep-enforce-permissions should be action='store_true' instead of type=str.
enforce_permissions should not become a global. Just pass run_keep(enforce_permissions=args.keep_enforce_permissions).
The "--keep-existing" flag looks like trouble. It seems like a circuitous way to satisfy a rather specific and unusual use case: "start two keepstore processes that the API server doesn't know about, and one that it does know about." For the sake of getting keep-rsync working I can live with it. But I think it would be better to start keepstore procs directly from arvadostest without using the Python stuff at all, and do most of the rsync tests by specifying srcKeepServicesJSON
and dstKeepServicesJSON
(i.e., without contacting an API server at all).
keep-rsync
copyBlocksToDst should take kcSrc and kcDst as arguments, rather than using globals. (Same goes for the other new globals -- seems like they can all be passed in to the relevant functions as arguments.)
Combine these into one line ("Getting block %d of %d: %v"?):
+ log.Printf("Getting block %d of %d", done+1, total)
+
+ log.Printf("Getting block: %v", locator)
In getUniqueLocators, how about:
- if _, ok := uniqueLocators[locator]; !ok {
- uniqueLocators[locator] = true
- }
+ uniqueLocators[locator] = true
Also in getUniqueLocators, instead of building one long index and then splitting it, how about something like:
for uuid := range kc.LocalRoots() {
	reader, err := kc.GetIndex(uuid, indexPrefix)
	// [check err]
	scanner := bufio.NewScanner(reader)
	for scanner.Scan() {
		uniqueLocators[strings.Split(scanner.Text(), " ")[0]] = true
	}
}
main() already prints the error returned by performKeepRsync(), so it shouldn't be necessary to log errors from copyBlocksToDst() as well. For example, how about:
- log.Printf("Error putting block data: %q %v", locator, err)
- return err
+ return fmt.Errorf("Write error: %q %v", locator, err)
The log message "copying block" should say "Writing block %d of %d: %v" (pick one of %q and %v and use for both "reading" and "writing" messages)
keep-rsync tests
setupTestData has three copies of what looks like a verbose way to do something like this:

hash, _, _ := kcDst.PutB([]byte(fmt.Sprintf("test-data-%d", i)))
_, _, err := kcDst.Ask(getLocator)
c.Assert(err, IsNil)
dstLocators = append(dstLocators, hash) // hash will already be signed if necessary
- I don't think the extra assertions belong here. We're not testing keepclient, we're just setting up our keep-rsync test.
c.Check(localRoots != nil, Equals, true) should be c.Check(localRoots, Not(IsNil))
Updated by Radhika Chippada over 9 years ago
SDK
This "config struct" idea seemed good when we talked about it before, ...
Done
we can move "readConfigFromFile" to arvadosclient.NewFromFile(filename) and initialize Client there
With all the changes in (the one above and the one below), I did not see the need to move this and change / rearrange / add additional tests. Hence, I left this here. Maybe we can move this when we have another use case for it?
(Conveniently, this should fix the bug that keep-rsync uses the same blobSigningKey global for both src and dst.)
Fixed this
Config file reader should use strings.SplitN(line, "=", 2), and use TrimSpace() on kv[0] and kv[1]
Done
Config file reader should accept "foo" (a config file containing no "/") to mean "read config from $HOME/.config/arvados/foo.conf", like arv-copy does.
Done
"initializeKeepRsync" looks like it should be called "setupKeepClients",
Done
and it looks like we could fix the use of globals something like this:
Done and got rid of all the globals
run_test_server
--keep-enforce-permissions should be action='store_true' instead of type=str
Done
enforce_permissions should not become a global. Just pass run_keep(enforce_permissions=args.keep_enforce_permissions).
Done
The "--keep-existing" flag looks like trouble ...
Removed this and replaced it with "num-keep-servers". All three keep servers are created at once now; the first two are used as src and the third as dst
keep-rsync
copyBlocksToDst should take kcSrc and kcDst as arguments,
Done. Removed all globals
Combine these into one line ("Getting block %d of %d: %v"?):
Done
In getUniqueLocators, how about:
Done
Also in getUniqueLocators, instead of building one long index and then splitting it, how about something like:
Done
main() already prints the error returned by performKeepRsync(), so it shouldn't be necessary to log errors from copyBlocksToDst() as well. For example, how about:
Done
The log message "copying block" should say "Writing block %d of %d: %v" (pick one of %q and %v and use for both "reading" and "writing" messages)
Done
keep-rsync tests
setupTestData has three copies of what looks like a verbose way to do something like this:
Done. Also, since PutB also returns err, I added the check on PutB and did not see the need to do an Ask
c.Check(localRoots != nil, Equals, true) should be c.Check(localRoots, Not(IsNil))
Done. NotNil is available.
Updated by Tom Clegg over 9 years ago
I can live with using APIConfig this way in keep-rsync for now, but let's move it from SDK to keep-rsync, unexport it, and get rid of the redundant "consists of" comment.
Rename DiscoverKeepServersFromJSON to LoadKeepServicesFromJSON.
MakeKeepClientFromJSON doesn't really seem necessary: a caller can do MakeKeepClient() (and ignore any error) followed by LoadKeepServicesFromJSON(). How about func New(arv *arvadosclient.ArvadosClient) *KeepClient? This would be a step toward what I think would be a more reasonable API:

// always:
kc := keepclient.New()
// optional:
kc.LoadKeepServicesFromJSON(...)
// discover keep services automatically on first use if you didn't Load or Discover them yourself:
kc.Get(foo)

For now, "load or discover" will be a mandatory step when using the New() method. When we make the "automatic" way work, new code will be a bit easier to write, and existing code will continue to work.
MakeKeepClient (and any new variants) should initialize its "insecure" flag from arv.ApiInsecure instead of an env var.
In run_test_server.py, stop_keep() should also accept a num_servers argument.
In run_test_server.py, instead of the "if args.num_keep_servers is not None:" block, get argparse to do it -- say parser.add_argument('--num-keep-servers', default=2, ...), and pass args.num_keep_servers directly to run_keep(). (This should also let you remove "default is 2 keep servers" from the help text because argparse will already know/show what the default is.)
loadConfig, readConfigFromFile, and setupKeepClients have two copies of code ...

- srcConfig, dstConfig, srcBlobSigningKey, _, err := loadConfig(srcConfigFile, dstConfigFile)
+ srcConfig, srcBlobSigningKey, _, err := loadConfig(srcConfigFile)
+ if err != nil { ... }
+ dstConfig, _, _, err := loadConfig(dstConfigFile)
Help text for -replications should say that 0 means "use default replication level configured on destination server".
If -replications isn't provided and there's an error loading the destination discovery doc, return an error instead of using 2.
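A sketch of that fallback, using a hypothetical defaultReplication helper that reads the value from the destination discovery document (not an actual SDK call):

// If -replications was not given, fall back to the destination's configured
// default; fail rather than guessing if the discovery doc can't be read.
if replications == 0 {
	value, err := defaultReplication(arvDst) // hypothetical helper
	if err != nil {
		return fmt.Errorf("Error reading default replication level from destination: %v", err)
	}
	replications = value
}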
Updated by Radhika Chippada over 9 years ago
let's move it from SDK to keep-rsync, unexport it, and get rid of the redundant "consists of" comment.
Done
Rename DiscoverKeepServersFromJSON to LoadKeepServicesFromJSON
Done
MakeKeepClientFromJSON doesn't really seem necessary ... For now, "load or discover" will be a mandatory step when using the New() method
Done
MakeKeepClient (and any new variants) should initialize its "insecure" flag from arv.ApiInsecure instead of an env var.
Done
In run_test_server.py, stop_keep() should also accept a num_servers argument.
Done
In run_test_server.py, ... default=2 ...
Done
loadConfig, readConfigFromFile, and setupKeepClients have two copies of code ...
Done. readConfigFromFile was already reused
Help text for -replications should say that 0 means "use default replication level configured on destination server".
Done
If -replications isn't provided and there's an error loading the destination discovery doc, return an error instead of using 2.
Done
Updated by Tom Clegg over 9 years ago
The TestRsyncPutInOne_GetFromOtherShouldFail and TestRsyncWithBlobSigning_PutInOne_GetFromOtherShouldFail tests
- should each be about two lines long -- just setupRsync(c, whatever, true, 1) and then call a testNoCrosstalk helper twice, once for (src,dst) and once for (dst,src).
- don't need to check whether blobSigningKey is empty (SignLocator is a no-op if the signing key is empty)
- shouldn't do all those extra checks that are really just testing keepclient and keepstore -- testNoCrosstalk() probably just needs something like

dstLocator, _, err := kcDst.PutB([]byte("srcdata"))
c.Assert(err, Equals, nil)
dstLocator = strings.Split(dstLocator, "+A")[0]
_, _, _, err = kcSrc.Get(keepclient.SignLocator(dstLocator, ...))
c.Assert(err, NotNil)
c.Check(err.Error(), Equals, "Block not found")
Make dstIndex accessible to the individual test cases (like dstLocators etc) so TestKeepRsync_WithNoSuchPrefixInSrc can check len(dstIndex) == len(dstLocators), and TestKeepRsync_WithPrefix can check len(dstIndex) > len(dstLocators).
arvadostest.StartAPI() should move from setupRsync() to SetUpSuite() (we can leave the same one running throughout the tests, right?)
arvadostest.StopKeep() should move from TearDownSuite() to TearDownTest() (since we StartKeep in each test)
The setupRsync arguments updateDstReplications bool, replications int seem to be redundant. How about just pass zero as replications when testing "discovery doc default" cases, and drop updateDstReplications?
setupRsync calls setupKeepClients with a replications arg, but then it overwrites kcDst.Want_replicas by itself. Instead of overwriting, it should check kcDst.Want_replicas == replications (unless replications==0), right? (IOW, the checks done in TestInitializeRsyncReplicationsCount and TestInitializeRsyncDefaultReplicationsCount should be done right in setupRsync itself instead.)
The first if prefix "" {
case in testKeepRsync
looks unnecessary: the "else" case is correct whether or not prefix ""
.
The "else" side of the second if prefix == "" {
case should comment that the magic number 2 depends on the coincidence that len(dstLocators)
is 2
and len(dstLocatorsMatchingPrefix)
is 0
(or would be if we computed it) for all of the prefixes that we have tests for.
Probably defer: It looks like we aren't testing whether we waste time copying blocks that already exist in dst. Perhaps we could backdate the "dst" blocks before running rsync, then filter dstIndex by mtime, and make sure none of dstLocators were touched during the rsync.
After 1 block is done, the progress display should show ETA based on time elapsed and number of blocks done/todo. (We could go further and inspect sizes, but it won't make much difference to the outcome: either there are lots of blocks, in which case the time taken for an average "done" block is a good prediction of an average "todo" block, or there aren't lots of blocks, in which case ETA is small and probably doesn't matter that much.)
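For example, the ETA computation could be as simple as the following, assuming a start time recorded before the first copy and done/total block counters (names invented for the sketch):

// Estimate time remaining from the average time per block completed so far.
elapsed := time.Since(start)
if done > 0 {
	eta := time.Duration(int64(elapsed) / int64(done) * int64(total-done))
	log.Printf("%d of %d blocks done (%d%%), ETA %v", done, total, done*100/total, eta)
}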
Instead of ioutil.ReadAll() followed by kcDst.PutB(), how about:
reader, size, _, err := kcSrc.Get(getLocator)
// [check err]
_, _, err = kcDst.PutHR(getLocator[:32], reader, size)
// [check err]
Updated by Tom Clegg over 9 years ago
Some tests like TestLoadConfig don't actually use servers, so they shouldn't be part of ServerRequiredSuite (this should save a bunch of time starting and stopping Keep servers via Python wrapper).
We should log total #blocks on src and dst before starting to copy data (otherwise it's hard to see the difference between "not much to do because not much src data found" and "not much to do because dst is already nearly complete").
This should have+ ".conf"
at the end:
filename = os.Getenv("HOME") + "/.config/arvados/" + filename
- "The name of the source Arvados instance (required) - points at an Arvados config file. May be either a pathname to a config file, or (for example) "foo" as shorthand for $HOME/.config/arvados/foo.conf"
Updated by Radhika Chippada over 9 years ago
The TestRsyncPutInOne_GetFromOtherShouldFail and TestRsyncWithBlobSigning_PutInOne_GetFromOtherShouldFail tests
Updated
Make dstIndex accessible to the individual test cases
Done
arvadostest.StartAPI() should move from setupRsync() to SetUpSuite()
Done
arvadostest.StopKeep() should move from TearDownSuite() to TearDownTest()
Done (this was unintentional and now corrected)
The setupRsync arguments updateDstReplications bool, replications int seem to be redundant.
Updated
setupRsync calls setupKeepClients with a replications arg, but then it overwrites ...
Improved this
The first if prefix "" { case in testKeepRsync looks unnecessary: ...
Done
The "else" side of the second if prefix == "" { case should comment that ...
Updated comment
After 1 block is done, the progress display should show ETA ...
Done. The acronym ETA does not seem correct in this context. Maybe we should say, "estimated time remaining ..."?
Instead of ioutil.ReadAll() followed by kcDst.PutB(), how about ...
Done
Some tests like TestLoadConfig don't actually use servers ...
Added ServerNotRequiredSuite
We should log total #blocks on src and dst before starting to copy data
Done
This should have + ".conf" at the end: filename = os.Getenv("HOME") + "/.config/arvados/" + filename
Done
Update help message for -src and -dst. Maybe just copy the arv-copy help messages:
Updated comment
Updated by Tom Clegg over 9 years ago
- change conf args to -src and -dst to match spec and arv-copy
- update ETA logging (remove redundant "copying block X of X", compute ETA based on average time for all blocks done so far, not just the very first block)
- in arvadostest.go, always pass --num-keep-servers to python wrapper, and update comments
The rest LGTM, thanks. I'm waiting for the full test suite to finish. If the above changes look good & pass tests for you, please merge. Thanks!
Updated by Radhika Chippada over 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 67 to 100
Applied in changeset arvados|commit:e88a0a3ff00fd5448f893909d08ee11dc301776e.
Updated by Radhika Chippada over 9 years ago
(Even though I have not reopened this ticket "yet"), I was unhappy with the fact that a big chunk of the code in the keep-rsync main method was not tested. The main func had all the arg parsing code, which had not been tested. Unfortunately, this is a problem with all our Go code that has a main func. So, I researched a little, after discussing with Tom, and:
- restructured the keep-rsync main func: all the code from main is now moved into a separate doMain func.
- doMain returns an error on any failure, and main does the usual Fatalf on any error.
- Args are parsed using flag.FlagSet rather than the default flag package. This is needed if we need to set flags several times from different tests. (A minimal sketch of this pattern appears after this list.)
- Another set of tests was added to pass various combinations of good and bad flags. Also added a test that uses good config files and does an end-to-end keep-rsync test (though this does not actually do any copying to dst, it tests a lot of the other code).
- Happy to report that we can now test other Go code with a main func using this code as an example.
- Also, tested the script manually at command prompt using the command:
go run keep-rsync.go -src test -replications x -dst test
- commit c9018ad3b7f67f2f15a6a6671bf271858926586e has this update. Please review and let me know any comments.
- commit a9e08c498ac8126026d1226f34913f87c8e590a0 removes StartKeepWithParams and StopKeepWithParams funcs and exposes only StartKeep and StopKeep with the parameters. All the earlier references are updated.
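A minimal sketch of the doMain pattern described above (hypothetical skeleton with made-up flag handling, not the exact keep-rsync code):

package main

import (
	"flag"
	"fmt"
	"log"
	"os"
)

func main() {
	if err := doMain(os.Args[1:]); err != nil {
		log.Fatalf("%v", err)
	}
}

// doMain does all the real work and returns an error, so tests can call it
// repeatedly with different argument lists.
func doMain(args []string) error {
	flags := flag.NewFlagSet("keep-rsync", flag.ExitOnError)
	srcConfigFile := flags.String("src", "", "source configuration file")
	dstConfigFile := flags.String("dst", "", "destination configuration file")
	replications := flags.Int("replications", 0, "replication level; 0 means use the destination default")
	flags.Parse(args)
	if *srcConfigFile == "" || *dstConfigFile == "" {
		return fmt.Errorf("both -src and -dst are required")
	}
	log.Printf("requested replication level: %d", *replications)
	// ... load configs, set up src/dst keepclients, and run the actual copy here ...
	return nil
}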
Thanks.