Bug #7167

Updated by Tom Clegg over 8 years ago

This is a script aimed at system administrators who are migrating a cluster from one installation to another. It copies Keep data from the old cluster to the new one in a way that is efficient both for the migration itself and for subsequent access on the destination cluster (in other words, blocks end up on services early in their rendezvous hash order).

 h2. Functional requirements 

 * The script dynamically finds all blocks available on the source cluster. _This can only be done by getting the "index" from each keepstore on the source side._ 
 * Get each block from the source cluster exactly once, and write it to the destination cluster, using standard Keep APIs and algorithms (e.g., rendezvous hashing, checksum validation). _This can be done with the existing Keep SDKs._ 
 * Include a checkpointing mechanism so that if the process is interrupted, it has a record of which blocks have already been copied and doesn't re-send them. _In the implementation below, the Keep block index on the destination side serves as the checkpoint; see the sketch after this list._ 
 * When writing a block on the destination side, use the destination cluster's default replication level, as given in the discovery document. 
 * The "destination cluster" may just be a series of Keepstores that are being prepped to replace an existing cluster.    It must be possible for the administrator to get data copied to that destination without an API server in front of them. 

 Possible future work (specifically excluded from the requirements here): 
 * Determine the desired replication level for each block by reading all collection records from the source cluster, and write to the destination cluster based on that information. (Until then, keep-rsync will use the destination cluster's default replication level, leaving further adjustments to the destination cluster's Data Manager after the database has been migrated.) 
 * Verify integrity of blocks that (according to the checkpoint/index data on the destination side) already exist on the destination side. For now, we assume that some other mechanism is responsible for ensuring corrupt blocks aren't listed in keepstore index responses. 

 h2. Implementation 

 keep-rsync will be written in Go. Source code will live in source:services/keep-rsync. Debian/RedHat packages, and the binaries they install, will be called keep-rsync. 
 * Accepts @-src@ and @-dst@ arguments naming which settings/conf files to read, just like arv-copy (see the settings sketch after this list). 
 ** Reads @ARVADOS_BLOB_SIGNING_KEY@ from the settings files in addition to the usual @*_HOST@, @*_HOST_INSECURE@, and @*_TOKEN@ entries. The @ARVADOS_API_TOKEN@ entry in each settings file must be the "data manager token" recognized by the relevant Keep servers. 
 * Accepts _optional_ @-dst-keep-services-json@ (and @-src-keep-services-json@ for good measure) arguments, giving files whose contents look just like the output of "arv --json keep_services accessible". This will allow the user to control the dst/src Keep services in situations where the relevant API service isn't working/reachable/configured. If not given, let keepclient discover keep services as usual. 
 * Accepts a @-replication@ argument (defaulting to whatever is advertised in the "destination" discovery doc). 
 * Accepts a @-prefix@ argument that passes through to index requests on both sides. This makes it possible to divide the work into (e.g.) 16 asynchronous jobs, one for each hex digit (see the usage example below). 
 * Gets indexes from the source and destination keepstores. 
 * Gets data from the source keepstores/keepproxy and stores it in the destination using the configured replication level. 
 * Uses regular SDK functions to get and put blocks (see the copy-loop sketch after this list). 
 * Displays progress. 
 ** "getting indexes: 10... 9... [...]" (count down number of indexes todo) 
 ** "copying data block 1 of 1234 (0% done, ETA 2m3s): acbd18db4cc2f85cedef654fccc4a4d8+3" 

 h2. Usage example 

 How to use in a migration: 
 * Turn off data manager on destination cluster. 
 * Run keep-rsync (see the example invocation below). 
 * Disable access to source cluster. 
 * Dump database and restore to destination cluster. 
 * Run keep-rsync again to copy any blocks that arrived after the first pass. 
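
 For instance, the two keep-rsync passes might look like this (a hypothetical invocation, assuming settings files named @src@ and @dst@, and using @-prefix@ to split the work into 16 jobs as described above):

<pre>
# First pass, while the source cluster is still live:
for p in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
  keep-rsync -src src -dst dst -prefix $p &
done
wait

# Second pass, after disabling access and migrating the database:
keep-rsync -src src -dst dst
</pre>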
