Revision 12 (Tom Clegg, 02/18/2019 06:37 PM) → Revision 13/22 (Tom Clegg, 02/21/2019 10:10 PM)

h1. Migrating from arvados-node-manager to arvados-dispatch-cloud 

 {{toc}} 

 h2. Choose a node 

 The dispatch service can run on any host that can connect to the Arvados API service, the cloud provider's API, and the SSH service on cloud VMs. In the following example it runs on the same node as the API server and controller. 

 h2. Prepare key pair and worker VM image 

 Generate an SSH private key with no passphrase. Save it in the cluster configuration file (see @PrivateKey@ in the example below). 
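
 A minimal sketch of this step, assuming OpenSSH's @ssh-keygen@ is installed. The function names @generate_dispatch_key@ and @indent_for_yaml@, the file name @dispatch_key@, and the 8-space indent are illustrative choices, not part of Arvados. @-m PEM@ is used only so the output matches the @BEGIN RSA PRIVATE KEY@ format shown in the configuration example below.

```shell
# Hypothetical helper: generate a passphrase-less RSA key pair.
# $1 = directory to write the key pair into, $2 = key size in bits (default 4096)
generate_dispatch_key() {
  key_dir="$1"
  bits="${2:-4096}"
  # -N "" sets an empty passphrase; -m PEM writes the classic PEM private key
  # format; -q suppresses progress output; -C sets a free-form comment.
  ssh-keygen -q -t rsa -b "$bits" -m PEM -N "" \
    -C "arvados-dispatch-cloud" -f "$key_dir/dispatch_key"
}

# Hypothetical helper: indent a key file so it can be pasted under "PrivateKey: |".
# The 8-space indent is an assumption; adjust it to match your config.yml.
indent_for_yaml() {
  sed 's/^/        /' "$1"
}
```

 For example, @generate_dispatch_key /root/keys@ followed by @indent_for_yaml /root/keys/dispatch_key@ prints the private key ready to paste into the cluster configuration file. The directory is a placeholder and must already exist.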

 If you are using Azure, the dispatcher creates a login account and installs your public key automatically, so you do *not* need to add the corresponding public key to an authorized_keys file in the VM image or anywhere else. 

 Prepare a worker VM image. The image needs docker, arv-mount (from the python-arvados-fuse package), and crunch-run installed. The version of crunch-run must be new enough to include commit:2873d55ea (TODO: when merged/published, give minimum package version instead of commit). 

 h2. Update cluster configuration file 

 In @/etc/arvados/config.yml@, add configuration items for the dispatch service. 

 <pre><code class="yaml"> 
 Clusters: 
   zzzzz: 
     CloudVMs: 
       BootProbeCommand: "mount | grep /mnt/scratch" 
       SSHPort: "2222" 
       SyncInterval: 1m 
       TimeoutIdle: 2m 
       TimeoutBooting: 10m 
       TimeoutProbe: 5m 
       TimeoutShutdown: 30s 
       ImageID: "https://zzzzzzzz.blob.core.windows.net/system/Microsoft.Compute/Images/images/zzzzz-compute-osDisk.55555555-5555-5555-5555-555555555555.vhd" 
       Driver: azure 
       DriverParameters: 
         SubscriptionID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX 
         ClientID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX 
         ClientSecret: 2WyXt0XFbEtutnf2hp528t6Wk9S5bOHWkRaaWwavKQo= 
         TenantID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX 
         CloudEnvironment: AzurePublicCloud 
         ResourceGroup: zzzzz 
         Location: centralus 
         Network: zzzzz 
         Subnet: zzzzz-subnet-private 
         StorageAccount: example 
         BlobContainer: vhds 
         DeleteDanglingResourcesAfter: 20s 
         AdminUsername: arvados 
     Dispatch: 
       PrivateKey: | 
         -----BEGIN RSA PRIVATE KEY----- 
         MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ 
         0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E 
         GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV 
         mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7 
         ... 
         JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN 
         ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI 
         pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9 
         1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ 
         -----END RSA PRIVATE KEY----- 
       StaleLockTimeout: 1m 
       PollInterval: 10s 
       ProbeInterval: 10s 
       MaxProbesPerSecond: 10 
     InstanceTypes: 
       x1lg: 
         ProviderType: x1.large 
         VCPUs: 16 
         RAM: 128G 
         Scratch: 128G 
         Price: 1.23 
     ManagementToken: "example-secret-management-token" 
     NodeProfiles: 
       dispatcher:                         # references ARVADOS_NODE_PROFILE in environment file (see below). 
         arvados-dispatch-cloud: 
           Listen: ":9006" 
 </code></pre> 

 Create the host configuration file @/etc/arvados/environment@. 

 <pre> 
 ARVADOS_NODE_PROFILE=dispatcher 
 </pre> 

 h2. Stop crunch-dispatch-slurm 

 Stop and disable the crunch-dispatch-slurm service, and uninstall the package to make sure it doesn't start after the next reboot/upgrade. 

 <pre> 
 # systemctl stop crunch-dispatch-slurm 
 # systemctl disable crunch-dispatch-slurm 
 # apt-get remove crunch-dispatch-slurm 
 </pre> 

 Containers that have already been locked and submitted to SLURM will make their way through the SLURM queue, but newly queued containers will be left for arvados-dispatch-cloud to run. 

 

 h2. Install arvados-dispatch-cloud 

 <pre> 
 # apt-get install arvados-dispatch-cloud 
 </pre> 

 For now, the @ARVADOS_API_HOST@ and @ARVADOS_API_TOKEN@ environment variables must be provided (future versions will get these values from the config file). You can use @systemctl edit arvados-dispatch-cloud@ to set them through a systemd drop-in: 

 <pre> 
 [Service] 
 Environment=ARVADOS_API_HOST=zzzzz.arvadosapi.com 
 Environment=ARVADOS_API_TOKEN=zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz 
 </pre> 
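
 If you prefer a non-interactive alternative to @systemctl edit@, the same drop-in can be written by a script. The following is a sketch only: the function name @write_dispatch_dropin@ is hypothetical, and it assumes the unit is named @arvados-dispatch-cloud@ (as in the @journalctl@ command later on this page) and that the drop-in file is called @override.conf@, which is the name @systemctl edit@ uses.

```shell
# Hypothetical helper: write a systemd drop-in that sets the two variables.
# $1 = drop-in directory, $2 = API host, $3 = API token
write_dispatch_dropin() {
  dropin_dir="$1"
  mkdir -p "$dropin_dir" || return 1
  # The file contains a secret, so create it readable by its owner only.
  # The subshell keeps the umask change from leaking into the caller.
  (
    umask 077
    printf '[Service]\nEnvironment=ARVADOS_API_HOST=%s\nEnvironment=ARVADOS_API_TOKEN=%s\n' \
      "$2" "$3" > "$dropin_dir/override.conf"
  )
}
```

 Run as root with @/etc/systemd/system/arvados-dispatch-cloud.service.d@ as the directory, then run @systemctl daemon-reload@ and @systemctl restart arvados-dispatch-cloud@ so the service picks up the new environment.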


 h2. Verify the service is running 

 <pre> 
 $ token="example-secret-management-token" 
 $ curl -H "Authorization: Bearer $token" http://localhost:9006/metrics 
 </pre> 
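
 For scripted checks, it can help to look at the HTTP status code rather than the response body. This is a sketch: the function name @check_dispatch_metrics@ and the 5-second timeout are illustrative, and the port is assumed to be the @Listen@ value (@:9006@) from the configuration example above.

```shell
# Hypothetical helper: succeed only if the metrics endpoint answers with HTTP 200.
# $1 = management token, $2 = metrics URL
check_dispatch_metrics() {
  # -s: silent; -o /dev/null: discard the body; -w: print only the status code;
  # --max-time: give up after 5 seconds. curl prints 000 if it cannot connect.
  status=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' \
    -H "Authorization: Bearer $1" "$2")
  if [ "$status" = "200" ]; then
    echo "metrics endpoint OK"
  else
    echo "metrics endpoint returned HTTP $status" >&2
    return 1
  fi
}
```

 For example: @check_dispatch_metrics "$token" http://localhost:9006/metrics@. A non-200 status suggests the service is not listening on that port or did not accept the token.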

 h2. Verify the service is functional 

 Watch the dispatcher's logs while you run an Arvados container: 

 <pre> 
 # journalctl -ocat -fu arvados-dispatch-cloud 
 </pre> 

 Example logs: 

 <pre> 
 Starting Arvados dispatch cloud... 
 {"Listen":"[::]:9006","PID":46639,"Service":"arvados-dispatch-cloud","level":"info","msg":"listening","time":"2019-02-18T18:10:33.550358536Z"} 
 Started Arvados dispatch cloud. 
 {"PID":46639,"level":"info","msg":"FixStaleLocks starting.","time":"2019-02-18T18:10:33.706568502Z"} 
 {"PID":46639,"level":"info","msg":"FixStaleLocks finished (34.717µs), starting scheduling.","time":"2019-02-18T18:10:33.706606521Z"} 
 {"N":0,"PID":46639,"level":"info","msg":"loaded initial instance list","time":"2019-02-18T18:10:33.982989844Z"} 
 {"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","InstanceType":"Standard_D1_v2","PID":46639,"Priority":1124349393114703,"State":"Queued","level":"info","msg":"added container to queue","time":"2019-02-18T18:15:33.620474859Z"} 
 {"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","InstanceType":"Standard_D1_v2","PID":46639,"level":"info","msg":"creating new instance","time":"2019-02-18T18:15:33.711915757Z"} 
 {"Address":"10.25.64.8","IdleBehavior":"run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"State":"unknown","level":"info","msg":"instance appeared in cloud","time":"2019-02-18T18:16:34.512277597Z"} 
 {"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"dial tcp 10.25.64.8:2222: connect: connection refused","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:16:43.386626115Z"} 
 {"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"dial tcp 10.25.64.8:2222: connect: connection refused","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:16:53.381814784Z"} 
 {"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"Process exited with status 1","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:17:04.897049820Z"} 
 {"Address":"10.25.64.8","Command":"/bin/ls /arvados-compute-node-boot.complete    \u003e/dev/null 2\u003e\u00261 \u0026\u0026 sudo wget --quiet https://c97qk.arvadosapi.com/crunch-run --output-document=/usr/bin/crunch-run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2019-02-18T18:17:33.866226306Z"} 
 {"Address":"10.25.64.8","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:17:33.357445643Z","level":"info","msg":"instance booted; will try probeRunning","time":"2019-02-18T18:17:33.866286736Z"} 
 {"Address":"10.25.64.8","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:17:33.357445643Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2019-02-18T18:17:33.886177977Z"} 
 {"Address":"10.25.64.8","ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"Priority":1124349393114703,"level":"info","msg":"crunch-run process started","time":"2019-02-18T18:17:33.903776572Z"} 
 {"Address":"10.25.64.8","ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:19:03.315677365Z","level":"info","msg":"crunch-run process ended","time":"2019-02-18T18:19:03.334773308Z"} 
 {"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","PID":46639,"State":"Complete","level":"info","msg":"dropped container from queue","time":"2019-02-18T18:19:13.500015828Z"} 
 {"Address":"10.25.64.8","Age":129980928971,"IdleBehavior":"run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"State":"idle","level":"info","msg":"shutdown idle worker","time":"2019-02-18T18:21:13.255492437Z"} 
 {"PID":46639,"level":"info","msg":"Will delete compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-nic because it is older than 20s","time":"2019-02-18T18:22:35.044805153Z"} 
 {"Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","PID":46639,"WorkerState":"shutdown","level":"info","msg":"instance disappeared in cloud","time":"2019-02-18T18:22:35.086986833Z"} 
 {"PID":46639,"level":"info","msg":"Deleted NIC compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-nic","time":"2019-02-18T18:22:45.273921501Z"} 
 {"PID":46639,"level":"info","msg":"Blob compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-os.vhd is unlocked and not modified for 209.176892224 seconds, will delete","time":"2019-02-18T18:25:33.188314532Z"} 
 {"PID":46639,"level":"info","msg":"Deleted blob compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-os.vhd","time":"2019-02-18T18:25:33.194356552Z"} 
 </pre>
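
 Because each log entry is a single-line JSON object, ordinary text tools are enough to narrow the output down. The helpers below are a sketch with hypothetical names (@container_events@, @warnings_only@). They rely only on the field layout visible in the example logs above, which is an assumption about the output of other versions.

```shell
# Hypothetical helper: keep only JSON log lines (dropping systemd's plain-text
# "Starting..." / "Started..." lines) that mention the container UUID in $1.
container_events() {
  grep '^{' | grep -F "\"ContainerUUID\":\"$1\""
}

# Hypothetical helper: keep only JSON log lines logged at warning level,
# such as the "probe failed" entries.
warnings_only() {
  grep '^{' | grep -F '"level":"warning"'
}
```

 For example, @journalctl -ocat -u arvados-dispatch-cloud | container_events c97qk-dz642-qxo0qjp93y2k4ht@ follows one container from "added container to queue" to "dropped container from queue". In the example above, the "probe failed" warnings appear while the new instance is still booting, before "boot probe succeeded" is logged.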