h1. Migrating from arvados-node-manager to arvados-dispatch-cloud

Version 19. Tom Clegg, 03/25/2019 05:18 PM

{{toc}}

h2. Choose a node

The dispatch service can run on any host that can connect to the Arvados API service, the cloud provider's API, and the SSH service on cloud VMs. In the following example it runs on the same node as the API server and controller.

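The connectivity requirement above can be checked from the candidate host before installing anything. The following is a minimal Python sketch using only the standard library. The host names and addresses in @CHECKS@ are placeholders borrowed from the examples on this page, and the Azure management endpoint is an assumption that applies only to an Azure deployment. A successful TCP connection shows only that the port is reachable, not that authentication will succeed.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


# Placeholder endpoints: replace with your own API host, provider API, and a worker VM.
CHECKS = {
    "Arvados API": ("zzzzz.arvadosapi.com", 443),
    "cloud provider API (Azure, assumed)": ("management.azure.com", 443),
    "worker VM SSH": ("10.25.64.8", 2222),
}


def report(checks=CHECKS):
    """Return a dict mapping each check name to True (reachable) or False."""
    return {name: can_connect(host, port) for name, (host, port) in checks.items()}
```

Calling @report()@ on the candidate host returns one boolean per endpoint; any @False@ entry points at a firewall, routing, or DNS problem to resolve before continuing.
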
h2. Prepare key pair and worker VM image

Generate an SSH private key with no passphrase. Save it in the cluster configuration file (see @PrivateKey@ in the example below).

If you are using Azure or EC2, the dispatcher will create a login account and install your public key automatically, so you do *not* need to save the corresponding public key in an authorized_keys file in the VM image (or anywhere else, for that matter).

Prepare a worker VM image. It needs docker, arv-mount (python-arvados-fuse), and crunch-run ≥ 1.3.1.20190221194156.

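The minimum crunch-run version above is safest to check numerically, component by component, because plain string comparison misorders versions such as 1.10 and 1.9. The following is a minimal Python sketch. It assumes you have already read the installed version string from the worker image (how you obtain it is outside this sketch) and that any packaging suffix such as @-1@ has been stripped. The helper names are hypothetical.

```python
# Minimum crunch-run version stated in this guide.
MIN_CRUNCH_RUN = "1.3.1.20190221194156"


def parse_version(version):
    """Split a dotted numeric version such as '1.3.1.20190221194156' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def crunch_run_is_new_enough(installed, minimum=MIN_CRUNCH_RUN):
    """Return True if the installed version is greater than or equal to the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # Illustrative version strings only; they are not taken from real packages.
    for candidate in ("1.3.0.20181221194156", "1.3.1.20190221194156", "1.3.3"):
        print(candidate, crunch_run_is_new_enough(candidate))
```

Tuple comparison stops at the first differing component, so a shorter release version such as 1.3.3 still compares as newer than 1.3.1.20190221194156.
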
h2. Update cluster configuration file

In @/etc/arvados/config.yml@, add configuration items for the dispatch service.

<pre><code class="yaml">
Clusters:
  zzzzz:
    ManagementToken: "example-secret-management-token"
    SystemRootToken: zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz # a superuser token
    Services:
      Controller:
        ExternalURL: https://zzzzz.arvadosapi.com
    TLS:
      #Insecure: true # uncomment to bypass TLS certificate verification
    CloudVMs:
      BootProbeCommand: "mount | grep /mnt/scratch"
      MaxCloudOpsPerSecond: 10
      SSHPort: "2222"
      SyncInterval: 1m
      TimeoutIdle: 2m
      TimeoutBooting: 10m
      TimeoutProbe: 5m
      TimeoutShutdown: 30s
      ImageID: "https://zzzzzzzz.blob.core.windows.net/system/Microsoft.Compute/Images/images/zzzzz-compute-osDisk.55555555-5555-5555-5555-555555555555.vhd"
      Driver: azure
      DriverParameters:
        SubscriptionID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        ClientID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        ClientSecret: 2WyXt0XFbEtutnf2hp528t6Wk9S5bOHWkRaaWwavKQo=
        TenantID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        CloudEnvironment: AzurePublicCloud
        ResourceGroup: zzzzz
        Location: centralus
        Network: zzzzz
        Subnet: zzzzz-subnet-private
        StorageAccount: example
        BlobContainer: vhds
        DeleteDanglingResourcesAfter: 20s
        AdminUsername: arvados
    Dispatch:
      PrivateKey: |
        -----BEGIN RSA PRIVATE KEY-----
        MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ
        0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E
        GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV
        mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7
        ...
        JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN
        ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI
        pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9
        1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ
        -----END RSA PRIVATE KEY-----
      StaleLockTimeout: 1m
      PollInterval: 10s
      ProbeInterval: 10s
      MaxProbesPerSecond: 10
      TimeoutSignal: 5s
      TimeoutTERM: 2m
    InstanceTypes:
      x1lg:
        ProviderType: x1.large
        VCPUs: 16
        RAM: 128GiB
        IncludedScratch: 128GiB
        Price: 1.23
    NodeProfiles:
      dispatcher: # references ARVADOS_NODE_PROFILE in environment file (see below).
        arvados-dispatch-cloud:
          Listen: ":9006"
</code></pre>

Create the host configuration file @/etc/arvados/environment@.

<pre>
ARVADOS_NODE_PROFILE=dispatcher
</pre>

h2. Stop crunch-dispatch-slurm

Stop and disable the crunch-dispatch-slurm service, and uninstall the package to make sure it doesn't start after the next reboot/upgrade.

<pre>
# systemctl stop crunch-dispatch-slurm
# systemctl disable crunch-dispatch-slurm
# apt-get remove crunch-dispatch-slurm
</pre>

Containers that have already been locked and submitted to SLURM will make their way through the SLURM queue, but newly queued containers will be left for arvados-dispatch-cloud to run.

h2. Install arvados-dispatch-cloud

<pre>
# apt-get install arvados-dispatch-cloud
</pre>

For now, the @ARVADOS_API_HOST@ and @ARVADOS_API_TOKEN@ environment variables must be provided (future versions will get these values from the config file). You can use @systemctl edit arvados-dispatch-cloud@ to add them as a systemd override:

<pre>
[Service]
Environment=ARVADOS_API_HOST=zzzzz.arvadosapi.com
Environment=ARVADOS_API_TOKEN=zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
</pre>

h2. Verify the service is running

Request the metrics endpoint on the dispatcher's @Listen@ port (@:9006@ in the example configuration), authenticating with the cluster's @ManagementToken@:

<pre>
$ token="example-secret-management-token"
$ curl -H "Authorization: Bearer $token" http://localhost:9006/metrics
</pre>

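The same check can be scripted, for example from a monitoring job. The following is a minimal Python sketch using only the standard library. The function names are hypothetical, the port defaults to 9006 to match the @Listen@ value in the example configuration, and the token shown in the comment is the placeholder @ManagementToken@ from that configuration.

```python
import urllib.request


def build_metrics_request(token, host="localhost", port=9006):
    """Build (but do not send) an authenticated request for the /metrics endpoint."""
    url = f"http://{host}:{port}/metrics"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_metrics(token, host="localhost", port=9006, timeout=5):
    """Send the request and return the response body as text.

    Raises urllib.error.URLError (or its subclass HTTPError) if the service
    is unreachable or rejects the token.
    """
    request = build_metrics_request(token, host, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call on the dispatcher host (placeholder token):
#   print(fetch_metrics("example-secret-management-token"))
```

Keeping request construction separate from sending makes it possible to inspect the URL and header without a running service.
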
h2. Verify the service is functional

Watch the dispatcher's logs while you run an Arvados container:

<pre>
# journalctl -ocat -fu arvados-dispatch-cloud
</pre>

Example logs:

<pre>
Starting Arvados dispatch cloud...
{"Listen":"[::]:9006","PID":46639,"Service":"arvados-dispatch-cloud","level":"info","msg":"listening","time":"2019-02-18T18:10:33.550358536Z"}
Started Arvados dispatch cloud.
{"PID":46639,"level":"info","msg":"FixStaleLocks starting.","time":"2019-02-18T18:10:33.706568502Z"}
{"PID":46639,"level":"info","msg":"FixStaleLocks finished (34.717µs), starting scheduling.","time":"2019-02-18T18:10:33.706606521Z"}
{"N":0,"PID":46639,"level":"info","msg":"loaded initial instance list","time":"2019-02-18T18:10:33.982989844Z"}
{"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","InstanceType":"Standard_D1_v2","PID":46639,"Priority":1124349393114703,"State":"Queued","level":"info","msg":"added container to queue","time":"2019-02-18T18:15:33.620474859Z"}
{"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","InstanceType":"Standard_D1_v2","PID":46639,"level":"info","msg":"creating new instance","time":"2019-02-18T18:15:33.711915757Z"}
{"Address":"10.25.64.8","IdleBehavior":"run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"State":"unknown","level":"info","msg":"instance appeared in cloud","time":"2019-02-18T18:16:34.512277597Z"}
{"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"dial tcp 10.25.64.8:2222: connect: connection refused","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:16:43.386626115Z"}
{"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"dial tcp 10.25.64.8:2222: connect: connection refused","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:16:53.381814784Z"}
{"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"Process exited with status 1","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:17:04.897049820Z"}
{"Address":"10.25.64.8","Command":"/bin/ls /arvados-compute-node-boot.complete \u003e/dev/null 2\u003e\u00261 \u0026\u0026 sudo wget --quiet https://c97qk.arvadosapi.com/crunch-run --output-document=/usr/bin/crunch-run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2019-02-18T18:17:33.866226306Z"}
{"Address":"10.25.64.8","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:17:33.357445643Z","level":"info","msg":"instance booted; will try probeRunning","time":"2019-02-18T18:17:33.866286736Z"}
{"Address":"10.25.64.8","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:17:33.357445643Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2019-02-18T18:17:33.886177977Z"}
{"Address":"10.25.64.8","ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"Priority":1124349393114703,"level":"info","msg":"crunch-run process started","time":"2019-02-18T18:17:33.903776572Z"}
{"Address":"10.25.64.8","ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:19:03.315677365Z","level":"info","msg":"crunch-run process ended","time":"2019-02-18T18:19:03.334773308Z"}
{"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","PID":46639,"State":"Complete","level":"info","msg":"dropped container from queue","time":"2019-02-18T18:19:13.500015828Z"}
{"Address":"10.25.64.8","Age":129980928971,"IdleBehavior":"run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"State":"idle","level":"info","msg":"shutdown idle worker","time":"2019-02-18T18:21:13.255492437Z"}
{"PID":46639,"level":"info","msg":"Will delete compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-nic because it is older than 20s","time":"2019-02-18T18:22:35.044805153Z"}
{"Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","PID":46639,"WorkerState":"shutdown","level":"info","msg":"instance disappeared in cloud","time":"2019-02-18T18:22:35.086986833Z"}
{"PID":46639,"level":"info","msg":"Deleted NIC compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-nic","time":"2019-02-18T18:22:45.273921501Z"}
{"PID":46639,"level":"info","msg":"Blob compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-os.vhd is unlocked and not modified for 209.176892224 seconds, will delete","time":"2019-02-18T18:25:33.188314532Z"}
{"PID":46639,"level":"info","msg":"Deleted blob compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-os.vhd","time":"2019-02-18T18:25:33.194356552Z"}
</pre>
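
Because each dispatcher log entry is a JSON object with @msg@ and @time@ fields, the lifecycle shown above can be timed with a short script. The following is a minimal Python sketch. The function names are hypothetical, and @SAMPLE@ holds abbreviated copies of entries from the example logs, keeping only the fields the script reads.

```python
import json
from datetime import datetime


def parse_time(stamp):
    """Parse a UTC timestamp with nanoseconds, e.g. 2019-02-18T18:15:33.711915757Z."""
    whole, _, fraction = stamp.rstrip("Z").partition(".")
    micros = (fraction + "000000")[:6]  # datetime stores microseconds only
    return datetime.strptime(f"{whole}.{micros}", "%Y-%m-%dT%H:%M:%S.%f")


def first_event_times(lines):
    """Map each log message to the time it first appeared, skipping non-JSON lines."""
    seen = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # systemd status lines such as "Started Arvados dispatch cloud."
        entry = json.loads(line)
        seen.setdefault(entry["msg"], parse_time(entry["time"]))
    return seen


def seconds_between(seen, start_msg, end_msg):
    """Return the elapsed seconds between the first occurrences of two messages."""
    return (seen[end_msg] - seen[start_msg]).total_seconds()


# One systemd status line plus four abbreviated entries from the example logs.
SAMPLE = [
    'Starting Arvados dispatch cloud...',
    '{"level":"info","msg":"creating new instance","time":"2019-02-18T18:15:33.711915757Z"}',
    '{"level":"info","msg":"probes succeeded, instance is in service","time":"2019-02-18T18:17:33.886177977Z"}',
    '{"level":"info","msg":"crunch-run process ended","time":"2019-02-18T18:19:03.334773308Z"}',
    '{"level":"info","msg":"shutdown idle worker","time":"2019-02-18T18:21:13.255492437Z"}',
]

if __name__ == "__main__":
    seen = first_event_times(SAMPLE)
    boot = seconds_between(seen, "creating new instance", "probes succeeded, instance is in service")
    idle = seconds_between(seen, "crunch-run process ended", "shutdown idle worker")
    print(f"instance ready after {boot:.1f}s; shut down {idle:.1f}s after crunch-run ended")
```

For the example logs this gives roughly 120 seconds from instance creation to an in-service worker, and roughly 130 seconds between the end of crunch-run and the idle shutdown. To analyze real logs, pass the lines of @journalctl -ocat -u arvados-dispatch-cloud@ output to @first_event_times@ in place of @SAMPLE@.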