Migrating from arvados-node-manager to arvados-dispatch-cloud » History » Version 7
Ward Vandewege, 02/12/2019 04:34 PM
h1. Migrating from arvados-node-manager to arvados-dispatch-cloud

{{toc}}

h2. Choose a node

The dispatch service can run on any host that can connect to the Arvados API service, the cloud provider's API, and the SSH service on cloud VMs. In the following example, it runs on the same node as the API server and controller.

h2. Prepare key pair and worker VM image

Generate an SSH key pair.

Save the public key in @/root/.ssh/authorized_keys@ in the worker VM image.

Save the private key in the cluster configuration file (see @PrivateKey@ in the example below).

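
The key pair step can be sketched as follows. This is a minimal example, not the only valid approach: the temporary directory, the @dispatch-key@ file name, the key comment, and the empty passphrase are all assumptions made for illustration. The @-m PEM@ option asks @ssh-keygen@ to write the private key in the @BEGIN RSA PRIVATE KEY@ format that appears in the configuration example.

```shell
# Hypothetical working directory and file name; adjust to your environment.
keydir="$(mktemp -d)"
keyfile="$keydir/dispatch-key"

# Generate an RSA key pair (key size left at the ssh-keygen default).
# -m PEM  : write the private key in "BEGIN RSA PRIVATE KEY" format
# -N ""   : empty passphrase (an assumption made for this sketch)
ssh-keygen -q -t rsa -m PEM -N "" -C "arvados-dispatch-cloud" -f "$keyfile"

# Public half: this line goes into /root/.ssh/authorized_keys in the worker VM image.
cat "$keyfile.pub"

# Private half: this file's contents go under PrivateKey in /etc/arvados/config.yml.
head -n 1 "$keyfile"
```

Remove the temporary directory once the private key has been copied into the configuration file.
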
h2. Update cluster configuration file

In @/etc/arvados/config.yml@, add configuration items for the dispatch service.

<pre><code class="yaml">
Clusters:
  uuid_prefix:
    CloudVMs:
      BootProbeCommand: "mount | grep /mnt/scratch"
      SSHPort: "2222"
      SyncInterval: 1m
      TimeoutIdle: 2m
      TimeoutBooting: 10m
      TimeoutProbe: 5m
      TimeoutShutdown: 30s
      ImageID: "image-12345678"
      Driver: azure
      DriverParameters:
        SubscriptionID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        subscription_id: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX # not needed after #14745
        ClientID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        key: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX # not needed after #14745 (same value as ClientID)
        ClientSecret: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        secret: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX # not needed after #14745 (same value as ClientSecret)
        TenantID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        tenant_id: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX # not needed after #14745
        CloudEnv: AzurePublicCloud
        cloud_environment: AzurePublicCloud # not needed after #14745
        ResourceGroup: zzzzz
        resource_group: zzzzz # not needed after #14745
        Location: centralus
        region: centralus # not needed after #14745 (same value as Location)
        Network: zzzzz
        Subnet: zzzzz-subnet-private
        StorageAccount: example
        storage_account: example # not needed after #14745
        BlobContainer: vhds
        blob_container: vhds # not needed after #14745
        DeleteDanglingResourcesAfter: 20
        delete_dangling_resources_after: 20 # not needed after #14745
    Dispatch:
      PrivateKey: |
        -----BEGIN RSA PRIVATE KEY-----
        MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ
        0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E
        GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV
        mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7
        ...
        JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN
        ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI
        pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9
        1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ
        -----END RSA PRIVATE KEY-----
      StaleLockTimeout: 1m
      PollInterval: 10s
      ProbeInterval: 10s
      MaxProbesPerSecond: 10
    InstanceTypes:
      x1lg:
        ProviderType: x1.large
        VCPUs: 16
        RAM: 128G
        Scratch: 128G
        Price: 1.23
    ManagementToken: "example-secret-management-token"
    NodeProfiles:
      apiserver: # references ARVADOS_NODE_PROFILE in environment file (see below).
        arvados-dispatch-cloud:
          Listen: ":9005"
</code></pre>

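
Because @PrivateKey@ is written as a YAML block scalar (the @|@ after the key name), every line of the private key must be indented more deeply than the @PrivateKey:@ line itself. The sketch below shows one way to prepare the key text for pasting. It assumes two-space indentation with @PrivateKey:@ sitting at six spaces, so the key lines get eight; change the prefix if your file is laid out differently. The key file used here is a stand-in with placeholder contents so that the example runs on its own.

```shell
# Stand-in key file with placeholder contents (not real key material).
# In practice, point keyfile at the private key generated earlier.
keyfile="$(mktemp)"
printf '%s\n' \
  '-----BEGIN RSA PRIVATE KEY-----' \
  'PLACEHOLDER-KEY-DATA' \
  '-----END RSA PRIVATE KEY-----' > "$keyfile"

# Prefix every line with eight spaces (assumed depth of the block scalar body).
indented="$(sed 's/^/        /' "$keyfile")"

# Print the indented text, ready to paste under "PrivateKey: |".
printf '%s\n' "$indented"
```
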
Create the host configuration file @/etc/arvados/environment@.

<pre>
ARVADOS_NODE_PROFILE=apiserver
</pre>

h2. Stop crunch-dispatch-slurm

Stop and disable the crunch-dispatch-slurm service, then uninstall the package so that it cannot start again after the next reboot or upgrade.

<pre>
# systemctl stop crunch-dispatch-slurm
# systemctl disable crunch-dispatch-slurm
# apt-get remove crunch-dispatch-slurm
</pre>

Containers that have already been locked and submitted to SLURM will make their way through the SLURM queue, but newly queued containers will be left for arvados-dispatch-cloud to run.

h2. Install arvados-dispatch-cloud

<pre>
# apt-get install arvados-dispatch-cloud
</pre>

h2. Verify the service is running

<pre>
$ token="example-secret-management-token"
$ curl -H "Authorization: Bearer $token" http://localhost:9005/metrics
</pre>

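
Right after installation the service may need a moment before it answers requests, so a single @curl@ call can fail even when nothing is wrong. The following is a small sketch of a retry loop around the same request. The function name @wait_for_metrics@, the default of 30 attempts, and the 2-second delay are hypothetical choices for this example, not part of arvados-dispatch-cloud.

```shell
# wait_for_metrics URL TOKEN [TRIES] [DELAY_SECONDS]
# Polls the metrics endpoint until it returns a successful HTTP response.
# Returns 0 on success, 1 if every attempt fails.
wait_for_metrics() {
  url="$1"
  token="$2"
  tries="${3:-30}"
  delay="${4:-2}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl exit non-zero on HTTP error statuses such as 401 or 403.
    if curl -fsS -o /dev/null -H "Authorization: Bearer $token" "$url" 2>/dev/null; then
      echo "metrics endpoint is responding"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "metrics endpoint did not respond after $tries attempts" >&2
  return 1
}

# Example call, using the token and port from the configuration example:
# wait_for_metrics http://localhost:9005/metrics "example-secret-management-token"
```

A non-zero return after all attempts suggests checking the @Listen@ address, the @ManagementToken@ value, and the service logs.
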
h2. Verify the service is functional

Watch the dispatcher's logs while you run an Arvados container:

<pre>
# journalctl -ocat -fu arvados-dispatch-cloud
</pre>