Migrating from arvados-node-manager to arvados-dispatch-cloud » History » Version 16

Tom Clegg, 03/19/2019 03:11 PM

h1. Migrating from arvados-node-manager to arvados-dispatch-cloud

{{toc}}

h2. Choose a node

The dispatch service can run on any host that can connect to the Arvados API service, the cloud provider's API, and the SSH service on cloud VMs. In the following example it runs on the same node as the API server and controller.

h2. Prepare key pair and worker VM image

Generate an SSH private key with no passphrase. Save it in the cluster configuration file (see @PrivateKey@ in the example below).
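
As a rough sketch of this step, the helpers below generate a passphrase-less RSA key with @ssh-keygen@ and indent the private key so it can be pasted under @PrivateKey: |@. The function names, the 4096-bit key size, and the 8-space indent (chosen to match the example configuration on this page) are assumptions for illustration, not requirements stated here.

```shell
# Generate an RSA key pair with an empty passphrase (-N "").
# $1 = output path for the private key; ssh-keygen writes the public key to "$1.pub".
generate_dispatch_key() {
    ssh-keygen -t rsa -b 4096 -N "" -q -f "$1"
}

# Indent every line of a key file by 8 spaces so it lines up
# under "PrivateKey: |" in the example config.
# $1 = path to the private key file.
indent_key_for_yaml() {
    sed 's/^/        /' "$1"
}
```

For example, @generate_dispatch_key ./dispatcher_key@ followed by @indent_key_for_yaml ./dispatcher_key@ prints the key text ready to paste into the configuration file.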

If you are using Azure, the dispatcher will create a login account and install your public key automatically, so you do *not* need to save the corresponding public key in an @authorized_keys@ file in the VM image (or anywhere else).

Prepare a worker VM image. It needs Docker, arv-mount (python-arvados-fuse), and crunch-run version 1.3.1.20190221194156 or later.

h2. Update cluster configuration file

In @/etc/arvados/config.yml@, add configuration items for the dispatch service.

<pre><code class="yaml">
Clusters:
  zzzzz:
    CloudVMs:
      BootProbeCommand: "mount | grep /mnt/scratch"
      MaxCloudOpsPerSecond: 10
      SSHPort: "2222"
      SyncInterval: 1m
      TimeoutIdle: 2m
      TimeoutBooting: 10m
      TimeoutProbe: 5m
      TimeoutShutdown: 30s
      ImageID: "https://zzzzzzzz.blob.core.windows.net/system/Microsoft.Compute/Images/images/zzzzz-compute-osDisk.55555555-5555-5555-5555-555555555555.vhd"
      Driver: azure
      DriverParameters:
        SubscriptionID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        ClientID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        ClientSecret: 2WyXt0XFbEtutnf2hp528t6Wk9S5bOHWkRaaWwavKQo=
        TenantID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        CloudEnvironment: AzurePublicCloud
        ResourceGroup: zzzzz
        Location: centralus
        Network: zzzzz
        Subnet: zzzzz-subnet-private
        StorageAccount: example
        BlobContainer: vhds
        DeleteDanglingResourcesAfter: 20s
        AdminUsername: arvados
    Dispatch:
      PrivateKey: |
        -----BEGIN RSA PRIVATE KEY-----
        MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ
        0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E
        GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV
        mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7
        ...
        JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN
        ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI
        pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9
        1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ
        -----END RSA PRIVATE KEY-----
      StaleLockTimeout: 1m
      PollInterval: 10s
      ProbeInterval: 10s
      MaxProbesPerSecond: 10
      TimeoutSignal: 5s
      TimeoutTERM: 2m
      TimeoutKILL: 20s
    InstanceTypes:
      x1lg:
        ProviderType: x1.large
        VCPUs: 16
        RAM: 128GiB
        IncludedScratch: 128GiB
        Price: 1.23
    ManagementToken: "example-secret-management-token"
    NodeProfiles:
      dispatcher:                       # references ARVADOS_NODE_PROFILE in environment file (see below).
        arvados-dispatch-cloud:
          Listen: ":9006"
</code></pre>
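
To check by hand that a worker would pass the boot probe, something like the following could be run from the dispatcher host. It is only a sketch: it assumes the probe is an SSH login using the @SSHPort@, @AdminUsername@, and @BootProbeCommand@ values from the example above, and the function name, key path, and worker address are hypothetical.

```shell
# Run the example BootProbeCommand on one worker over SSH.
# $1 = worker address, $2 = path to the dispatcher's private key file.
# Host key checking is disabled here on the assumption that worker VMs
# are short-lived and their host keys are not known in advance.
manual_boot_probe() {
    ssh -i "$2" -p 2222 \
        -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
        "arvados@$1" 'mount | grep /mnt/scratch'
}
```

An exit status of 0 from @manual_boot_probe 10.25.64.8 ./dispatcher_key@ would mean the scratch mount is present; any other status suggests the worker is not ready yet or the connection failed.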

Create the host configuration file @/etc/arvados/environment@.

<pre>
ARVADOS_NODE_PROFILE=dispatcher
</pre>
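
The profile name in the environment file has to match a key under @NodeProfiles@ in the cluster configuration. A minimal sketch of writing the file and cross-checking the name follows; both function names are hypothetical, and the check is a plain text match rather than a real YAML parse.

```shell
# Write the node profile name to an environment file.
# $1 = profile name, $2 = path of the environment file.
write_node_profile() {
    printf 'ARVADOS_NODE_PROFILE=%s\n' "$1" > "$2"
}

# Succeed if the profile named in the environment file ($1) appears as an
# indented "name:" key somewhere in the cluster config file ($2).
profile_is_defined() {
    profile=$(sed -n 's/^ARVADOS_NODE_PROFILE=//p' "$1")
    [ -n "$profile" ] && grep -q "^ *${profile}:" "$2"
}
```

For example, @write_node_profile dispatcher /etc/arvados/environment@ followed by @profile_is_defined /etc/arvados/environment /etc/arvados/config.yml@.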

h2. Stop crunch-dispatch-slurm

Stop and disable the crunch-dispatch-slurm service, and uninstall the package to make sure it doesn't start after the next reboot/upgrade.

<pre>
# systemctl stop crunch-dispatch-slurm
# systemctl disable crunch-dispatch-slurm
# apt-get remove crunch-dispatch-slurm
</pre>

Containers that have already been locked and submitted to SLURM will make their way through the SLURM queue, but newly queued containers will be left for arvados-dispatch-cloud to run.
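
If you want to wait for the remaining SLURM work to finish before retiring the old setup, a polling loop along these lines may help. It is a sketch that assumes every job shown by @squeue@ is an Arvados container; the function names and the 30-second default interval are illustrative.

```shell
# Count jobs still in the SLURM queue
# (squeue --noheader prints one line per job, with no header row).
slurm_jobs_remaining() {
    squeue --noheader | wc -l | tr -d ' '
}

# Poll until the queue is empty.
# $1 = seconds to sleep between polls (defaults to 30).
wait_for_slurm_drain() {
    while [ "$(slurm_jobs_remaining)" -gt 0 ]; do
        sleep "${1:-30}"
    done
}
```

Running @wait_for_slurm_drain 60@ returns once @squeue@ reports no jobs.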

h2. Install arvados-dispatch-cloud

<pre>
# apt-get install arvados-dispatch-cloud
</pre>

For now, @ARVADOS_API_HOST@ and @ARVADOS_API_TOKEN@ environment variables must be provided (future versions will get these values from the config file). You can use @systemctl edit@ to do this through systemd:

<pre>
[Service]
Environment=ARVADOS_API_HOST=zzzzz.arvadosapi.com
Environment=ARVADOS_API_TOKEN=zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
</pre>
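
@systemctl edit@ saves an override like this as a drop-in file. A non-interactive sketch of the same idea follows; it assumes the unit is named @arvados-dispatch-cloud.service@ and that the standard systemd drop-in directory is used, and the function name is hypothetical.

```shell
# Write a systemd drop-in that sets the two API variables.
# $1 = drop-in directory, $2 = API host, $3 = API token.
# The file is written under umask 077 so the token is readable only by its owner.
write_api_env_dropin() {
    mkdir -p "$1" &&
    ( umask 077 &&
      printf '[Service]\nEnvironment=ARVADOS_API_HOST=%s\nEnvironment=ARVADOS_API_TOKEN=%s\n' \
          "$2" "$3" > "$1/override.conf" )
}
```

For example, @write_api_env_dropin /etc/systemd/system/arvados-dispatch-cloud.service.d zzzzz.arvadosapi.com "$token"@, followed by @systemctl daemon-reload@ and @systemctl restart arvados-dispatch-cloud@ so the new environment takes effect.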

h2. Verify the service is running

<pre>
$ token="example-secret-management-token"
$ curl -H "Authorization: Bearer $token" http://localhost:9006/metrics
</pre>
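
For scripted checks, the same request can be reduced to an HTTP status code. This is a sketch: the function name is hypothetical, the port should be whatever the @Listen@ setting in the cluster configuration specifies, and treating 200 as "healthy" is an assumption.

```shell
# Print the HTTP status code returned by the metrics endpoint.
# $1 = management token, $2 = port from the Listen setting.
metrics_status() {
    curl --silent --output /dev/null --write-out '%{http_code}' \
        -H "Authorization: Bearer $1" "http://localhost:$2/metrics"
}
```

For example, @[ "$(metrics_status "$token" 9006)" = 200 ] && echo ok@.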

h2. Verify the service is functional

Watch the dispatcher's logs while you run an Arvados container:

<pre>
# journalctl -ocat -fu arvados-dispatch-cloud
</pre>

Example logs:

<pre>
Starting Arvados dispatch cloud...
{"Listen":"[::]:9006","PID":46639,"Service":"arvados-dispatch-cloud","level":"info","msg":"listening","time":"2019-02-18T18:10:33.550358536Z"}
Started Arvados dispatch cloud.
{"PID":46639,"level":"info","msg":"FixStaleLocks starting.","time":"2019-02-18T18:10:33.706568502Z"}
{"PID":46639,"level":"info","msg":"FixStaleLocks finished (34.717µs), starting scheduling.","time":"2019-02-18T18:10:33.706606521Z"}
{"N":0,"PID":46639,"level":"info","msg":"loaded initial instance list","time":"2019-02-18T18:10:33.982989844Z"}
{"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","InstanceType":"Standard_D1_v2","PID":46639,"Priority":1124349393114703,"State":"Queued","level":"info","msg":"added container to queue","time":"2019-02-18T18:15:33.620474859Z"}
{"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","InstanceType":"Standard_D1_v2","PID":46639,"level":"info","msg":"creating new instance","time":"2019-02-18T18:15:33.711915757Z"}
{"Address":"10.25.64.8","IdleBehavior":"run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"State":"unknown","level":"info","msg":"instance appeared in cloud","time":"2019-02-18T18:16:34.512277597Z"}
{"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"dial tcp 10.25.64.8:2222: connect: connection refused","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:16:43.386626115Z"}
{"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"dial tcp 10.25.64.8:2222: connect: connection refused","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:16:53.381814784Z"}
{"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"Process exited with status 1","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:17:04.897049820Z"}
{"Address":"10.25.64.8","Command":"/bin/ls /arvados-compute-node-boot.complete  \u003e/dev/null 2\u003e\u00261 \u0026\u0026 sudo wget --quiet https://c97qk.arvadosapi.com/crunch-run --output-document=/usr/bin/crunch-run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2019-02-18T18:17:33.866226306Z"}
{"Address":"10.25.64.8","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:17:33.357445643Z","level":"info","msg":"instance booted; will try probeRunning","time":"2019-02-18T18:17:33.866286736Z"}
{"Address":"10.25.64.8","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:17:33.357445643Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2019-02-18T18:17:33.886177977Z"}
{"Address":"10.25.64.8","ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"Priority":1124349393114703,"level":"info","msg":"crunch-run process started","time":"2019-02-18T18:17:33.903776572Z"}
{"Address":"10.25.64.8","ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:19:03.315677365Z","level":"info","msg":"crunch-run process ended","time":"2019-02-18T18:19:03.334773308Z"}
{"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","PID":46639,"State":"Complete","level":"info","msg":"dropped container from queue","time":"2019-02-18T18:19:13.500015828Z"}
{"Address":"10.25.64.8","Age":129980928971,"IdleBehavior":"run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"State":"idle","level":"info","msg":"shutdown idle worker","time":"2019-02-18T18:21:13.255492437Z"}
{"PID":46639,"level":"info","msg":"Will delete compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-nic because it is older than 20s","time":"2019-02-18T18:22:35.044805153Z"}
{"Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","PID":46639,"WorkerState":"shutdown","level":"info","msg":"instance disappeared in cloud","time":"2019-02-18T18:22:35.086986833Z"}
{"PID":46639,"level":"info","msg":"Deleted NIC compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-nic","time":"2019-02-18T18:22:45.273921501Z"}
{"PID":46639,"level":"info","msg":"Blob compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-os.vhd is unlocked and not modified for 209.176892224 seconds, will delete","time":"2019-02-18T18:25:33.188314532Z"}
{"PID":46639,"level":"info","msg":"Deleted blob compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-os.vhd","time":"2019-02-18T18:25:33.194356552Z"}
</pre>
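
The JSON lines are long, so it can help to reduce them to a timestamp and message for one container. The filters below are a sketch using only @grep@ and @sed@; the function names are hypothetical, and the @sed@ pattern assumes that @"msg"@ comes before @"time"@ on each line and that messages contain no escaped quotes, as in the example logs above.

```shell
# Keep only log lines that mention one container UUID ($1).
# Reads log lines on stdin.
filter_container() {
    grep -F "\"ContainerUUID\":\"$1\""
}

# Print "time msg" for each JSON log line on stdin;
# lines without both fields (such as systemd start messages) are dropped.
summarize_dispatch_log() {
    sed -n 's/.*"msg":"\([^"]*\)".*"time":"\([^"]*\)".*/\2 \1/p'
}
```

For example, @journalctl -ocat -u arvados-dispatch-cloud | filter_container c97qk-dz642-qxo0qjp93y2k4ht | summarize_dispatch_log@ shows the life cycle of that container from "added container to queue" to "dropped container from queue".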