Migrating from arvados-node-manager to arvados-dispatch-cloud » History » Version 17

Tom Clegg, 03/19/2019 06:41 PM

h1. Migrating from arvados-node-manager to arvados-dispatch-cloud

{{toc}}

h2. Choose a node

The dispatch service can run on any host that can connect to the Arvados API service, the cloud provider's API, and the SSH service on cloud VMs. In the following example it runs on the same node as the API server and controller.

h2. Prepare key pair and worker VM image

Generate an SSH private key with no passphrase. Save it in the cluster configuration file (see @PrivateKey@ in the example below).
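
A minimal sketch of this step, assuming OpenSSH's @ssh-keygen@ is installed. The function names and the key comment are hypothetical, not part of Arvados. The 8-space indent matches the nesting of @PrivateKey@ in the example configuration; @-m PEM@ is intended to produce the same @BEGIN RSA PRIVATE KEY@ header as that example on recent OpenSSH versions.

```shell
# Generate an RSA key pair in PEM format with an empty passphrase (-N "").
# $1 is the output path; the public half is written to "$1.pub".
generate_dispatch_key() {
    ssh-keygen -q -t rsa -b 4096 -m PEM -N "" -C "arvados-dispatch-cloud" -f "$1"
}

# Indent each line of the private key file ($1) by 8 spaces so the output
# can be pasted under "PrivateKey: |" in the cluster configuration file.
indent_private_key() {
    sed 's/^/        /' "$1"
}
```

For example, @generate_dispatch_key ./dispatch-key && indent_private_key ./dispatch-key@ prints the key ready to paste. If the output path already exists, @ssh-keygen@ asks before overwriting it.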

If you are using Azure, the dispatcher will create a login account and install your public key automatically, so you do *not* need to save the corresponding public key in an authorized_keys file in the VM image (or anywhere else, for that matter).

Prepare a worker VM image. It needs docker, arv-mount (python-arvados-fuse), and crunch-run ≥ 1.3.1.20190221194156.
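
One way to sanity-check an image before saving it is to confirm that the required commands are on @PATH@. The helper below is a hypothetical sketch, not an Arvados tool, and it only checks presence: it does not verify the crunch-run version.

```shell
# Print the name of every listed command that is not found on PATH.
# Returns non-zero if at least one command is missing.
missing_commands() {
    rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            rc=1
        fi
    done
    return "$rc"
}
```

Run on the worker VM, @missing_commands docker arv-mount crunch-run@ prints nothing and returns 0 when all three are installed.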

h2. Update cluster configuration file

In @/etc/arvados/config.yml@, add configuration items for the dispatch service.

<pre><code class="yaml">
Clusters:
  zzzzz:
    CloudVMs:
      BootProbeCommand: "mount | grep /mnt/scratch"
      MaxCloudOpsPerSecond: 10
      SSHPort: "2222"
      SyncInterval: 1m
      TimeoutIdle: 2m
      TimeoutBooting: 10m
      TimeoutProbe: 5m
      TimeoutShutdown: 30s
      ImageID: "https://zzzzzzzz.blob.core.windows.net/system/Microsoft.Compute/Images/images/zzzzz-compute-osDisk.55555555-5555-5555-5555-555555555555.vhd"
      Driver: azure
      DriverParameters:
        SubscriptionID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        ClientID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        ClientSecret: 2WyXt0XFbEtutnf2hp528t6Wk9S5bOHWkRaaWwavKQo=
        TenantID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
        CloudEnvironment: AzurePublicCloud
        ResourceGroup: zzzzz
        Location: centralus
        Network: zzzzz
        Subnet: zzzzz-subnet-private
        StorageAccount: example
        BlobContainer: vhds
        DeleteDanglingResourcesAfter: 20s
        AdminUsername: arvados
    Dispatch:
      PrivateKey: |
        -----BEGIN RSA PRIVATE KEY-----
        MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ
        0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E
        GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV
        mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7
        ...
        JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN
        ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI
        pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9
        1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ
        -----END RSA PRIVATE KEY-----
      StaleLockTimeout: 1m
      PollInterval: 10s
      ProbeInterval: 10s
      MaxProbesPerSecond: 10
      TimeoutSignal: 5s
      TimeoutTERM: 2m
    InstanceTypes:
      x1lg:
        ProviderType: x1.large
        VCPUs: 16
        RAM: 128GiB
        IncludedScratch: 128GiB
        Price: 1.23
    ManagementToken: "example-secret-management-token"
    NodeProfiles:
      dispatcher:                       # references ARVADOS_NODE_PROFILE in environment file (see below).
        arvados-dispatch-cloud:
          Listen: ":9006"
</code></pre>

Create the host configuration file @/etc/arvados/environment@.

<pre>
ARVADOS_NODE_PROFILE=dispatcher
</pre>

h2. Stop crunch-dispatch-slurm

Stop and disable the crunch-dispatch-slurm service, and uninstall the package to make sure it doesn't start after the next reboot/upgrade.

<pre>
# systemctl stop crunch-dispatch-slurm
# systemctl disable crunch-dispatch-slurm
# apt-get remove crunch-dispatch-slurm
</pre>

Containers that have already been locked and submitted to SLURM will make their way through the SLURM queue, but newly queued containers will be left for arvados-dispatch-cloud to run.

h2. Install arvados-dispatch-cloud

<pre>
# apt-get install arvados-dispatch-cloud
</pre>

For now, @ARVADOS_API_HOST@ and @ARVADOS_API_TOKEN@ environment variables must be provided (future versions will get these values from the config file). You can use @systemctl edit@ to do this through systemd:

<pre>
[Service]
Environment=ARVADOS_API_HOST=zzzzz.arvadosapi.com
Environment=ARVADOS_API_TOKEN=zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
</pre>
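
@systemctl edit@ stores these settings in a drop-in file, so the same result can be scripted. The sketch below assumes the unit is named @arvados-dispatch-cloud.service@ (inferred from the @journalctl@ command used later on this page) and that the drop-in lives in the usual systemd location @/etc/systemd/system/arvados-dispatch-cloud.service.d/@. The function name is hypothetical.

```shell
# Write a systemd drop-in file that sets the two environment variables.
# $1: drop-in directory, $2: API host, $3: API token.
write_dispatch_override() {
    mkdir -p "$1" &&
    printf '[Service]\nEnvironment=ARVADOS_API_HOST=%s\nEnvironment=ARVADOS_API_TOKEN=%s\n' \
        "$2" "$3" > "$1/override.conf"
}
```

As root, call @write_dispatch_override /etc/systemd/system/arvados-dispatch-cloud.service.d zzzzz.arvadosapi.com "$api_token"@, then run @systemctl daemon-reload@ and @systemctl restart arvados-dispatch-cloud@ so the new environment takes effect.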

h2. Verify the service is running

<pre>
$ token="example-secret-management-token"
$ curl -H "Authorization: Bearer $token" http://localhost:9006/metrics
</pre>
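
For scripted checks it can be easier to look only at the HTTP status code. This is a sketch using standard @curl@ options; the function name is hypothetical, and the port in the usage example is the one from the @Listen@ setting in the example configuration.

```shell
# Request the metrics endpoint and print only the HTTP status code.
# $1: base URL of the dispatcher, $2: management token.
# curl prints 000 when no HTTP response was received at all.
metrics_status() {
    curl -s -o /dev/null -w '%{http_code}' \
        -H "Authorization: Bearer $2" "$1/metrics"
}
```

For example, @metrics_status http://localhost:9006 "$token"@ printing @200@ means the service answered and accepted the token.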

h2. Verify the service is functional

Watch the dispatcher's logs while you run an Arvados container:

<pre>
# journalctl -ocat -fu arvados-dispatch-cloud
</pre>

Example logs:

<pre>
Starting Arvados dispatch cloud...
{"Listen":"[::]:9006","PID":46639,"Service":"arvados-dispatch-cloud","level":"info","msg":"listening","time":"2019-02-18T18:10:33.550358536Z"}
Started Arvados dispatch cloud.
{"PID":46639,"level":"info","msg":"FixStaleLocks starting.","time":"2019-02-18T18:10:33.706568502Z"}
{"PID":46639,"level":"info","msg":"FixStaleLocks finished (34.717µs), starting scheduling.","time":"2019-02-18T18:10:33.706606521Z"}
{"N":0,"PID":46639,"level":"info","msg":"loaded initial instance list","time":"2019-02-18T18:10:33.982989844Z"}
{"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","InstanceType":"Standard_D1_v2","PID":46639,"Priority":1124349393114703,"State":"Queued","level":"info","msg":"added container to queue","time":"2019-02-18T18:15:33.620474859Z"}
{"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","InstanceType":"Standard_D1_v2","PID":46639,"level":"info","msg":"creating new instance","time":"2019-02-18T18:15:33.711915757Z"}
{"Address":"10.25.64.8","IdleBehavior":"run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"State":"unknown","level":"info","msg":"instance appeared in cloud","time":"2019-02-18T18:16:34.512277597Z"}
{"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"dial tcp 10.25.64.8:2222: connect: connection refused","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:16:43.386626115Z"}
{"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"dial tcp 10.25.64.8:2222: connect: connection refused","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:16:53.381814784Z"}
{"Address":"10.25.64.8","Command":"sudo crunch-run --list","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"error":"Process exited with status 1","level":"warning","msg":"probe failed","stderr":"","stdout":"","time":"2019-02-18T18:17:04.897049820Z"}
{"Address":"10.25.64.8","Command":"/bin/ls /arvados-compute-node-boot.complete  \u003e/dev/null 2\u003e\u00261 \u0026\u0026 sudo wget --quiet https://c97qk.arvadosapi.com/crunch-run --output-document=/usr/bin/crunch-run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"level":"info","msg":"boot probe succeeded","stderr":"","stdout":"","time":"2019-02-18T18:17:33.866226306Z"}
{"Address":"10.25.64.8","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:17:33.357445643Z","level":"info","msg":"instance booted; will try probeRunning","time":"2019-02-18T18:17:33.866286736Z"}
{"Address":"10.25.64.8","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:17:33.357445643Z","RunningContainers":0,"State":"idle","level":"info","msg":"probes succeeded, instance is in service","time":"2019-02-18T18:17:33.886177977Z"}
{"Address":"10.25.64.8","ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"Priority":1124349393114703,"level":"info","msg":"crunch-run process started","time":"2019-02-18T18:17:33.903776572Z"}
{"Address":"10.25.64.8","ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"ProbeStart":"2019-02-18T18:19:03.315677365Z","level":"info","msg":"crunch-run process ended","time":"2019-02-18T18:19:03.334773308Z"}
{"ContainerUUID":"c97qk-dz642-qxo0qjp93y2k4ht","PID":46639,"State":"Complete","level":"info","msg":"dropped container from queue","time":"2019-02-18T18:19:13.500015828Z"}
{"Address":"10.25.64.8","Age":129980928971,"IdleBehavior":"run","Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","InstanceType":"Standard_D1_v2","PID":46639,"State":"idle","level":"info","msg":"shutdown idle worker","time":"2019-02-18T18:21:13.255492437Z"}
{"PID":46639,"level":"info","msg":"Will delete compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-nic because it is older than 20s","time":"2019-02-18T18:22:35.044805153Z"}
{"Instance":"/subscriptions/zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq","PID":46639,"WorkerState":"shutdown","level":"info","msg":"instance disappeared in cloud","time":"2019-02-18T18:22:35.086986833Z"}
{"PID":46639,"level":"info","msg":"Deleted NIC compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-nic","time":"2019-02-18T18:22:45.273921501Z"}
{"PID":46639,"level":"info","msg":"Blob compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-os.vhd is unlocked and not modified for 209.176892224 seconds, will delete","time":"2019-02-18T18:25:33.188314532Z"}
{"PID":46639,"level":"info","msg":"Deleted blob compute-5656b5905b6e0c2d20ae4145148c6b31-7qs8dk3yfw0hteq-os.vhd","time":"2019-02-18T18:25:33.194356552Z"}
</pre>
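
Because each dispatcher event is a single-line JSON object, plain @grep@ is enough to follow one container or to count failed probes. The helper names below are hypothetical, and the matching relies on the key order and spacing shown in the example logs above.

```shell
# Keep only the log lines that mention one container UUID ($1).
# Reads log lines on stdin; the plain-text systemd lines never match.
container_events() {
    grep -F "\"ContainerUUID\":\"$1\""
}

# Print the number of "probe failed" events in the log lines read on stdin.
# grep exits non-zero when the count is 0, but still prints 0.
count_probe_failures() {
    grep -c -F '"msg":"probe failed"'
}
```

For example, @journalctl -ocat -u arvados-dispatch-cloud | container_events c97qk-dz642-qxo0qjp93y2k4ht@ shows the lifecycle of the container from the example logs. Applied to those same logs, @count_probe_failures@ prints 3: the probes that failed while the new instance was still booting.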