Dispatching containers to cloud VMs » History » Version 80

Ward Vandewege, 09/16/2020 05:51 PM

h1. Dispatching containers to cloud VMs

(Draft)

{{>toc}}

See also:
* [[cloudtest utility]]

h2. Component name / purpose

arvados-dispatch-cloud runs Arvados user containers on generic public cloud infrastructure by automatically creating and destroying VMs of various sizes according to demand, preparing the VMs' runtime environments, and running containers on them.

h2. Overview of operation

The dispatcher waits for containers to appear in the queue, and runs them on appropriately sized cloud VMs. When there are no idle cloud VMs with the desired size, the dispatcher brings up more VMs using the cloud provider's API. The dispatcher also shuts down VMs that have been idle longer than the configured idle timeout, or sooner if the provider starts refusing to create new VMs.

h2. Interaction with other components

Controller (backed by RailsAPI and PostgreSQL) supplies the container queue: which containers the system should be trying to execute (or cancel) at any given time.

The cloud provider's API supplies a list of VMs that exist (or are being created) at a given time and their network addresses, accepts orders to create new VMs, updates instance tags, and (optionally, depending on the driver) obtains the VMs' SSH server public keys.

The SSH server on each cloud VM allows the dispatcher to authenticate with a private key and execute shell commands as root (either directly or via sudo).

h2. Instance tags

The dispatcher relies on the cloud provider's tagging feature to persist state across server restarts.
* {"InstanceType": "foo"} indicates that the instance was created with the specs from the instance type named "foo" in the cluster configuration file.
* {"IdleBehavior": "hold"} indicates that the management API has been used to put the instance in "hold" state.
* {"InstanceSecret": "ad23b6a8912f2b75d8a5e6887fbcb82f8024daea"} is a random string used to verify the instance's SSH host key.

Provider-specific drivers (Amazon, Google, Azure) determine exactly how these tags are encoded in the cloud API, and can use tags to persist their own internal state as well. For example, a driver might save tags named "Arvados-DispatchCloud-InstanceType" rather than just "InstanceType".
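The prefixing scheme in the example above can be sketched in a few lines of Python. This is only an illustration, not the drivers' actual code: the prefix is copied from the example, and the function names @encode_tags@ and @decode_tags@ are hypothetical.

<pre><code class="python">
# Hypothetical prefix, copied from the "Arvados-DispatchCloud-InstanceType" example.
TAG_PREFIX = "Arvados-DispatchCloud-"


def encode_tags(tags, prefix=TAG_PREFIX):
    """Return provider-side tags: every dispatcher tag name gets the prefix."""
    return {prefix + name: value for name, value in tags.items()}


def decode_tags(cloud_tags, prefix=TAG_PREFIX):
    """Return dispatcher tags, ignoring cloud tags that lack the prefix."""
    return {
        name[len(prefix):]: value
        for name, value in cloud_tags.items()
        if name.startswith(prefix)
    }


dispatcher_tags = {"InstanceType": "foo", "IdleBehavior": "hold"}
cloud_tags = encode_tags(dispatcher_tags)
# A tag set by someone else is left alone and does not leak into dispatcher state.
cloud_tags["Name"] = "unrelated tag set by the operator"
restored = decode_tags(cloud_tags)
</code></pre>

A prefix like this would let a driver tell its own tags apart from tags that operators or other tools attach to the same instances.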

h2. Deployment

*Where to install:* The arvados-dispatch-cloud process can run anywhere, as long as it has network access to the Arvados controller, the cloud provider's API, and the worker VMs. Each Arvados cluster should run only one arvados-dispatch-cloud process.
* Future versions will support multiple dispatchers.

*Dispatcher's SSH key:* The operator must generate an SSH key pair for the dispatcher to use when connecting to cloud VMs. The private key is stored (without a passphrase) in the cluster configuration file. It does not need to be saved in @~/.ssh/@.

*Cloud VM image:* The operator must provide a VM image with an SSH server on a port reachable by the dispatcher (default 22, configurable per cluster). The dispatcher's SSH public key must be listed in @/root/.ssh/authorized_keys@. The image should also include systemd-cat (part of systemd) and suitable versions of docker and crunch-run. The @/var/lock@ directory must be available for lockfiles with names matching "@crunch-run-*.*@".
* It is possible to install docker and crunch-run using a custom boot probe command, but pre-installing is more efficient.
* Future versions will automatically sync the crunch-run binary from the dispatcher host to each worker node.
* The Azure driver creates a new admin user account and installs the SSH public key by itself, so @/root/.ssh/authorized_keys@ is not needed. The VM image must include @sudo@.

*Cloud provider account:* The dispatcher uses cloud provider credentials to create and delete VMs and other cloud resources. An Arvados user can create an arbitrary number of long-running containers, and the dispatcher will try to run all of them. Currently the dispatcher does not enforce any resource limits of its own, so the operator must ensure the cloud provider itself is enforcing a suitable quota.

*Migrating from nodemanager/SLURM:* When VM images, SSH keys, and configuration files are ready, disable nodemanager and crunch-dispatch-slurm. Install the arvados-dispatch-cloud deb/rpm package. Confirm success with @systemctl status arvados-dispatch-cloud@ and @journalctl -fu arvados-dispatch-cloud@. See [[Migrating from arvados-node-manager to arvados-dispatch-cloud]].

h2. Configuration

Arvados [[Cluster configuration]] (currently a file in /etc) supplies cloud provider credentials, allowed node types, spending limits/policies, etc.

<pre><code class="yaml">
    CloudVMs:
      BootProbeCommand: "docker ps -q"
      SSHPort: 22
      SyncInterval: 1m    # how often to get list of active instances from cloud provider
      TimeoutIdle: 1m     # shut down if idle longer than this
      TimeoutBooting: 10m # shut down if instance exists longer than this without running BootProbeCommand successfully
      TimeoutProbe: 2m    # shut down if (after booting) communication fails longer than this, even if containers are running
      TimeoutShutdown: 1m # shut down again if node still exists this long after shutdown
      Driver: Amazon
      DriverParameters:   # following configs are driver dependent
        Region: us-east-1
        AccessKeyID: abcdef
        SecretAccessKey: abcdefghijklmnopqrstuvwxyz
        SubnetID: subnet-01234567
        SecurityGroupIDs: sg-01234567
        AdminUsername: ubuntu
        EBSVolumeType: gp2
    Dispatch:
      StaleLockTimeout: 1m     # after restart, time to wait for workers to come up before abandoning locks from previous run
      PollInterval: 1m         # how often to get latest queue from arvados controller
      ProbeInterval: 10s       # how often to probe each instance for current status/vital signs
      MaxProbesPerSecond: 1000 # limit total probe rate for dispatch process (across all instances)
      PrivateKey: |            # SSH key able to log in as root@ worker VMs
        -----BEGIN RSA PRIVATE KEY-----
        MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ
        0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E
        GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV
        mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7
        LtarBCFaK/pD7uWll/Uj7h7D8K48nIZUrvBJJjXL8Sm4LxCNoz3Z83k8J5ZzuDRD
        gRiQe/C085mhO6VL+2fypDLwcKt1tOL8fI81MwIDAQABAoIBACR3tEnmHsDbNOav
        Oxq8cwRQh9K2yDHg8BMJgz/TZa4FIx2HEbxVIw0/iLADtJ+Z/XzGJQCIiWQuvtg6
        exoFQESt7JUWRWkSkj9JCQJUoTY9Vl7APtBpqG7rIEQzd3TvzQcagZNRQZQO6rR7
        p8sBdBSZ72lK8cJ9tM3G7Kor/VNK7KgRZFNhEWnmvEa3qMd4hzDcQ4faOn7C9NZK
        dwJAuJVVfwOLlOORYcyEkvksLaDOK2DsB/p0AaCpfSmThRbBKN5fPXYaKgUdfp3w
        70Hpp27WWymb1cgjyqSH3DY+V/kvid+5QxgxCBRq865jPLn3FFT9bWEVS/0wvJRj
        iMIRrjECgYEA4Ffv9rBJXqVXonNQbbstd2PaprJDXMUy9/UmfHL6pkq1xdBeuM7v
        yf2ocXheA8AahHtIOhtgKqwv/aRhVK0ErYtiSvIk+tXG+dAtj/1ZAKbKiFyxjkZV
        X72BH7cTlR6As5SRRfWM/HaBGEgED391gKsI5PyMdqWWdczT5KfxAksCgYEAwXYE
        ewPmV1GaR5fbh2RupoPnUJPMj36gJCnwls7sGaXDQIpdlq56zfKgrLocGXGgj+8f
        QH7FHTJQO15YCYebtsXWwB3++iG43gVlJlecPAydsap2CCshqNWC5JU5pan0QzsP
        exzNzWqfUPSbTkR2SRaN+MenZo2Y/WqScOAth7kCgYBgVoLujW9EXH5QfXJpXLq+
        jTvE38I7oVcs0bJwOLPYGzcJtlwmwn6IYAwohgbhV2pLv+EZSs42JPEK278MLKxY
        lgVkp60npgunFTWroqDIvdc1TZDVxvA8h9VeODEJlSqxczgbMcIUXBM9yRctTI+5
        7DiKlMUA4kTFW2sWwuOlFwKBgGXvrYS0FVbFJKm8lmvMu5D5x5RpjEu/yNnFT4Pn
        G/iXoz4Kqi2PWh3STl804UF24cd1k94D7hDoReZCW9kJnz67F+C67XMW+bXi2d1O
        JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN
        ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI
        pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9
        1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ
        -----END RSA PRIVATE KEY-----
    InstanceTypes:
    - Name: m4.large
      VCPUs: 2
      RAM: 7782000000
      Scratch: 32000000000
      IncludedScratch: 32000000000
      Price: 0.1
    - Name: m4.large.spot
      Preemptible: true
      VCPUs: 2
      RAM: 7782000000
      Scratch: 32000000000
      IncludedScratch: 32000000000
      Price: 0.1
    - Name: m4.xlarge
      VCPUs: 4
      RAM: 15564000000
      Scratch: 80000000000
      IncludedScratch: 80000000000
      Price: 0.2
    - Name: m4.xlarge.spot
      Preemptible: true
      VCPUs: 4
      RAM: 15564000000
      Scratch: 80000000000
      IncludedScratch: 80000000000
      Price: 0.2
    - Name: m4.2xlarge
      VCPUs: 8
      RAM: 31129000000
      Scratch: 160000000000
      IncludedScratch: 160000000000
      Price: 0.4
    - Name: m4.2xlarge.spot
      Preemptible: true
      VCPUs: 8
      RAM: 31129000000
      Scratch: 160000000000
      IncludedScratch: 160000000000
      Price: 0.4
</code></pre>
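This page does not describe how a container is matched to its "desired instance type". As a rough illustration only, one plausible rule is to pick the cheapest configured type that satisfies the container's requirements. The function name @choose_instance_type@ and its arguments are assumptions; the instance type fields mirror the non-preemptible entries of the example configuration.

<pre><code class="python">
# Subset of the InstanceTypes entries from the example configuration.
INSTANCE_TYPES = [
    {"Name": "m4.large", "VCPUs": 2, "RAM": 7782000000,
     "Scratch": 32000000000, "Price": 0.1},
    {"Name": "m4.xlarge", "VCPUs": 4, "RAM": 15564000000,
     "Scratch": 80000000000, "Price": 0.2},
    {"Name": "m4.2xlarge", "VCPUs": 8, "RAM": 31129000000,
     "Scratch": 160000000000, "Price": 0.4},
]


def choose_instance_type(types, vcpus, ram, scratch):
    """Return the name of the cheapest type that fits, or None if none fits.

    Assumed selection rule, for illustration only.
    """
    candidates = [
        t for t in types
        if t["VCPUs"] >= vcpus and t["RAM"] >= ram and t["Scratch"] >= scratch
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t["Price"])["Name"]


# A container asking for 3 VCPUs, 8 GB RAM and 10 GB scratch.
chosen = choose_instance_type(INSTANCE_TYPES, 3, 8000000000, 10000000000)
</code></pre>

With the example values, the request does not fit m4.large (only 2 VCPUs), so the sketch falls through to the next cheapest type.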

h2. Management API

APIs for monitoring/diagnostics/control are available via HTTP on a configurable address/port. Request headers must include "Authorization: Bearer {management token}".

Responses are JSON-encoded and resemble other Arvados APIs:
<pre><code class="json">
{
  "items": [
    {
      "name": "...",
      ...
    },
    ...
  ]
}
</code></pre>
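A client request could be prepared as in the following sketch, which uses only the Python standard library. The host name, port, and token are placeholders, and the helper names are hypothetical; the @name@ field comes from the sample response above. The request is built but not sent.

<pre><code class="python">
import json
import urllib.request


def management_request(base_url, token, path, method="GET"):
    """Build (but do not send) a management API request with the bearer token."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        method=method,
        headers={"Authorization": "Bearer " + token},
    )


def item_names(response_body):
    """Extract the "name" of every entry in a response's "items" list."""
    return [item["name"] for item in json.loads(response_body)["items"]]


# Placeholder address and token: substitute the configured values.
req = management_request(
    "http://dispatcher.example:9006",
    "example-management-token",
    "/arvados/v1/dispatch/containers",
)
# To send it: body = urllib.request.urlopen(req).read()
</code></pre>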

@GET /arvados/v1/dispatch/containers@ lists queued/locked/running containers. If you're switching from SLURM, this is roughly *equivalent to squeue*. Each returned item includes:
* container UUID
* container state (Queued/Locked/Running/Complete/Cancelled)
* desired instance type
* time appeared in queue
* time started (if started)

@POST /arvados/v1/dispatch/containers/kill?container_uuid=X@ terminates a container immediately. If you're switching from SLURM, this is roughly *equivalent to scancel*.
* a single attempt is made to send SIGTERM to the container's supervisor (crunch-run) process
* container state/priority fields are not affected
* assuming SIGTERM works, the container record will end up with state "Cancelled"

@GET /arvados/v1/dispatch/instances@ lists cloud VMs. If you're switching from SLURM, this is roughly *equivalent to sinfo*. Each returned item includes:
* provider's instance ID
* hourly price (from configuration file)
* instance type (from configuration file)
* instance type (from provider's menu)
* UUID of the current / most recent container attempted (if known)
* time last container finished (or boot time, if nothing run yet)

@POST /arvados/v1/dispatch/instances/hold?instance_id=X@ puts an instance in "hold" state.
* if the instance is currently running a container, it is allowed to continue
* no further containers will be scheduled on the instance
* the instance will not be shut down automatically

@POST /arvados/v1/dispatch/instances/drain?instance_id=X@ puts an instance in "drain" state.
* if the instance is currently running a container, it is allowed to continue
* no further containers will be scheduled on the instance
* the instance will be shut down automatically when all containers finish

@POST /arvados/v1/dispatch/instances/run?instance_id=X@ puts an instance in the default "run" state.
* if the instance is currently running a container, it is allowed to continue
* more containers will be scheduled on the instance when it becomes available
* the instance will be shut down automatically when it exceeds the configured idle timeout
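The three states "hold", "drain", and "run" differ only in whether new containers are scheduled and when the instance is shut down automatically. The following sketch restates those rules as code; the function and parameter names are illustrative and are not taken from the dispatcher.

<pre><code class="python">
def may_schedule(idle_behavior):
    """Only instances in the default "run" state receive new containers."""
    return idle_behavior == "run"


def should_shut_down(idle_behavior, running_containers, idle_seconds,
                     idle_timeout_seconds):
    """Decide on automatic shutdown according to the rules listed above."""
    if running_containers > 0:
        # A running container is always allowed to continue.
        return False
    if idle_behavior == "hold":
        return False  # never shut down automatically
    if idle_behavior == "drain":
        return True   # all containers have finished
    # Default "run" state: shut down once the idle timeout is exceeded.
    return idle_seconds > idle_timeout_seconds
</code></pre>

For example, a drained instance with no containers is shut down right away, while a held instance is kept regardless of how long it has been idle.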

@POST /arvados/v1/dispatch/instances/kill?instance_id=X@ shuts down an instance immediately.
* the instance is terminated immediately via cloud API
* SIGTERM is sent to the container if one is running, but no effort is made to give it time to end gracefully before terminating the instance

&dagger;@POST /arvados/v1/dispatch/loglevel?level=debug@ sets the logging threshold to "debug" or "info".
* @.../loglevel?level=debug@ enables debug logs
* @.../loglevel?level=info@ disables debug logs

&dagger; not yet implemented

h2. Management CLI

Sub-command for *arvados-server*:

<pre>
arvados-server dispatch
</pre>

A short form of the binary is provided by renaming (or symlinking) *arvados-server* to *ad*. When invoked under that name, it only provides access to the "dispatch" subcommands.

The subcommands can be abbreviated to the shortest form that is distinguishable from other subcommands.
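One way to implement this kind of abbreviation is unique-prefix matching, sketched below. This is an assumption about how the rule could work, not the actual CLI code; the subcommand names are the ones used in the examples on this page.

<pre><code class="python">
def resolve_subcommand(abbrev, subcommands):
    """Expand an abbreviation to the single subcommand it is a prefix of."""
    matches = [name for name in subcommands if name.startswith(abbrev)]
    if abbrev in matches:
        return abbrev  # an exact match always wins
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("unknown subcommand: %r" % abbrev)
    raise ValueError(
        "ambiguous subcommand %r: could be %s" % (abbrev, ", ".join(sorted(matches)))
    )


TOP_LEVEL = ["containers", "instances", "loglevel"]
INSTANCE_COMMANDS = ["list", "hold", "run", "terminate", "ssh"]

expanded = [
    resolve_subcommand("i", TOP_LEVEL),
    resolve_subcommand("t", INSTANCE_COMMANDS),
]
</code></pre>

Under this rule @ad i t@ expands to @instances terminate@, and both @container@ and @containers@ resolve to the same subcommand.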

Some commands apply to environments with arvados-dispatch-cloud or crunch-dispatch-slurm, and some only apply when arvados-dispatch-cloud is running.

The host that runs the *ad* binary must have access to a *config.yml* that lists, at a minimum, the dispatcher endpoint and the management token.

All commands support a *-o* flag to specify the type of output. The default is "table", which is meant for human consumption at the CLI. The alternative is "json", which is suitable for machine consumption.

Manage containers (arvados-dispatch-cloud and crunch-dispatch-slurm):

<pre>
# list containers (default state is 'Queued,Locked,Running')
# possible states: Queued, Locked, Running, Complete, Cancelled
# multiple states may be provided, separated with a comma
$ ad containers list -s <state>
$ ad c l

# terminate a container
$ ad container terminate <uuid>
$ ad c t <uuid>
</pre>

&dagger; Inspect and manipulate loglevel of the running dispatcher (arvados-dispatch-cloud and crunch-dispatch-slurm):

<pre>
# get arvados-dispatch loglevel
$ ad loglevel
$ ad l

# set arvados-dispatch loglevel
$ ad loglevel -set <debug|info>
$ ad l -set <debug|info>
</pre>

Manage instances (arvados-dispatch-cloud only):

<pre>
# list instances
$ ad instances list
$ ad i l

# put instance in 'hold' state
$ ad instance hold <instance_id>
$ ad i h <instance_id>

# return instance to 'run' state
$ ad instance run <instance_id>
$ ad i r <instance_id>

# terminate instance immediately
$ ad instance terminate <instance_id>
$ ad i t <instance_id>

# ssh to instance
$ ad instance ssh <instance_id>
$ ad i s <instance_id>
</pre>

&dagger; not yet implemented

h2. Metrics

Metrics are available via HTTP on a configurable address/port (conventionally :9006). Request headers must include "Authorization: Bearer {management token}".

Metrics include:
* (gauge) number of existing VMs
* (gauge) total hourly price of all existing VMs
* (gauge) total VCPUs and memory in all existing VMs
* (gauge) total VCPUs and memory allocated to containers
* (gauge) number of containers running
* (gauge) number of containers allocated to VMs but not started yet (because VMs are pending/booting)
* (gauge) number of containers not allocated to VMs (because provider quota is reached)
* (gauge) total hourly price of VMs, partitioned by allocation state (booting, running, idle, shutdown)
* (summary) time elapsed between VM creation and first successful SSH connection to that VM
* (summary) time elapsed between first successful SSH connection on a VM and ready to run a container on that VM
* (summary) time elapsed between first shutdown attempt on a VM and its disappearance from the provider listing
* (summary) wait times (between seeing a container in the queue or requeueing, and starting its crunch-run process on a worker) across previous starts
* (gauge) longest wait time of any unstarted container
* &dagger;(counter) cumulative instance time and cost, partitioned by allocation state and node type
* (counter) VMs that have either become ready or reached boot timeout, partitioned by success/timeout

&dagger; not yet implemented

h2. Logs

For purposes of troubleshooting, a JSON-formatted log entry is printed on stderr when...

|                                                              |... if loglevel &ge; ...|...including timestamp and...|
|a new instance is created/ordered                             |info                    |instance type name|
|an instance appears on the provider's list of instances       |info                    |instance ID|
|an instance's boot probe succeeds                             |info                    |instance ID|
|an instance is shut down after boot timeout                   |warn                    |instance ID, &dagger;stdout/stderr/error from last boot probe attempt|
|an instance shutdown is requested                             |info                    |instance ID|
|an instance disappears from the provider's list of instances  |info                    |instance ID and previous state (booting/idle/shutdown)|
|a cloud provider API or driver error occurs                   |error                   |provider/driver's error message|
|a new container appears in the Arvados queue                  |info                    |container UUID, desired instance type name|
|a container is locked by the dispatcher                       |debug                   |container UUID|
|a crunch-run process is started on an instance                |info                    |container UUID, instance ID, crunch-run PID|
|a crunch-run process fails to start on an instance            |info                    |container UUID, instance ID, stdout/stderr/exitcode|
|a crunch-run process ends                                     |info                    |container UUID, instance ID|
|an active container's state changes to Complete or Cancelled  |info                    |container UUID, new state|
|an active container is requeued after being locked            |info                    |container UUID|
|an Arvados API error occurs                                   |warn                    |error message|

&dagger; not yet implemented

Example log entries from the test suite (note that the test suite uses text formatting, whereas production logging uses JSON formatting):
<pre>
INFO[0000] creating new instance                         ContainerUUID=zzzzz-dz642-000000000000160 InstanceType=type8
INFO[0000] instance appeared in cloud                    IdleBehavior=run Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 State=booting
INFO[0000] boot probe succeeded                          Command=true Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 stderr= stdout=
INFO[0000] instance booted; will try probeRunning        Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 ProbeStart="2019-02-05 15:49:49.183431341 -0500 EST m=+0.126074285"
INFO[0000] probes succeeded, instance is in service      Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 ProbeStart="2019-02-05 15:49:49.183431341 -0500 EST m=+0.126074285" RunningContainers=0 State=idle
INFO[0000] crunch-run process started                    ContainerUUID=zzzzz-dz642-000000000000160 Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 Priority=20
INFO[0000] container finished                            ContainerUUID=zzzzz-dz642-000000000000160 State=Complete
...
INFO[0002] shutdown idle worker                          Age=151.615512ms IdleBehavior=run Instance=stub-providertype8-6ec34c367674cb74 InstanceType=type8 State=idle
INFO[0002] instance disappeared in cloud                 Instance=stub-providertype8-6ec34c367674cb74 WorkerState=shutdown
</pre>

If the dispatcher starts with a non-empty ARVADOS_DEBUG environment variable, it also prints more detailed logs about other internal state changes, using level=debug.
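Because production log entries are JSON, they can be filtered with a short script. The sketch below keeps entries at or above a given level. The field names @level@ and @msg@ and the sample lines are assumptions made for illustration; check them against real dispatcher output before relying on them.

<pre><code class="python">
import json

# Assumed ordering of level names, lowest first.
LEVELS = {"debug": 0, "info": 1, "warn": 2, "warning": 2, "error": 3}


def filter_entries(lines, min_level="warn"):
    """Return the parsed entries whose level is at least min_level."""
    threshold = LEVELS[min_level]
    selected = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log entry
        if not isinstance(entry, dict):
            continue
        if LEVELS.get(entry.get("level"), -1) >= threshold:
            selected.append(entry)
    return selected


# Hypothetical sample lines, not real dispatcher output.
sample = [
    '{"level": "info", "msg": "instance appeared in cloud"}',
    '{"level": "error", "msg": "cloud provider API error"}',
    "this line is not JSON",
]
important = filter_entries(sample)
</code></pre>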

h2. Internal details

h3. Worker lifecycle

<pre>
  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │                                                                                                                                                    │
  │                  create() returns ID                                                                                                               │         want=drain
  │    ┌───────────────────────────────────────────────────────────────────────────┐                                      ┌────────────────────────────┼─────────────────────────────────────────┐
  │    │                                                                           ∨                                      │                            │                                         ∨
  │  ┌─────────────┐  appears in cloud list   ┌─────────┐  create() returns ID   ┌─────────┐  boot+run probes succeed   ┌──────┐  container starts   ┌─────────┐  container ends, want=drain   ┌──────────┐  instance disappears from cloud   ┌──────┐
  │  │ Nonexistent │ ───────────────────────> │ Unknown │ ─────────────────────> │ Booting │ ─────────────────────────> │      │ ──────────────────> │ Running │ ────────────────────────────> │          │ ────────────────────────────────> │ Gone │
  │  └─────────────┘                          └─────────┘                        └─────────┘                            │      │                     └─────────┘                               │          │                                   └──────┘
  │                                             │                                                                       │      │                                 idle timeout                  │          │
  │                                             │                                                                       │ Idle │ ────────────────────────────────────────────────────────────> │ Shutdown │
  │                                             │                                                                       │      │                                                               │          │
  │                                             │                                                                       │      │                                 probe timeout                 │          │
  │                                             │                                                                       │      │ ────────────────────────────────────────────────────────────> │          │
  │                                             │                                                                       └──────┘                                                               └──────────┘
  │                                             │                                                                         ∧      boot timeout                                                    ∧
  │                                             └─────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┘
  │                                                                                                                       │
  │   container ends                                                                                                      │
  └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘</pre>
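The diagram can also be read as a transition table. The sketch below transcribes the arrows as they appear to read in the diagram; it is an interpretation of the drawing, not a copy of the dispatcher's state machine, and the helper names @next_state@ and @walk@ are hypothetical.

<pre><code class="python">
# (current state, event) -> next state, transcribed from the diagram.
TRANSITIONS = {
    ("Nonexistent", "appears in cloud list"): "Unknown",
    ("Nonexistent", "create() returns ID"): "Booting",
    ("Unknown", "create() returns ID"): "Booting",
    ("Unknown", "boot timeout"): "Shutdown",
    ("Booting", "boot+run probes succeed"): "Idle",
    ("Idle", "container starts"): "Running",
    ("Idle", "want=drain"): "Shutdown",
    ("Idle", "idle timeout"): "Shutdown",
    ("Idle", "probe timeout"): "Shutdown",
    ("Running", "container ends"): "Idle",
    ("Running", "container ends, want=drain"): "Shutdown",
    ("Shutdown", "instance disappears from cloud"): "Gone",
}


def next_state(state, event):
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError("no transition from %s on %r" % (state, event))


def walk(events, state="Nonexistent"):
    """Apply events in order and return the list of states visited."""
    visited = [state]
    for event in events:
        state = next_state(state, event)
        visited.append(state)
    return visited


lifecycle = walk([
    "create() returns ID",
    "boot+run probes succeed",
    "container starts",
    "container ends",
    "idle timeout",
    "instance disappears from cloud",
])
</code></pre>

The example walk follows a worker that is created, boots, runs one container, idles until the timeout, and is then removed.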

h3. Scheduling policy

The container priority field determines the order in which resources are allocated.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no pending/booting/idle VM suitable for running C2,
* ...then C1 will not be started.

However, containers that run on different VM types don't necessarily start in priority order.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no idle VM suitable for running C2,
* ...and there is a pending/booting VM that will be suitable for running C2 when it comes up,
* ...and there is an idle VM suitable for running C1,
* ...then C1 will start before C2.
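Both rules can be reproduced with a small simulation. This is a simplified sketch under stated assumptions: each container needs exactly one instance type, VM creation is not modelled, and the @schedule@ function and its arguments are hypothetical.

<pre><code class="python">
def schedule(containers, idle, booting):
    """Return the UUIDs of containers that can start right now.

    containers: list of (uuid, priority, instance_type) tuples
    idle, booting: dicts mapping instance type name to number of such VMs
    """
    idle = dict(idle)        # copy so the caller's counts are not modified
    booting = dict(booting)
    started = []
    # Consider containers in priority order, highest priority first.
    for uuid, _priority, itype in sorted(
            containers, key=lambda c: c[1], reverse=True):
        if idle.get(itype, 0) > 0:
            idle[itype] -= 1
            started.append(uuid)
        elif booting.get(itype, 0) > 0:
            # A suitable VM is on its way: reserve it for this container and
            # let lower-priority containers use other idle VMs meanwhile.
            booting[itype] -= 1
        else:
            # No pending/booting/idle VM for this container: nothing with
            # lower priority may start ahead of it.
            break
    return started


queue = [("C1", 1, "small"), ("C2", 2, "big")]
first_rule = schedule(queue, idle={"small": 1}, booting={})
second_rule = schedule(queue, idle={"small": 1}, booting={"big": 1})
</code></pre>

In the first call nothing starts, because C2 has no VM at all. In the second, C1 starts ahead of C2, because a VM for C2 is already booting.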

h3. Special cases / synchronizing state

When first starting up, the dispatcher inspects the API server's container queue and the cloud provider's list of dispatcher-tagged cloud nodes, and restores internal state accordingly.

At startup, some containers might have state=Locked. The dispatcher can't be sure these have no corresponding crunch-run process anywhere until it establishes communication with all running instances. To avoid breaking priority order by guessing wrong, the dispatcher avoids scheduling any new containers until either all such "stale-locked" containers are matched up with crunch-run processes on existing VMs (typically preparing a docker image), or all of the existing VMs have been probed successfully (meaning the locked containers aren't running anywhere and need to be rescheduled).

At startup, some instances might still be running containers that were started by a prior invocation, even though the (new) boot probe command fails. Such instances are left alive at least until the containers finish. After that, the usual rules apply: if the boot probe succeeds before the boot timeout, start scheduling containers; otherwise, shut down. This allows the operator to configure a new image along with a new boot probe command that only works on the new image, without disrupting users' work.

When a user cancels a container request with state=Locked or Running, the container priority changes to 0. On its next poll, the dispatcher notices this and kills any corresponding crunch-run processes (or, if there is no such process, just unlocks the container).

When a crunch-run process ends without finalizing its container's state, the dispatcher notices this and sets state to Cancelled.

h3. Probes

On the happy path, the dispatcher already knows the state of each worker: whether it's idle, and which container it's running. In general, though, it's necessary to probe the worker node itself.

Probe:
* Check whether the SSH connection is alive; reopen if needed.
* Run the configured "ready?" command, i.e. the boot probe command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting.
* Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID)
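The probe sequence above can be sketched as a single function. The SSH layer is replaced by a caller-supplied @run_command@ callable, which is a hypothetical interface: it returns an exit code and stdout, and raises @ConnectionError@ when the connection cannot be established. The output format of "crunch-run --list" is not specified on this page, so the sketch returns its lines unparsed.

<pre><code class="python">
def probe(run_command, boot_probe_command):
    """Probe one worker and summarize what was learned.

    run_command(cmd) -> (exit_code, stdout); raises ConnectionError if the
    SSH connection cannot be opened (hypothetical interface).
    """
    result = {"reachable": False, "booted": False, "supervisors": []}
    try:
        exit_code, _ = run_command(boot_probe_command)
        result["reachable"] = True
        if exit_code != 0:
            return result  # "ready?" command failed: still booting
        result["booted"] = True
        exit_code, stdout = run_command("crunch-run --list")
    except ConnectionError:
        return result
    if exit_code == 0:
        # One line per crunch-run supervisor, kept unparsed.
        result["supervisors"] = [
            line for line in stdout.splitlines() if line.strip()
        ]
    return result


def fake_worker(cmd):
    """Stand-in for a booted worker running one container (test data)."""
    if cmd == "docker ps -q":
        return 0, ""
    return 0, "zzzzz-dz642-000000000000160\n"


status = probe(fake_worker, "docker ps -q")
</code></pre>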

h3. Detecting dead/lame nodes

If a booted node goes longer than the configured probe timeout (@TimeoutProbe@ in the example configuration) without a successful probe, it is shut down, even if it was running a container the last time it was contacted successfully.
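As a small sketch of this rule: the check depends only on how long it has been since the last successful probe, and deliberately ignores whether a container was running. The function and parameter names are illustrative.

<pre><code class="python">
def is_unresponsive(seconds_since_last_successful_probe, probe_timeout_seconds,
                    running_containers=0):
    """Return True if the node should be shut down as dead/lame.

    running_containers is accepted only to make the point that it does not
    affect the decision.
    """
    return seconds_since_last_successful_probe > probe_timeout_seconds


# With a 2 minute timeout, a node silent for 121 seconds is shut down
# even though it was last seen running a container.
shut_down = is_unresponsive(121, 120, running_containers=1)
</code></pre>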

h1. Future plans / features

Per-instance-type VM images: It can be useful to run differently configured/tuned kernels/systems on different instance types, use different ops/monitoring systems on preemptible instances, etc. In addition to a system-wide default, each instance type could optionally specify an image.

Selectable VM images: When upgrading a production system, it can be useful to run a few trial containers on a new VM image before making it the default.