Dispatching containers to cloud VMs » History » Version 48

Tom Clegg, 01/30/2019 04:26 PM

h1. Dispatching containers to cloud VMs

(Draft)

{{>toc}}

h2. Component name / purpose

crunch-dispatch-cloud runs Arvados user containers on generic public cloud infrastructure by automatically creating and destroying VMs of various sizes according to demand, preparing the VMs' runtime environments, and running containers on them.

h2. Deployment

*Where to install:* The crunch-dispatch-cloud process can run anywhere, as long as it has network access to the Arvados controller, the cloud provider's API, and the worker VMs. Each Arvados cluster should run only one crunch-dispatch-cloud process.
* Future versions will support multiple dispatchers.

*Dispatcher's SSH key:* The operator must generate an SSH key pair for the dispatcher to use when connecting to cloud VMs. The private key is stored (without a passphrase) in the cluster configuration file. It does not need to be saved in @~/.ssh/@.

*Cloud VM image:* The operator must provide a VM image with an SSH server on a port reachable by the dispatcher (default 22, configurable per cluster). The dispatcher's SSH public key must be listed in @/root/.ssh/authorized_keys@. The image should also include suitable versions of docker and crunch-run. The @/var/lock@ directory must be available for lockfiles with names matching "@crunch-run-*.*@".
* It is possible to install docker and crunch-run using a custom boot probe command, but pre-installing is more efficient.
* Future versions will automatically sync the crunch-run binary from the dispatcher host to each worker node.

*Cloud provider account:* The dispatcher uses cloud provider credentials to create and delete VMs and other cloud resources. An Arvados user can create an arbitrary number of long-running containers, and the dispatcher will try to run all of them. Currently the dispatcher does not enforce any resource limits of its own, so the operator must ensure the cloud provider itself is enforcing a suitable quota.

*Migrating from nodemanager/SLURM:* When VM images, SSH keys, and configuration files are ready, disable nodemanager and crunch-dispatch-slurm. Install the crunch-dispatch-cloud deb/rpm package. Confirm success with @systemctl status crunch-dispatch-cloud@ and @journalctl -fu crunch-dispatch-cloud@.

h2. Overview of operation

The dispatcher waits for containers to appear in the queue, and runs them on appropriately sized cloud VMs. When there are no idle cloud VMs with the desired size, the dispatcher brings up more VMs using the cloud provider's API. The dispatcher also shuts down VMs that have been idle longer than the configured idle timeout, or sooner if the provider starts refusing to create new VMs.
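
As a rough illustration of the "bring up more VMs" step, the following sketch compares the instance types wanted by queued containers against the VMs already available. The function name and its inputs are hypothetical, and counting VMs that are still being created as available supply is an assumption of this sketch, not a statement about the dispatcher's code.

```python
from collections import Counter


def vms_to_create(queued_types, available_types):
    """Return {instance type name: number of VMs to request}.

    queued_types: desired instance type name of each queued container.
    available_types: type name of each VM that is idle or already being
    created/booted (hypothetical inputs for this sketch).
    """
    demand = Counter(queued_types)
    supply = Counter(available_types)
    # Counter subtraction keeps only positive counts, so instance types
    # with enough supply drop out of the result.
    return dict(demand - supply)


if __name__ == "__main__":
    print(vms_to_create(["m4.large", "m4.large", "m4.xlarge"], ["m4.large"]))
```

A real dispatcher would also have to respect provider quota errors and priority order, which this sketch ignores.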

h2. Interaction with other components

The controller (backed by RailsAPI and PostgreSQL) supplies the container queue: which containers the system should be trying to execute (or cancel) at any given time.

The cloud provider's API supplies a list of VMs that exist (or are being created) at a given time and their network addresses, accepts orders to create new VMs, updates instance tags, and (optionally, depending on the driver) obtains the VMs' SSH server public keys.

The SSH server on each cloud VM allows the dispatcher to authenticate with a private key and execute shell commands as root.

h2. Configuration

Arvados [[Cluster configuration]] (currently a file in /etc) supplies cloud provider credentials, allowed node types, spending limits/policies, etc.

<pre><code class="yaml">
    CloudVMs:
      BootProbeCommand: "docker ps -q"
      SSHPort: 22
      SyncInterval: 1m    # how often to get list of active instances from cloud provider
      TimeoutIdle: 1m     # shutdown if idle longer than this
      TimeoutBooting: 10m # shutdown if exists longer than this without running BootProbeCommand successfully
      TimeoutProbe: 2m    # shutdown if (after booting) communication fails longer than this, even if ctrs are running
      TimeoutShutdown: 1m # shutdown again if node still exists this long after shutdown
      Driver: Amazon
      DriverParameters:   # following configs are driver dependent
        Region: us-east-1
        APITimeout: 20s
        AWSAccessKeyID: abcdef
        AWSSecretAccessKey: abcdefghijklmnopqrstuvwxyz
        ImageID: ami-0123456789abcdef0
        SubnetID: subnet-01234567
        SecurityGroups: sg-01234567
    Dispatch:
      StaleLockTimeout: 1m     # after restart, time to wait for workers to come up before abandoning locks from previous run
      PollInterval: 1m         # how often to get latest queue from arvados controller
      ProbeInterval: 10s       # how often to probe each instance for current status/vital signs
      MaxProbesPerSecond: 1000 # limit total probe rate for dispatch process (across all instances)
      PrivateKey: |            # SSH key able to log in as root@ worker VMs
        -----BEGIN RSA PRIVATE KEY-----
        MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ
        0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E
        GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV
        mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7
        LtarBCFaK/pD7uWll/Uj7h7D8K48nIZUrvBJJjXL8Sm4LxCNoz3Z83k8J5ZzuDRD
        gRiQe/C085mhO6VL+2fypDLwcKt1tOL8fI81MwIDAQABAoIBACR3tEnmHsDbNOav
        Oxq8cwRQh9K2yDHg8BMJgz/TZa4FIx2HEbxVIw0/iLADtJ+Z/XzGJQCIiWQuvtg6
        exoFQESt7JUWRWkSkj9JCQJUoTY9Vl7APtBpqG7rIEQzd3TvzQcagZNRQZQO6rR7
        p8sBdBSZ72lK8cJ9tM3G7Kor/VNK7KgRZFNhEWnmvEa3qMd4hzDcQ4faOn7C9NZK
        dwJAuJVVfwOLlOORYcyEkvksLaDOK2DsB/p0AaCpfSmThRbBKN5fPXYaKgUdfp3w
        70Hpp27WWymb1cgjyqSH3DY+V/kvid+5QxgxCBRq865jPLn3FFT9bWEVS/0wvJRj
        iMIRrjECgYEA4Ffv9rBJXqVXonNQbbstd2PaprJDXMUy9/UmfHL6pkq1xdBeuM7v
        yf2ocXheA8AahHtIOhtgKqwv/aRhVK0ErYtiSvIk+tXG+dAtj/1ZAKbKiFyxjkZV
        X72BH7cTlR6As5SRRfWM/HaBGEgED391gKsI5PyMdqWWdczT5KfxAksCgYEAwXYE
        ewPmV1GaR5fbh2RupoPnUJPMj36gJCnwls7sGaXDQIpdlq56zfKgrLocGXGgj+8f
        QH7FHTJQO15YCYebtsXWwB3++iG43gVlJlecPAydsap2CCshqNWC5JU5pan0QzsP
        exzNzWqfUPSbTkR2SRaN+MenZo2Y/WqScOAth7kCgYBgVoLujW9EXH5QfXJpXLq+
        jTvE38I7oVcs0bJwOLPYGzcJtlwmwn6IYAwohgbhV2pLv+EZSs42JPEK278MLKxY
        lgVkp60npgunFTWroqDIvdc1TZDVxvA8h9VeODEJlSqxczgbMcIUXBM9yRctTI+5
        7DiKlMUA4kTFW2sWwuOlFwKBgGXvrYS0FVbFJKm8lmvMu5D5x5RpjEu/yNnFT4Pn
        G/iXoz4Kqi2PWh3STl804UF24cd1k94D7hDoReZCW9kJnz67F+C67XMW+bXi2d1O
        JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN
        ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI
        pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9
        1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ
        -----END RSA PRIVATE KEY-----
    InstanceTypes:
    - Name: m4.large
      VCPUs: 2
      RAM: 7782000000
      Scratch: 32000000000
      Price: 0.1
    - Name: m4.large.spot
      Preemptible: true
      VCPUs: 2
      RAM: 7782000000
      Scratch: 32000000000
      Price: 0.1
    - Name: m4.xlarge
      VCPUs: 4
      RAM: 15564000000
      Scratch: 80000000000
      Price: 0.2
    - Name: m4.xlarge.spot
      Preemptible: true
      VCPUs: 4
      RAM: 15564000000
      Scratch: 80000000000
      Price: 0.2
    - Name: m4.2xlarge
      VCPUs: 8
      RAM: 31129000000
      Scratch: 160000000000
      Price: 0.4
    - Name: m4.2xlarge.spot
      Preemptible: true
      VCPUs: 8
      RAM: 31129000000
      Scratch: 160000000000
      Price: 0.4
</code></pre>
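
To show how the @InstanceTypes@ entries might be used, here is a minimal sketch that picks the cheapest configured type satisfying a container's resource needs. The selection rule (cheapest type that fits) and the function name @choose_instance_type@ are assumptions made for illustration; only the field names and values come from the example configuration above.

```python
# Non-preemptible entries copied from the example configuration.
INSTANCE_TYPES = [
    {"Name": "m4.large", "VCPUs": 2, "RAM": 7782000000,
     "Scratch": 32000000000, "Price": 0.1},
    {"Name": "m4.xlarge", "VCPUs": 4, "RAM": 15564000000,
     "Scratch": 80000000000, "Price": 0.2},
    {"Name": "m4.2xlarge", "VCPUs": 8, "RAM": 31129000000,
     "Scratch": 160000000000, "Price": 0.4},
]


def choose_instance_type(vcpus, ram, scratch, types=INSTANCE_TYPES):
    """Return the cheapest type meeting all three requirements, or None.

    Hypothetical helper: the "cheapest fit" rule is an assumption.
    """
    fits = [t for t in types
            if t["VCPUs"] >= vcpus
            and t["RAM"] >= ram
            and t["Scratch"] >= scratch]
    return min(fits, key=lambda t: t["Price"], default=None)


if __name__ == "__main__":
    chosen = choose_instance_type(vcpus=3, ram=8000000000, scratch=10000000000)
    print(chosen["Name"] if chosen else "no suitable instance type")
```

A container needing 3 VCPUs does not fit @m4.large@, so the sketch falls through to the next cheapest type that does fit.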

h2. Management API

APIs for monitoring/diagnostics/control are available via HTTP on a configurable address/port. Request headers must include "Authorization: Bearer {management token}".

Responses are JSON-encoded and resemble other Arvados APIs:
<pre><code class="json">
{
  "items": [
    {
      "name": "...",
      ...
    },
    ...
  ]
}
</code></pre>

@GET /arvados/v1/dispatch/instances@ lists cloud VMs. Each returned item includes:
* provider's instance ID
* hourly price (from configuration file)
* instance type (from configuration file)
* instance type (from provider's menu)
* UUID of the current / most recent container attempted (if known)
* time last container finished (or boot time, if nothing run yet)

@GET /arvados/v1/dispatch/containers@ lists queued/locked/running containers. Each returned item includes:
* container UUID
* container state (Queued/Locked/Running/Complete/Cancelled)
* desired instance type
* time appeared in queue
* time started (if started)

@POST /arvados/v1/dispatch/instances/:instance_id/hold@ puts an instance in "hold" state.
* if the instance is currently running a container, it is allowed to continue
* no further containers will be scheduled on the instance
* the instance will not be shut down automatically

@POST /arvados/v1/dispatch/instances/:instance_id/drain@ puts an instance in "drain" state.
* if the instance is currently running a container, it is allowed to continue
* no further containers will be scheduled on the instance
* the instance will be shut down automatically when all containers finish

&dagger;@POST /arvados/v1/dispatch/instances/:instance_id/kill@ shuts down an instance immediately.
* the instance is terminated immediately via cloud API
* SIGTERM is sent to the container if one is running, but no effort is made to give it time to end gracefully before terminating the instance

&dagger;@POST /arvados/v1/dispatch/loglevel/:level@ sets the logging threshold to "debug" or "info".
* @.../loglevel/debug@ enables debug logs
* @.../loglevel/info@ disables debug logs

&dagger; not yet implemented
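
For illustration, a client could build authenticated requests for these endpoints along the following lines. This is a sketch using Python's standard @urllib.request@; the base URL, token, and instance ID are placeholders, and the helper name @management_request@ is hypothetical.

```python
import urllib.request


def management_request(base_url, token, path, method="GET"):
    """Build (but do not send) a management API request.

    base_url and token are placeholders supplied by the caller; the
    Authorization header format is the one described above.
    """
    req = urllib.request.Request(base_url.rstrip("/") + path, method=method)
    req.add_header("Authorization", "Bearer " + token)
    return req


if __name__ == "__main__":
    req = management_request(
        "http://dispatcher.example",          # placeholder address
        "example-token",                      # placeholder management token
        "/arvados/v1/dispatch/instances/i-0123/drain",
        method="POST",
    )
    print(req.get_method(), req.full_url)
```

Passing the returned object to @urllib.request.urlopen@ would send it; the response body could then be decoded with @json.load@ and its @items@ list inspected.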

h2. Metrics

Metrics are available via HTTP on a configurable address/port (conventionally :9005). Request headers must include "Authorization: Bearer {management token}".

Metrics include:
* (gauge) number of existing VMs
* (gauge) total hourly price of all existing VMs
* (gauge) total VCPUs and memory in all existing VMs
* (gauge) total VCPUs and memory allocated to containers
* (gauge) number of containers running
* &dagger;(gauge) number of containers allocated to VMs but not started yet (because VMs are pending/booting)
* &dagger;(gauge) number of containers not allocated to VMs (because provider quota is reached)
* &dagger;(gauge) total hourly price of VMs, partitioned by allocation state (running, booting/idle, adminhold)
* &dagger;(summary) time elapsed between VM creation and first successful SSH connection to that VM
* &dagger;(summary) time elapsed between first successful SSH connection on a VM and ready to run a container on that VM
* &dagger;(summary) time elapsed between first shutdown attempt on a VM and its disappearance from the provider listing

&dagger; not yet implemented
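
The first few gauges are simple aggregates over the set of existing VMs. A minimal sketch of that arithmetic follows; the dictionary keys of the result and the shape of the input are hypothetical and do not reflect the dispatcher's actual metric names.

```python
def compute_gauges(vms):
    """Aggregate per-VM figures into gauge values.

    vms: list of dicts with Price, VCPUs and RAM keys, mirroring the
    InstanceTypes configuration fields (hypothetical input shape).
    """
    return {
        "vm_count": len(vms),
        "total_hourly_price": sum(vm["Price"] for vm in vms),
        "total_vcpus": sum(vm["VCPUs"] for vm in vms),
        "total_ram": sum(vm["RAM"] for vm in vms),
    }


if __name__ == "__main__":
    existing = [
        {"Price": 0.1, "VCPUs": 2, "RAM": 7782000000},
        {"Price": 0.2, "VCPUs": 4, "RAM": 15564000000},
    ]
    print(compute_gauges(existing))
```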

h2. Logs

For purposes of troubleshooting, a JSON-formatted log entry is printed on stderr when...

|                                                              |... if loglevel &ge; ...|...including timestamp and...|
|a new instance is created/ordered                             |info                    |instance type name|
|an instance appears on the provider's list of instances       |info                    |instance ID|
|an instance's boot probe succeeds                             |info                    |instance ID|
|an instance is shut down after boot timeout                   |warn                    |instance ID, &dagger;stdout/stderr/error from last boot probe attempt|
|an instance shutdown is requested                             |info                    |instance ID|
|an instance disappears from the provider's list of instances  |info                    |instance ID and previous state (booting/idle/shutdown)|
|a cloud provider API or driver error occurs                   |error                   |provider/driver's error message|
|a new container appears in the Arvados queue                  |&dagger;info            |container UUID, desired instance type name|
|a container is locked by the dispatcher                       |debug                   |container UUID|
|a crunch-run process is started on an instance                |info                    |container UUID, instance ID, crunch-run PID|
|a crunch-run process fails to start on an instance            |info                    |container UUID, instance ID, stdout/stderr/exitcode|
|a crunch-run process ends                                     |info                    |container UUID, instance ID|
|an active container's state changes to Complete or Cancelled  |info                    |container UUID, new state|
|an active container is requeued after being locked            |info                    |container UUID|
|an Arvados API error occurs                                   |warn                    |error message|

&dagger; not yet implemented

(Example log entries should be shown here)

If the dispatcher starts with a non-empty ARVADOS_DEBUG environment variable, it also prints more detailed logs about other internal state changes, using level=debug.
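
The loglevel column in the table above can be read as a threshold check. The sketch below assumes the conventional ordering debug < info < warn < error, which the table implies but does not spell out; the function name is hypothetical.

```python
# Assumed severity order, lowest first.
LEVELS = ["debug", "info", "warn", "error"]


def should_log(entry_level, threshold):
    """Return True if an entry at entry_level is printed when the
    logging threshold is set to the given level."""
    return LEVELS.index(entry_level) >= LEVELS.index(threshold)


if __name__ == "__main__":
    # With the threshold at "info", debug entries such as
    # "a container is locked by the dispatcher" are suppressed.
    print(should_log("debug", "info"), should_log("warn", "info"))
```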

h2. Internal details

h3. Worker lifecycle

<pre>
  ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
  │                                                                                                                                                    │
  │                  create() returns ID                                                                                                               │         want=drain
  │    ┌───────────────────────────────────────────────────────────────────────────┐                                      ┌────────────────────────────┼─────────────────────────────────────────┐
  │    │                                                                           ∨                                      │                            │                                         ∨
  │  ┌─────────────┐  appears in cloud list   ┌─────────┐  create() returns ID   ┌─────────┐  boot+run probes succeed   ┌──────┐  container starts   ┌─────────┐  container ends, want=drain   ┌──────────┐  instance disappears from cloud   ┌──────┐
  │  │ Nonexistent │ ───────────────────────> │ Unknown │ ─────────────────────> │ Booting │ ─────────────────────────> │      │ ──────────────────> │ Running │ ────────────────────────────> │          │ ────────────────────────────────> │ Gone │
  │  └─────────────┘                          └─────────┘                        └─────────┘                            │      │                     └─────────┘                               │          │                                   └──────┘
  │                                             │                                                                       │      │                                 idle timeout                  │          │
  │                                             │                                                                       │ Idle │ ────────────────────────────────────────────────────────────> │ Shutdown │
  │                                             │                                                                       │      │                                                               │          │
  │                                             │                                                                       │      │                                 probe timeout                 │          │
  │                                             │                                                                       │      │ ────────────────────────────────────────────────────────────> │          │
  │                                             │                                                                       └──────┘                                                               └──────────┘
  │                                             │                                                                         ∧      boot timeout                                                    ∧
  │                                             └─────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────┘
  │                                                                                                                       │
  │   container ends                                                                                                      │
  └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘</pre>
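
The diagram can also be written down as a transition table. The sketch below is a transcription of the arrows as they appear in the diagram, with state and event names taken from its labels; the table layout and the @next_state@ helper are hypothetical, and ignoring events that have no arrow is a simplification of this sketch.

```python
# (current state, event) -> next state, transcribed from the diagram.
TRANSITIONS = {
    ("Nonexistent", "appears in cloud list"): "Unknown",
    ("Nonexistent", "create() returns ID"): "Booting",
    ("Unknown", "create() returns ID"): "Booting",
    ("Unknown", "boot timeout"): "Shutdown",
    ("Booting", "boot+run probes succeed"): "Idle",
    ("Idle", "container starts"): "Running",
    ("Idle", "want=drain"): "Shutdown",
    ("Idle", "idle timeout"): "Shutdown",
    ("Idle", "probe timeout"): "Shutdown",
    ("Running", "container ends"): "Idle",
    ("Running", "container ends, want=drain"): "Shutdown",
    ("Shutdown", "instance disappears from cloud"): "Gone",
}


def next_state(state, event):
    """Return the state after an event; events with no arrow in the
    diagram leave the state unchanged (a simplification)."""
    return TRANSITIONS.get((state, event), state)


if __name__ == "__main__":
    state = "Nonexistent"
    for event in ["create() returns ID", "boot+run probes succeed",
                  "container starts", "container ends", "idle timeout",
                  "instance disappears from cloud"]:
        state = next_state(state, event)
        print(event, "->", state)
```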

h3. Scheduling policy

The container priority field determines the order in which resources are allocated.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no pending/booting/idle VM suitable for running C2,
* ...then C1 will not be started.

However, containers that run on different VM types don't necessarily start in priority order.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no idle VM suitable for running C2,
* ...and there is a pending/booting VM that will be suitable for running C2 when it comes up,
* ...and there is an idle VM suitable for running C1,
* ...then C1 will start before C2.
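
The two rules above can be expressed as a single pass over the queue in priority order. This is a minimal sketch under stated assumptions: the function name and data shapes are hypothetical, "suitable" is reduced to an exact instance type match, and the step where the dispatcher would try to create a new VM is omitted.

```python
def schedule(queue, idle, pending):
    """Return UUIDs of the containers that can start right now.

    queue: list of (uuid, priority, instance_type) tuples.
    idle / pending: dicts mapping instance type to the number of idle /
    pending-or-booting VMs of that type (hypothetical inputs).
    """
    idle, pending = dict(idle), dict(pending)  # work on copies
    started = []
    # Highest priority first; sorted() is stable for equal priorities.
    for uuid, _priority, itype in sorted(queue, key=lambda c: -c[1]):
        if idle.get(itype, 0) > 0:
            idle[itype] -= 1
            started.append(uuid)
        elif pending.get(itype, 0) > 0:
            # A VM is on its way for this container, so lower-priority
            # containers may go ahead on other idle VMs.
            pending[itype] -= 1
        else:
            # Nothing pending/booting/idle fits: lower priorities wait.
            break
    return started


if __name__ == "__main__":
    queue = [("C1", 1, "m4.large"), ("C2", 2, "m4.xlarge")]
    print(schedule(queue, idle={"m4.large": 1}, pending={}))
    print(schedule(queue, idle={"m4.large": 1}, pending={"m4.xlarge": 1}))
```

In the first call C2 has no pending, booting, or idle VM, so C1 is held back; in the second call a VM for C2 is already pending, so C1 starts first.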

h3. Special cases / synchronizing state

When first starting up, the dispatcher inspects the API server's container queue and the cloud provider's list of dispatcher-tagged cloud nodes, and restores internal state accordingly.

At startup, some containers might have state=Locked. The dispatcher can't be sure these have no corresponding crunch-run process anywhere until it establishes communication with all running instances. To avoid breaking priority order by guessing wrong, the dispatcher avoids scheduling any new containers until all such "stale-locked" containers are matched up with crunch-run processes on existing VMs (typically preparing a docker image) or all of the existing VMs have been probed successfully (meaning the locked containers aren't running anywhere and need to be rescheduled).
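
The startup condition in the previous paragraph amounts to an either/or check. The sketch below illustrates it; the function name and the three inputs are hypothetical bookkeeping structures, not the dispatcher's internal types.

```python
def may_schedule(stale_locked, found_running, vm_probed):
    """Return True once it is safe to schedule new containers.

    stale_locked: set of container UUIDs that were Locked at startup.
    found_running: set of UUIDs seen so far in crunch-run listings.
    vm_probed: dict mapping instance ID to True once that VM has been
    probed successfully (hypothetical bookkeeping).
    """
    # Case 1: every stale-locked container is matched to a process.
    if stale_locked <= found_running:
        return True
    # Case 2: every existing VM has answered, so any unmatched
    # container is not running anywhere.
    return all(vm_probed.values())


if __name__ == "__main__":
    print(may_schedule({"ctr-a"}, set(), {"i-1": False}))
    print(may_schedule({"ctr-a"}, {"ctr-a"}, {"i-1": False}))
```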

At startup, some instances might still be running containers that were started by a prior invocation, even though the (new) boot probe command fails. Such instances are left alive at least until the containers finish. After that, the usual rules apply: if boot probe succeeds before boot timeout, start scheduling containers; otherwise, shut down. This allows the operator to configure a new image along with a new boot probe command that only works on the new image, without disrupting users' work.

When a user cancels a container request with state=Locked or Running, the container priority changes to 0. On its next poll, the dispatcher notices this and kills any corresponding crunch-run processes (or, if there is no such process, just unlocks the container).

When a crunch-run process ends without finalizing its container's state, the dispatcher notices this and sets state to Cancelled.

h3. Probes

Sometimes (on the happy path) the dispatcher knows the state of each worker, whether it's idle, and which container it's running. In general, it's necessary to probe the worker node itself.

Probe:
* Check whether the SSH connection is alive; reopen if needed.
* Run the configured "ready?" command, i.e., the boot probe command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting.
* Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID)
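
The probe steps above can be sketched as follows. The SSH layer is replaced by an injected @run_command@ callable, which is a hypothetical stand-in that returns an exit code and stdout and raises @ConnectionError@ when the node is unreachable. The output of @crunch-run --list@ is kept as raw lines because its exact format is not specified here.

```python
def probe(run_command, boot_probe_command):
    """Probe one worker and summarize what was learned.

    run_command(cmd) -> (exit_code, stdout) is a hypothetical stand-in
    for executing a command over the dispatcher's SSH connection.
    """
    unreachable = {"reachable": False, "booted": False, "listing": []}
    try:
        code, _ = run_command(boot_probe_command)
        if code != 0:
            # Boot probe failed: treat the node as still booting.
            return {"reachable": True, "booted": False, "listing": []}
        code, out = run_command("crunch-run --list")
    except ConnectionError:
        return unreachable
    listing = out.splitlines() if code == 0 else []
    return {"reachable": True, "booted": True, "listing": listing}


if __name__ == "__main__":
    def fake_run(cmd):
        # Pretend the boot probe succeeds and nothing is running.
        return (0, "")

    print(probe(fake_run, "docker ps -q"))
```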

h3. Detecting dead/lame nodes

If a node has been up for N seconds without a successful probe, despite at least M attempts, it is shut down, even if it was running a container last time it was contacted successfully.
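
Expressed as a predicate, the rule needs both conditions to hold. In this sketch the parameter names are hypothetical, and N and M are passed in explicitly because the text above does not tie them to specific configuration keys.

```python
def should_shut_down(seconds_without_success, failed_attempts,
                     n_seconds, m_attempts):
    """Return True if a node counts as dead: no successful probe for
    at least n_seconds, despite at least m_attempts tries."""
    return (seconds_without_success >= n_seconds
            and failed_attempts >= m_attempts)


if __name__ == "__main__":
    # Long silence but too few attempts: keep the node for now.
    print(should_shut_down(300, 1, n_seconds=120, m_attempts=3))
    # Long silence and enough attempts: shut it down.
    print(should_shut_down(300, 5, n_seconds=120, m_attempts=3))
```

Requiring a minimum number of attempts avoids shutting down a node merely because the dispatcher itself was too busy to probe it.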

h1. Future plans / features

Per-instance-type VM images: It can be useful to run differently configured/tuned kernels/systems on different instance types, use different ops/monitoring systems on preemptible instances, etc. In addition to a system-wide default, each instance type could optionally specify an image.

Selectable VM images: When upgrading a production system, it can be useful to run a few trial containers on a new VM image before making it the default.