Dispatching containers to cloud VMs » History » Version 9

Tom Clegg, 10/11/2018 08:52 PM

h1. Dispatching containers to cloud VMs

(Draft)

h2. Component name / purpose

crunch-dispatch-cloud runs Arvados user containers on generic public cloud infrastructure by automatically creating and destroying VMs of various sizes according to demand, preparing the VMs' runtime environments, and running containers on them.

h2. Overview of operation

The dispatcher waits for containers to appear in the queue, and runs them on appropriately sized cloud VMs. When there are no idle cloud VMs of the desired size, the dispatcher brings up more VMs using the cloud provider's API. The dispatcher also shuts down VMs that have been idle longer than the configured idle timeout, or sooner if the provider starts refusing to create new VMs.

h2. Interaction with other components

Controller (backed by RailsAPI and PostgreSQL) supplies the container queue: which containers the system should be trying to execute (or cancel) at any given time.

The cloud provider's API supplies a list of VMs that exist (or are being created) at a given time and their network addresses, accepts requests to create new VMs and update instance tags, and (optionally, depending on the driver) provides the VMs' SSH server public keys.

The SSH server on each cloud VM allows the dispatcher to authenticate with a private key and execute shell commands as root.

h2. Configuration

Arvados configuration (currently a file in /etc) supplies cloud provider credentials, allowed node types, spending limits/policies, etc.

<pre><code class="yaml">
    CloudVMs:
      BootProbeCommand: "docker ps -q"
      SyncInterval: 1m    # get list of
      TimeoutIdle: 1m     # shut down if idle longer than this
      TimeoutBooting: 10m # shut down if exists longer than this without running BootProbeCommand successfully
      TimeoutProbe: 2m    # shut down if (after booting) communication fails longer than this, even if containers are running
      TimeoutShutdown: 1m # shut down again if node still exists this long after shutdown
      Driver: Amazon
      DriverParameters:   # following configs are driver dependent
        Region: us-east-1
        APITimeout: 20s
        EC2Key: abcdef
        EC2Secret: abcdefghijklmnopqrstuvwxyz
        StorageKey: abcdef
        StorageSecret: abcdefghijklmnopqrstuvwxyz
        ImageID: ami-0123456789abcdef0
        SubnetID: subnet-01234567
        SecurityGroups: sg-01234567
    Dispatch:
      StaleLockTimeout: 1m     # after restart, time to wait for workers to come up before abandoning locks from previous run
      PollInterval: 1m         # how often to get latest queue from arvados controller
      ProbeInterval: 10s       # how often to probe each instance for current status/vital signs
      MaxProbesPerSecond: 1000 # limit total probe rate for dispatch process (across all instances)
      PrivateKey: |            # SSH key able to log in as root@ worker VMs
        -----BEGIN RSA PRIVATE KEY-----
        MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ
        0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E
        GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV
        mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7
        LtarBCFaK/pD7uWll/Uj7h7D8K48nIZUrvBJJjXL8Sm4LxCNoz3Z83k8J5ZzuDRD
        gRiQe/C085mhO6VL+2fypDLwcKt1tOL8fI81MwIDAQABAoIBACR3tEnmHsDbNOav
        Oxq8cwRQh9K2yDHg8BMJgz/TZa4FIx2HEbxVIw0/iLADtJ+Z/XzGJQCIiWQuvtg6
        exoFQESt7JUWRWkSkj9JCQJUoTY9Vl7APtBpqG7rIEQzd3TvzQcagZNRQZQO6rR7
        p8sBdBSZ72lK8cJ9tM3G7Kor/VNK7KgRZFNhEWnmvEa3qMd4hzDcQ4faOn7C9NZK
        dwJAuJVVfwOLlOORYcyEkvksLaDOK2DsB/p0AaCpfSmThRbBKN5fPXYaKgUdfp3w
        70Hpp27WWymb1cgjyqSH3DY+V/kvid+5QxgxCBRq865jPLn3FFT9bWEVS/0wvJRj
        iMIRrjECgYEA4Ffv9rBJXqVXonNQbbstd2PaprJDXMUy9/UmfHL6pkq1xdBeuM7v
        yf2ocXheA8AahHtIOhtgKqwv/aRhVK0ErYtiSvIk+tXG+dAtj/1ZAKbKiFyxjkZV
        X72BH7cTlR6As5SRRfWM/HaBGEgED391gKsI5PyMdqWWdczT5KfxAksCgYEAwXYE
        ewPmV1GaR5fbh2RupoPnUJPMj36gJCnwls7sGaXDQIpdlq56zfKgrLocGXGgj+8f
        QH7FHTJQO15YCYebtsXWwB3++iG43gVlJlecPAydsap2CCshqNWC5JU5pan0QzsP
        exzNzWqfUPSbTkR2SRaN+MenZo2Y/WqScOAth7kCgYBgVoLujW9EXH5QfXJpXLq+
        jTvE38I7oVcs0bJwOLPYGzcJtlwmwn6IYAwohgbhV2pLv+EZSs42JPEK278MLKxY
        lgVkp60npgunFTWroqDIvdc1TZDVxvA8h9VeODEJlSqxczgbMcIUXBM9yRctTI+5
        7DiKlMUA4kTFW2sWwuOlFwKBgGXvrYS0FVbFJKm8lmvMu5D5x5RpjEu/yNnFT4Pn
        G/iXoz4Kqi2PWh3STl804UF24cd1k94D7hDoReZCW9kJnz67F+C67XMW+bXi2d1O
        JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN
        ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI
        pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9
        1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ
        -----END RSA PRIVATE KEY-----
    InstanceTypes:
    - Name: m4.large
      VCPUs: 2
      RAM: 7782000000
      Scratch: 32000000000
      Price: 0.1
    - Name: m4.large.spot
      Preemptible: true
      VCPUs: 2
      RAM: 7782000000
      Scratch: 32000000000
      Price: 0.1
    - Name: m4.xlarge
      VCPUs: 4
      RAM: 15564000000
      Scratch: 80000000000
      Price: 0.2
    - Name: m4.xlarge.spot
      Preemptible: true
      VCPUs: 4
      RAM: 15564000000
      Scratch: 80000000000
      Price: 0.2
    - Name: m4.2xlarge
      VCPUs: 8
      RAM: 31129000000
      Scratch: 160000000000
      Price: 0.4
    - Name: m4.2xlarge.spot
      Preemptible: true
      VCPUs: 8
      RAM: 31129000000
      Scratch: 160000000000
      Price: 0.4
</code></pre>
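
To illustrate how the InstanceTypes entries might be used, the following Python sketch picks the cheapest configured type that satisfies a container's resource requirements. The function name @choose_instance_type@, the dictionary layout, and the "cheapest type that fits" rule are assumptions made for this example; they are not taken from the dispatcher itself. The values mirror the non-preemptible entries of the sample configuration.

```python
from typing import Optional

# Non-preemptible InstanceTypes entries copied from the sample configuration.
INSTANCE_TYPES = [
    {"Name": "m4.large", "VCPUs": 2, "RAM": 7782000000,
     "Scratch": 32000000000, "Price": 0.1},
    {"Name": "m4.xlarge", "VCPUs": 4, "RAM": 15564000000,
     "Scratch": 80000000000, "Price": 0.2},
    {"Name": "m4.2xlarge", "VCPUs": 8, "RAM": 31129000000,
     "Scratch": 160000000000, "Price": 0.4},
]


def choose_instance_type(vcpus: int, ram: int, scratch: int,
                         types=INSTANCE_TYPES) -> Optional[dict]:
    """Return the cheapest type meeting all requirements, or None if none fits.

    The "cheapest that fits" rule is an assumption for illustration.
    """
    fits = [t for t in types
            if t["VCPUs"] >= vcpus and t["RAM"] >= ram and t["Scratch"] >= scratch]
    return min(fits, key=lambda t: t["Price"]) if fits else None
```

For example, a container asking for 3 VCPUs would not fit on m4.large, so this sketch would select m4.xlarge; a request larger than every configured type yields @None@.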

h2. Scheduling policy

The container priority field determines the order in which resources are allocated.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no pending/booting/idle VM suitable for running C2,
* ...then C1 will not be started.

However, containers that run on different VM types don't necessarily start in priority order.
* If container C1 has priority P1,
* ...and C2 has higher priority P2,
* ...and there is no idle VM suitable for running C2,
* ...and there is a pending/booting VM that will be suitable for running C2 when it comes up,
* ...and there is an idle VM suitable for running C1,
* ...then C1 will start before C2.
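
The two rules above can be sketched as a single scheduling pass. This is a simplified model, not the dispatcher's code: the @schedule@ function, the tuple layout for containers, and the idea of representing VMs as plain lists of instance type names are all assumptions made for illustration.

```python
from collections import Counter


def schedule(containers, idle, pending):
    """Apply the two priority rules to one scheduling pass.

    containers: list of (uuid, priority, instance_type) tuples (hypothetical layout).
    idle / pending: iterables of instance type names, one entry per VM.
    Returns the UUIDs of containers that can start now, in start order.
    """
    idle, pending = Counter(idle), Counter(pending)
    started = []
    # Consider the highest priority first.
    for uuid, _priority, itype in sorted(containers, key=lambda c: -c[1]):
        if idle[itype] > 0:
            idle[itype] -= 1       # an idle VM is suitable: start now
            started.append(uuid)
        elif pending[itype] > 0:
            pending[itype] -= 1    # reserve a pending/booting VM; lower priorities may proceed
        else:
            break                  # nothing suitable exists: lower priorities must wait
    return started
```

With C1 needing m4.large and a higher-priority C2 needing m4.xlarge, the sketch starts nothing when only an m4.large is idle and no m4.xlarge exists in any state, but starts C1 when an m4.xlarge is pending or booting.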

h2. Synchronizing state

When first starting up, the dispatcher inspects the API server’s container queue and the cloud provider’s list of dispatcher-tagged cloud nodes, and restores its internal state accordingly.

Often, at startup there will be some containers with state=Locked. To avoid breaking priority order, the dispatcher won't schedule any new containers until all such locked containers are matched up with crunch-run processes on existing VMs (typically preparing a docker image) or all of the existing VMs have been probed successfully (meaning the locked containers aren't running anywhere and need to be rescheduled).

When a user cancels a container request with state=Locked or Running, the container priority changes to 0. On its next poll, the dispatcher notices this and kills any corresponding crunch-run processes (or, if there is no such process, just unlocks the container).

When a crunch-run process ends without finalizing its container's state, the dispatcher notices this and sets the state to Cancelled.
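
The cancellation and cleanup rules above amount to a small decision table, sketched below. The @next_action@ function, the string encoding of the crunch-run process status, and the action names are hypothetical. Checking cancellation before the "process exited" case is also an assumption, since the text does not say which rule takes precedence.

```python
def next_action(state, priority, process):
    """Decide what to do with one container on a poll (illustrative only).

    state: the container's state, e.g. "Locked" or "Running".
    priority: 0 means the user cancelled the request.
    process: "running", "exited", or None for the crunch-run process
             (hypothetical encoding).
    """
    if state not in ("Locked", "Running"):
        return "none"
    if priority == 0:
        # Cancelled: kill crunch-run if it is running, otherwise just unlock.
        return "kill" if process == "running" else "unlock"
    if process == "exited":
        # crunch-run ended without finalizing the container's state.
        return "cancel"
    return "none"
```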

h2. Operator view

The management status endpoint provides:
* list of cloud VMs, each with
** provider's instance ID
** hourly price (from configuration file)
** instance type (from configuration file)
** instance type (from provider's menu)
** UUID of the current / most recent container attempted (if known)
** time last container finished (or boot time, if nothing has run yet)
* list of queued/running containers, each with
** UUID
** state (queued/locked/running/complete/cancelled)
** desired instance type
** time appeared in queue
** time started (if started)

The metrics endpoint tracks:
* (each VM) time elapsed between VM creation and first successful SSH connection
* (each VM) time elapsed between first successful SSH connection and ready to run a container
* total hourly price of all existing VMs
* total VCPUs and memory allocated to containers
* number of containers running
* number of containers allocated to VMs but not started yet (because VMs are pending/booting)
* number of containers not allocated to VMs (because provider quota is reached)

h2. SSH keys

Each worker node has a public key in /root/.ssh/authorized_keys. The dispatcher has the corresponding private key.

(Future) The dispatcher generates its own keys and installs its public key on new VMs using cloud provider bootstrapping/metadata features.

h3. Probes

Sometimes (on the happy path) the dispatcher knows the state of each worker, whether it's idle, and which container it's running. In general, though, it's necessary to probe the worker node itself.

Probe:
* Check whether the SSH connection is alive; reopen if needed.
* Run the configured "ready?" command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting.
* Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID)

After a successful probe, the dispatcher should tag the cloud node record with the dispatcher's ID and probe timestamp. (In case the tagging API fails, it should remember the probe time in memory too.)
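
The probe sequence above could be sketched as follows. The @run@ callable is a hypothetical stand-in for executing a command over the worker's SSH connection, and the status names are invented for this example. The "error" result for a failing "crunch-run --list" is an assumption, since the text does not say how that case is handled. The output of "crunch-run --list" is kept as opaque lines because its exact format is not specified here.

```python
def probe(run, boot_probe_command):
    """Run one probe and classify the worker (illustrative only).

    run: callable taking a shell command string and returning (exit_code, stdout).
         It is expected to raise ConnectionError if the SSH connection
         cannot be opened or reopened (hypothetical contract).
    Returns (status, entries); status is "unreachable", "booting", "error" or "ready".
    """
    try:
        code, _ = run(boot_probe_command)
        if code != 0:
            return "booting", []   # "ready?" command failed: still booting
        code, out = run("crunch-run --list")
    except ConnectionError:
        return "unreachable", []
    if code != 0:
        return "error", []         # assumption: treat a failed listing as a probe error
    # Keep each non-empty output line as an opaque entry.
    return "ready", [line for line in out.splitlines() if line.strip()]
```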

h3. Detecting dead/lame nodes

If a node has been up for N seconds without a successful probe, despite at least M attempts, shut it down. (M handles the case where the dispatcher restarts during a time when the "update tags" operation isn't effective, e.g., provider is rate-limiting API calls.)
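
The rule above reduces to a simple predicate, sketched here. The function and parameter names are hypothetical, and N and M are passed in as arguments because the text does not fix their values.

```python
def should_shut_down(uptime_seconds, probe_attempts, last_success,
                     n_seconds, m_attempts):
    """True if the node has been up for at least N seconds with no successful
    probe, despite at least M probe attempts.

    last_success: time of the last successful probe, or None if there has been none.
    """
    return (last_success is None
            and uptime_seconds >= n_seconds
            and probe_attempts >= m_attempts)
```

Requiring both conditions means a node is not shut down merely because the dispatcher has just restarted and has not yet had a chance to probe it M times.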

h3. Multiple dispatchers

Not supported in the initial version.