h1. Dispatching containers to cloud VMs
(Draft. In fact, this might not be needed at all: for example, we might dispatch to Kubernetes and find/make a Kubernetes auto-scaler instead.)
h2. Background
This is about dispatching to on-demand cloud nodes like Amazon EC2 instances.
Not to be confused with dispatching to a cloud-based container service like Amazon Elastic Container Service, Azure Batch or Google Kubernetes Engine.
In crunch1, and the early days of crunch2, we made something work with arvados-nodemanager and SLURM.
One of the goals of crunch2 is eliminating all uses of SLURM, with the exception of crunch-dispatch-slurm, whose purpose is to dispatch Arvados containers to a SLURM cluster that already exists for non-Arvados tasks.
This doc doesn’t describe a sequence of development tasks or a migration plan. It describes the end state: how dispatch will work when all implementation tasks and migrations are complete.
h2. Relevant components
API server (backed by PostgreSQL) is the source of truth about which containers the system should be trying to execute (or cancel) at any given time.
Arvados configuration (currently via file in /etc, in future via consul/etcd/similar) is the source of truth about cloud provider credentials, allowed node types, spending limits/policies, etc.
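
For illustration, the dispatcher-relevant part of that configuration might deserialize into a struct along these lines. This is only a sketch: the field names below are assumptions, not the actual Arvados configuration schema.

<pre><code class="go">
package dispatcher

import "time"

// CloudDispatchConfig sketches the configuration the dispatcher needs.
// Field names are illustrative assumptions, not real Arvados config keys.
type CloudDispatchConfig struct {
	Provider       string            // e.g. "ec2" or "azure"
	Credentials    map[string]string // provider-specific credentials
	InstanceTypes  []InstanceType    // allowed node types
	MaxInstances   int               // spending/size limit
	ProbeInterval  time.Duration
	BootTimeout    time.Duration
	PrivateKeyFile string            // SSH key used to reach worker nodes
}

// InstanceType describes one allowed cloud node type.
type InstanceType struct {
	Name         string
	VCPUs        int
	RAMBytes     int64
	ScratchBytes int64
	PricePerHour float64
}
</code></pre>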
crunch-dispatch-cloud-node (a new component) arranges for queued containers to run on worker nodes, brings up new worker nodes in order to run the queue faster, and shuts down idle worker nodes.
h2. Overview of crunch-dispatch-cloud-node operation
When first starting up, inspect API server’s container queue and the cloud provider’s list of dispatcher-tagged cloud nodes, and restore internal state accordingly
When API server puts a container in Queued state, lock it, select or create a cloud node to run it on, and start a crunch-run process there to run it
When API server says a container (locked or dispatched by this dispatcher) should be cancelled, ensure the actual container and its crunch-run supervisor get shut down and the relevant node becomes idle
When a crunch-run invocation (dispatched by this dispatcher) exits without updating the container record on the API server -- or can’t run at all -- clean up accordingly
Invariant: every dispatcher-tagged cloud node is either needed by this dispatcher, or should be shut down (so if there are multiple dispatchers, they must use different tags).
h2. Mechanisms
h3. Interface between dispatcher and operator
The management status endpoint provides a list of the dispatcher's cloud VMs, each with its cloud instance ID, the UUID of the current / most recent container it attempted (if known), its hourly price, and its idle time. This snapshot info is not saved to PostgreSQL.
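
One entry in that list might look roughly like the struct below; this is a sketch with assumed field names, held in memory only.

<pre><code class="go">
package dispatcher

import "time"

// InstanceStatus is a hypothetical shape for one VM in the management
// status response; it is never written to PostgreSQL.
type InstanceStatus struct {
	CloudInstanceID string    `json:"cloud_instance_id"`
	ContainerUUID   string    `json:"container_uuid"` // current / most recent attempt, if known
	PricePerHour    float64   `json:"price_per_hour"`
	IdleSince       time.Time `json:"idle_since"`
	LastProbe       time.Time `json:"last_probe"`
}
</code></pre>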
h3. Interface between dispatcher and cloud provider
"VMProvider" interface has the few cloud instance operations we need (list instances+tags, create instance with tags, update instance tags, destroy instance).
h3. Interface between dispatcher and worker node
Each worker node has a public key in /root/.ssh/authorized_keys. Dispatcher has the corresponding private key.
Dispatcher uses the Go SSH client library to connect to worker nodes.
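
A minimal sketch of opening that connection with golang.org/x/crypto/ssh, assuming a hypothetical key path; a production dispatcher would verify host keys rather than skipping the check as this sketch does.

<pre><code class="go">
package dispatcher

import (
	"os"

	"golang.org/x/crypto/ssh"
)

// dialWorker opens an SSH connection to a worker node as root, using the
// dispatcher's private key. The key path is an assumption for illustration.
func dialWorker(addr string) (*ssh.Client, error) {
	keyPEM, err := os.ReadFile("/etc/arvados/dispatcher/ssh_key") // assumed path
	if err != nil {
		return nil, err
	}
	signer, err := ssh.ParsePrivateKey(keyPEM)
	if err != nil {
		return nil, err
	}
	config := &ssh.ClientConfig{
		User: "root",
		Auth: []ssh.AuthMethod{ssh.PublicKeys(signer)},
		// Host key verification is elided in this sketch.
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	}
	return ssh.Dial("tcp", addr+":22", config)
}
</code></pre>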
h3. Probe operation
On the happy path, the dispatcher already knows the state of each worker: whether it's idle, and which container it's running. In general, though, it's necessary to probe the worker node itself.
Probe (a Go sketch follows this list):
* Check whether the SSH connection is alive; reopen it if needed.
* Run the configured "ready?" command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting.
* Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID).
Dispatcher, after a successful probe, should tag the cloud node record with the dispatcher's ID and probe timestamp. (In case the tagging API fails, remember the probe time in memory too.)
h3. Dead/lame nodes
If a node has been up for N seconds without a successful probe, despite at least M attempts, shut it down. (M handles the case where the dispatcher restarts during a time when the "update tags" operation isn't effective, e.g., provider is rate-limiting API calls.)
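
The rule might reduce to a small predicate like the one below, where bootTimeout stands in for N and minAttempts for M; the names are assumptions.

<pre><code class="go">
package dispatcher

import "time"

// shouldShutDown sketches the lame-node rule: a node that has been up for
// longer than bootTimeout (N) without a successful probe, despite at least
// minAttempts (M) tries, gets shut down.
func shouldShutDown(upSince, lastGoodProbe time.Time, attempts int, bootTimeout time.Duration, minAttempts int) bool {
	noRecentSuccess := lastGoodProbe.IsZero() || time.Since(lastGoodProbe) > bootTimeout
	return noRecentSuccess && time.Since(upSince) > bootTimeout && attempts >= minAttempts
}
</code></pre>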
h3. Dead dispatchers
Every cluster should run multiple dispatchers. If one dies and stays down, the others must notice and take over (or shut down) its nodes -- otherwise those worker nodes will run forever.
Each dispatcher, when inspecting the list of cloud nodes, should try to claim (or, failing that, destroy) any node that belongs to a different dispatcher and hasn't completed a probe for N seconds. (This timeout might be longer than the "lame node" timeout.)
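
A sketch of that takeover check, reusing the VMProvider and Instance sketches from above; the tag names "dispatcher" and "probe_time" are assumptions.

<pre><code class="go">
package dispatcher

import "time"

// claimOrDestroy sketches the takeover rule: a node tagged for another
// dispatcher whose last probe is older than staleAfter gets claimed by
// re-tagging it, or destroyed if re-tagging fails.
func claimOrDestroy(p VMProvider, inst Instance, myID string, staleAfter time.Duration) {
	if inst.Tags["dispatcher"] == myID {
		return // already ours
	}
	if last, err := time.Parse(time.RFC3339, inst.Tags["probe_time"]); err == nil && time.Since(last) < staleAfter {
		return // the other dispatcher is still probing this node
	}
	tags := inst.Tags
	tags["dispatcher"] = myID
	if err := p.SetTags(inst.ID, tags); err != nil {
		p.Destroy(inst.ID) // couldn't claim it; shut it down instead
	}
}
</code></pre>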
h2. Dispatcher internal flow
h3. Container lifecycle
# Dispatcher gets the list of pending/running containers from the API server (the whole flow is sketched in Go after this list)
# Each container record either starts a new goroutine or sends the update on a channel to the existing goroutine
# On start, the container goroutine calls "allocate node" on the scheduler with the container UUID, priority, instance type, and container state channel
# "Allocate node" blocks until it can return a node object allocated to that container
# The node must be either idle, or busy running that specific container
## If the container is "locked" and the node is idle, call "crunch-run" to start the container
## If the container is "locked" and the node is busy, do nothing and continue monitoring
## If the container is "running" and the node is busy, do nothing and continue monitoring
## If the container is "running" and the node is idle, cancel the container (this means the original container run has been lost)
## If the container is "complete" or "cancelled", tell the scheduler to release the node
# Done
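
A Go sketch of that per-container goroutine; the Scheduler and Node interfaces, state names, and function signatures below are assumptions modeled on the list above, not the final API.

<pre><code class="go">
package dispatcher

// ContainerUpdate carries one state update for a container as received
// from the API server (field names are assumptions).
type ContainerUpdate struct {
	UUID         string
	State        string // "Locked", "Running", "Complete", "Cancelled"
	Priority     int
	InstanceType string
}

// Node and Scheduler are hypothetical interfaces modeled on the list above.
type Node interface {
	Idle() bool
	StartCrunchRun(containerUUID string) error
}

type Scheduler interface {
	// Allocate blocks until it can return a node that is either idle or
	// already busy running this container.
	Allocate(uuid string, priority int, instanceType string, updates <-chan ContainerUpdate) Node
	Release(Node)
}

// runContainer is the per-container goroutine sketched by the list above.
func runContainer(sched Scheduler, cancel func(uuid string), first ContainerUpdate, updates <-chan ContainerUpdate) {
	node := sched.Allocate(first.UUID, first.Priority, first.InstanceType, updates)
	for ctr := first; ; ctr = <-updates {
		switch {
		case ctr.State == "Locked" && node.Idle():
			node.StartCrunchRun(ctr.UUID) // start the container via crunch-run
		case ctr.State == "Running" && node.Idle():
			cancel(ctr.UUID) // the original container run has been lost
		case ctr.State == "Complete" || ctr.State == "Cancelled":
			sched.Release(node) // hand the node back to the scheduler
			return
		default:
			// "Locked"+busy or "Running"+busy: nothing to do, keep monitoring
		}
	}
}
</code></pre>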
h3. Scheduler lifecycle
# Periodically poll all cloud nodes to get their current state (one of: booting, idle, busy, shutdown), their allocation status (free/locked), and which container they are running
# When an allocation request comes in, add it to the request list/heap (sorted by priority) and monitor the container state channel for changes in priority
# If the request is at the top of the request list, immediately try to schedule it.  Otherwise, periodically run the scheduler on the request list starting from the top.
## Check whether there is a free node that is already busy running that container; if so, return it.
## Check for a free, idle node of the appropriate instance type; if found, lock it and return it
## If no node is found, start a "create node" goroutine and go on to the next request
## Keep a count of nodes in the "booting" state; create new nodes when the number of unallocated requests exceeds the number of nodes already booting (see the sketch after this list)
# When the scheduler gets notice that a new idle node is available, check the request list for pending requests for that instance type and allocate the node (otherwise, the node remains free / idle)
## Similarly, when the scheduler gets a "release node" call, mark the node as free / idle, then check the request list
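
A sketch of that node-creation rule: per instance type, the scheduler only issues new "create node" requests when unallocated requests outnumber nodes that are already booting. Type and field names are assumptions.

<pre><code class="go">
package dispatcher

// schedulerState sketches the bookkeeping behind the node-creation rule.
type schedulerState struct {
	unallocatedRequests map[string]int // instance type -> requests with no node yet
	bootingNodes        map[string]int // instance type -> nodes still booting
}

// nodesToCreate returns how many additional "create node" requests to issue
// for the given instance type right now.
func (s *schedulerState) nodesToCreate(instanceType string) int {
	n := s.unallocatedRequests[instanceType] - s.bootingNodes[instanceType]
	if n < 0 {
		return 0
	}
	return n
}
</code></pre>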
h3. Node startup
# Generate a random node id (the whole sequence is sketched in Go after this list)
# Start a "create node" request, tagged with the custom node id / instance type
# Send the create request to the cloud
# Done
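
A sketch of those steps using the VMProvider and Instance sketches from earlier; the tag names are assumptions.

<pre><code class="go">
package dispatcher

import (
	"crypto/rand"
	"fmt"
)

// createNode sketches node startup: generate a random node id, tag the
// create request with it and the instance type, and send it to the cloud.
func createNode(p VMProvider, dispatcherID, instanceType string) (Instance, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return Instance{}, err
	}
	tags := map[string]string{
		"dispatcher":    dispatcherID,
		"node_id":       fmt.Sprintf("node-%x", buf), // random node id
		"instance_type": instanceType,
	}
	return p.Create(instanceType, tags)
}
</code></pre>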
h3. Node lifecycle
# Periodically poll the cloud node list
# For each node in the cloud node list, start a monitor goroutine (if one is not already running)
# Get the node's IP address
# Establish an SSH connection
# Once connected, run the "status" script to determine:
## node state (booting, idle, busy, shutdown)
## whether the node is running a container
## timestamp of the last state transition
## node state is stored in small files in /var/lib/crunch protected by flock (?)
# Once the node is ready, send notice of a new idle node to the scheduler
# The node goroutine continues to monitor remote node state and sends state updates to the scheduler
# If the node is idle for several consecutive status polling / scheduling cycles, put it into the "shutdown" state, so it can no longer accept work (see the sketch after this list)
# Send a destroy request
# Wait for the node to disappear from the cloud node list
# Done
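
A sketch of that idle-shutdown decision: after several consecutive idle polls, the monitor marks the node for shutdown, after which the destroy request is sent. The threshold and state names are assumptions.

<pre><code class="go">
package dispatcher

// nodeMonitor sketches the per-node state tracked by the monitor goroutine.
type nodeMonitor struct {
	state     string // "booting", "idle", "busy", "shutdown"
	idlePolls int    // consecutive polls in the "idle" state
}

// observe records one poll result and moves the node to "shutdown" once it
// has been idle for maxIdlePolls consecutive polls.
func (m *nodeMonitor) observe(state string, maxIdlePolls int) {
	if state == "idle" {
		m.idlePolls++
	} else {
		m.idlePolls = 0
	}
	m.state = state
	if m.idlePolls >= maxIdlePolls {
		m.state = "shutdown" // can no longer accept work; destroy request follows
	}
}
</code></pre>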