Dispatching containers to cloud VMs » History » Version 2

Peter Amstutz, 08/03/2018 02:31 PM

1 1 Tom Clegg
h1. Dispatching containers to cloud VMs

(Draft. This design might not be needed at all: for example, we might instead dispatch to Kubernetes and find or write a Kubernetes auto-scaler.)

h2. Background

This document is about dispatching to on-demand cloud nodes such as Amazon EC2 instances. It is not about dispatching to a cloud-based container service such as Amazon Elastic Container Service, Azure Batch, or Google Kubernetes Engine.

In crunch1, and in the early days of crunch2, we made this work with arvados-nodemanager and SLURM.

One of the goals of crunch2 is to eliminate all uses of SLURM except crunch-dispatch-slurm, whose purpose is to dispatch Arvados containers to a SLURM cluster that already exists for non-Arvados tasks.

This doc doesn’t describe a sequence of development tasks or a migration plan. It describes the end state: how dispatch will work when all implementation tasks and migrations are complete.

h2. Relevant components

The API server (backed by PostgreSQL) is the source of truth about which containers the system should be trying to execute (or cancel) at any given time.

The Arvados configuration (currently a file in /etc, in future consul/etcd/similar) is the source of truth about cloud provider credentials, allowed node types, spending limits/policies, etc.

crunch-dispatch-cloud-node (a new component) arranges for queued containers to run on worker nodes, brings up new worker nodes in order to run the queue faster, and shuts down idle worker nodes.

h2. Overview of crunch-dispatch-cloud-node operation

* When first starting up, inspect the API server’s container queue and the cloud provider’s list of dispatcher-tagged cloud nodes, and restore internal state accordingly.
* When the API server puts a container in the Queued state, lock it, select or create a cloud node to run it on, and start a crunch-run process there to run it.
* When the API server says a container (locked or dispatched by this dispatcher) should be cancelled, ensure the actual container and its crunch-run supervisor get shut down and the relevant node becomes idle.
* When a crunch-run invocation (dispatched by this dispatcher) exits without updating the container record on the API server, or can’t run at all, clean up accordingly.

Invariant: every dispatcher-tagged cloud node is either needed by this dispatcher or should be shut down. If there are multiple dispatchers, they must therefore use different tags.
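The behaviour above can be pictured as one planning pass over the queue and the tagged nodes. The following is only a minimal sketch, not the planned implementation: the @Node@ record, the action names, and the @plan@ function are all hypothetical, and it shuts idle nodes down immediately, whereas a real dispatcher would presumably wait for an idle timeout.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

Action = Tuple[str, Optional[str], Optional[str]]  # (action, container UUID, node ID)


@dataclass
class Node:
    node_id: str
    tag: str                          # dispatcher tag stored on the cloud node
    container: Optional[str] = None   # container currently assigned, if any


def plan(queued: List[str], cancelled: Set[str],
         nodes: List[Node], my_tag: str) -> List[Action]:
    """Decide what one dispatcher pass should do (hypothetical helper)."""
    actions: List[Action] = []
    idle: List[str] = []
    for node in nodes:
        if node.tag != my_tag:
            continue  # belongs to another dispatcher: never touch it
        if node.container is None:
            idle.append(node.node_id)
        elif node.container in cancelled:
            # Stop the container and its crunch-run; the node becomes idle later.
            actions.append(("stop", node.container, node.node_id))
    for uuid in queued:
        if idle:
            actions.append(("run", uuid, idle.pop(0)))
        else:
            actions.append(("create_node_and_run", uuid, None))
    # Invariant: a tagged node that is not needed should be shut down.
    for node_id in idle:
        actions.append(("shutdown", None, node_id))
    return actions
```

For example, with one queued container, one cancelled container, and two idle nodes carrying this dispatcher's tag, the pass stops the cancelled container, runs the queued one on the first idle node, and shuts down the other idle node. Nodes carrying a different tag are ignored.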

h2. TBD

Mechanism for running commands on worker nodes: SSH?

h1. "crunch-dispatch-cloud" (PA)

Node manager generates a wishlist based on the container queue. Compute nodes run crunch-dispatch-local or a similar service, which asks the API server for work and then runs it.

Advantages:

* Complete control over scheduling decisions / priority

Disadvantages:

* Additional load on the API server (but probably not that much)
* Need a new scheme for nodes to report their status, so that node manager knows whether they are busy or idle. Node manager has to be able to put nodes in the equivalent of a "draining" state to ensure they don't get shut down while doing work. (We can use the "nodes" table for this.)
* Need to be able to detect node failure.

h3. Starting up

# Node manager looks at pending containers to get a "wishlist".
# Nodes spin up the way they do now. However, instead of registering with SLURM, they start crunch-dispatch-local (c-d-l).
# The node ping token should have a corresponding API token, which the dispatcher uses to talk to the API server.
# C-d-l pings the API server to ask for work. The ping operation puts the node in either the "busy" state (if work is returned) or the "idle" state.
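As an illustration of the first step, a wishlist could be derived by picking, for each pending container, the smallest node type that fits it. This is a sketch under assumptions: the @wishlist@ function is hypothetical, containers are reduced to a single RAM requirement, and the node type names and sizes are made up.

```python
from typing import Dict, List


def wishlist(pending_ram: List[int], node_types: Dict[str, int]) -> List[str]:
    """Return one wanted node type per pending container (hypothetical helper).

    pending_ram: RAM needed by each pending container, in GiB.
    node_types:  available node type name -> RAM in GiB.
    Containers that fit no node type are skipped.
    """
    by_size = sorted(node_types.items(), key=lambda item: item[1])
    wanted: List[str] = []
    for ram in pending_ram:
        for name, capacity in by_size:
            if capacity >= ram:
                wanted.append(name)  # smallest type that fits
                break
    return wanted
```

A real wishlist would also have to account for nodes that are already running or booting, and for spending limits.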

h3. Running containers

Assumption: a node runs only one container at a time.

# Add an "I am idle, give me work" API, which locks and returns the next container that is appropriate for the node, or marks the node as "idle" if no work is available.
# The node record records which container the node is supposed to be running (this can be part of the "Lock" call, based on the per-node API token).
# C-d-l makes an API call to the nodes table to say it is "busy".
# C-d-l calls crunch-run to run the container.
# C-d-l must continue to ping that it is "busy" every X seconds.
# When the container finishes, c-d-l pings that it is "idle".
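The "I am idle, give me work" call could behave roughly like the in-memory model below. This is a sketch, not the Arvados API: @FakeAPI@, @give_me_work@, and the record fields are hypothetical names, the queue is assumed to be in priority order already, and "appropriate for the node" is reduced to a RAM check.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Container:
    uuid: str
    ram_needed: int
    state: str = "Queued"


@dataclass
class NodeRecord:
    ram: int
    state: str = "idle"
    container: Optional[str] = None


class FakeAPI:
    """In-memory stand-in for the API server (illustration only)."""

    def __init__(self, queue: List[Container], nodes: Dict[str, NodeRecord]):
        self.queue = queue   # assumed to be sorted by priority
        self.nodes = nodes   # keyed by per-node API token

    def give_me_work(self, token: str) -> Optional[Container]:
        node = self.nodes[token]
        for container in self.queue:
            if container.state == "Queued" and container.ram_needed <= node.ram:
                # Lock the container and record it on the node in one step.
                container.state = "Locked"
                node.state, node.container = "busy", container.uuid
                return container
        # No suitable work: mark the node idle.
        node.state, node.container = "idle", None
        return None
```

Recording the container on the node record inside the same call means the API server always knows which container to unlock or cancel if the node later fails.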

h3. Shutting down

# When node manager decides a node is ready for shutdown, it makes an API call on the node record to mark it "draining".
# C-d-l pings "I am idle" on a "draining" record. This puts the record in the "drained" state, and c-d-l does not get any new work.
# Node manager sees that the node is "drained" and can proceed with destroying the cloud node.
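The shutdown handshake is a small state machine over the node record. A minimal sketch follows; the function names are hypothetical, and the states are the ones named in the steps above.

```python
def request_drain(state: str) -> str:
    """Node manager asks for shutdown: the node must finish its work first."""
    return state if state == "drained" else "draining"


def idle_ping(state: str) -> str:
    """C-d-l reports that it has no container running."""
    if state in ("draining", "drained"):
        return "drained"  # no new work is handed out
    return "idle"


def can_destroy(state: str) -> bool:
    """Node manager may destroy the cloud node only once it is drained."""
    return state == "drained"
```

A busy node that is asked to drain stays in "draining" until its next idle ping, so it is never destroyed while a container is still running.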

h3. Handling failure

# If a node enters a failure state and there is a container associated with it, the container should either be unlocked (if it is in the locked state) or cancelled (if it is in the running state).
# The API server should have a background process which looks for nodes that haven't pinged recently and puts them into the failed state.
# A node can also put itself into the failed state with an API call.
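The background process could be a periodic sweep like the one sketched below. Everything here is an assumption for illustration: the @sweep@ function, the record fields, and the timeout are hypothetical, unlocking is modelled as returning the container to "Queued", and cancelling is modelled as setting the state to "Cancelled".

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class NodeRecord:
    last_ping: float                 # time of the most recent ping, in seconds
    state: str = "busy"
    container: Optional[str] = None


def sweep(nodes: Dict[str, NodeRecord], containers: Dict[str, str],
          now: float, timeout: float) -> None:
    """Fail nodes with stale pings and release their containers in place.

    containers maps container UUID -> state.
    """
    for node in nodes.values():
        if node.state != "failed" and now - node.last_ping > timeout:
            node.state = "failed"
        if node.state == "failed" and node.container is not None:
            state = containers.get(node.container)
            if state == "Locked":
                containers[node.container] = "Queued"      # unlock
            elif state == "Running":
                containers[node.container] = "Cancelled"   # cancel
            node.container = None
```

Because the sweep also handles records that are already in the failed state, a node that fails itself through the API gets the same cleanup on the next pass.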