
h1. Dispatching containers to cloud VMs 

 (Draft. In fact, this might not be needed at all. For example, we might dispatch to Kubernetes and find/make a Kubernetes auto-scaler instead.) 

 h2. Background 

 This is about dispatching to on-demand cloud nodes like Amazon EC2 instances. 

 Not to be confused with dispatching to a cloud-based container service like Amazon Elastic Container Service, Azure Batch or Google Kubernetes Engine. 

 In crunch1, and the early days of crunch2, we made something work with arvados-nodemanager and SLURM. 

 One of the goals of crunch2 is eliminating all uses of SLURM with the exception of crunch-dispatch-slurm, whose purpose is to dispatch arvados containers to a SLURM cluster that already exists for non-Arvados tasks. 

 This doc doesn’t describe a sequence of development tasks or a migration plan. It describes the end state: how dispatch will work when all implementation tasks and migrations are complete. 

 h2. Relevant components 

 API server (backed by PostgreSQL) is the source of truth about which containers the system should be trying to execute (or cancel) at any given time. 

 Arvados configuration (currently via file in /etc, in future via consul/etcd/similar) is the source of truth about cloud provider credentials, allowed node types, spending limits/policies, etc. 

 crunch-dispatch-cloud-node (a new component) arranges for queued containers to run on worker nodes, brings up new worker nodes in order to run the queue faster, and shuts down idle worker nodes. 

 h2. Overview of crunch-dispatch-cloud-node operation 

* When first starting up, inspect API server's container queue and the cloud provider's list of dispatcher-tagged cloud nodes, and restore internal state accordingly
* When API server puts a container in Queued state, lock it, select or create a cloud node to run it on, and start a crunch-run process there to run it
* When API server says a container (locked or dispatched by this dispatcher) should be cancelled, ensure the actual container and its crunch-run supervisor get shut down and the relevant node becomes idle
* When a crunch-run invocation (dispatched by this dispatcher) exits without updating the container record on the API server -- or can't run at all -- clean up accordingly

 Invariant: every dispatcher-tagged cloud node is either needed by this dispatcher, or should be shut down (so if there are multiple dispatchers, they must use different tags). 
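
As a rough illustration of how these responsibilities and the invariant could fit together, here is a minimal Go sketch of such a reconcile loop. All names here (ContainerQueue, NodePool, Reconcile, the tick interval) are made up for this example and are not the actual implementation.

<pre><code class="go">
// Hypothetical sketch of the dispatcher's core reconcile loop. Each pass
// compares the container queue against the dispatcher-tagged cloud nodes,
// creates nodes when containers are waiting, and destroys idle nodes that
// are no longer needed (the invariant stated above).
package dispatch

import "time"

type Container struct {
	UUID  string
	State string // "Queued", "Locked", "Running", ...
}

type WorkerNode struct {
	InstanceID string
	Idle       bool
}

type ContainerQueue interface{ Containers() []Container }

type NodePool interface {
	Nodes() []WorkerNode // cloud nodes carrying this dispatcher's tag
	Create() error
	Destroy(instanceID string) error
}

func Reconcile(q ContainerQueue, pool NodePool) {
	for range time.Tick(10 * time.Second) {
		queued := 0
		for _, ctr := range q.Containers() {
			if ctr.State == "Queued" {
				queued++
			}
		}
		idle := 0
		for _, n := range pool.Nodes() {
			if n.Idle {
				idle++
			}
		}
		// Bring up new worker nodes in order to run the queue faster.
		for ; queued > idle; idle++ {
			pool.Create()
		}
		// Shut down idle worker nodes that are not needed.
		for _, n := range pool.Nodes() {
			if n.Idle && queued == 0 {
				pool.Destroy(n.InstanceID)
			}
		}
	}
}
</code></pre>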

 h2. Mechanisms TBD 

h3. Interface between dispatcher and operator

Management status endpoint provides a list of its cloud VMs, each with cloud instance ID, UUID of the current / most recent container it attempted (if known), hourly price, and idle time. This snapshot info is not saved to PostgreSQL.
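
For illustration only, the snapshot described above could be served roughly like this; the struct fields and handler shape are assumptions, not the real management API.

<pre><code class="go">
// Hypothetical shape of the management status response. Nothing here is
// persisted to PostgreSQL; the handler serves an in-memory snapshot.
package dispatch

import (
	"encoding/json"
	"net/http"
	"time"
)

type nodeStatus struct {
	InstanceID    string        `json:"instance_id"`
	ContainerUUID string        `json:"container_uuid,omitempty"` // current/most recent attempt, if known
	PricePerHour  float64       `json:"price_per_hour"`
	IdleTime      time.Duration `json:"idle_time"`
}

func statusHandler(snapshot []nodeStatus) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]interface{}{"nodes": snapshot})
	}
}
</code></pre>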

h3. Interface between dispatcher and cloud provider

"VMProvider" interface has the few cloud instance operations we need (list instances+tags, create instance with tags, update instance tags, destroy instance).

h3. Interface between dispatcher and worker node

Each worker node has a public key in /root/.ssh/authorized_keys. The dispatcher has the corresponding private key.

Dispatcher uses the Go SSH client library to connect to worker nodes.
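
A minimal example of that mechanism using the golang.org/x/crypto/ssh package; the root login, key file path, and host-key policy here are illustrative assumptions.

<pre><code class="go">
// Sketch of running one command on a worker node over SSH.
package dispatch

import (
	"os"

	"golang.org/x/crypto/ssh"
)

func runOnWorker(addr, keyFile, cmd string) ([]byte, error) {
	pem, err := os.ReadFile(keyFile) // dispatcher's private key
	if err != nil {
		return nil, err
	}
	signer, err := ssh.ParsePrivateKey(pem)
	if err != nil {
		return nil, err
	}
	client, err := ssh.Dial("tcp", addr, &ssh.ClientConfig{
		User:            "root", // matching /root/.ssh/authorized_keys on the worker
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // real code should verify host keys
	})
	if err != nil {
		return nil, err
	}
	defer client.Close()
	session, err := client.NewSession()
	if err != nil {
		return nil, err
	}
	defer session.Close()
	return session.CombinedOutput(cmd)
}
</code></pre>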

h3. Probe operation

Sometimes (on the happy path) the dispatcher knows the state of each worker, whether it's idle, and which container it's running. In general, it's necessary to probe the worker node itself.

Probe:
* Check whether the SSH connection is alive; reopen if needed.
* Run the configured "ready?" command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting.
* Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID).
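
A sketch of a single probe pass following those steps. It reuses the hypothetical runOnWorker helper from the SSH sketch above, and the parsing of "crunch-run --list" output is a simplified assumption.

<pre><code class="go">
// Sketch of probing one worker node.
package dispatch

import "strings"

const readyCommand = "grep /encrypted-tmp /etc/mtab" // configured "ready?" check (example)

func probe(addr, keyFile string) (booting bool, running []string, err error) {
	// Step 1 (reusing/reopening a cached SSH connection) is elided; runOnWorker dials fresh.
	if _, err := runOnWorker(addr, keyFile, readyCommand); err != nil {
		// "ready?" command failed: conclude the node is still booting.
		return true, nil, nil
	}
	out, err := runOnWorker(addr, keyFile, "crunch-run --list")
	if err != nil {
		return false, nil, err
	}
	// Each non-empty output line is taken to identify one crunch-run supervisor (pid + container UUID).
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if line != "" {
			running = append(running, line)
		}
	}
	return false, running, nil
}
</code></pre>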

Dispatcher, after a successful probe, should tag the cloud node record with the dispatcher's ID and probe timestamp. (In case the tagging API call fails, remember the probe time in memory too.)
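
For example, reusing the hypothetical VMProvider interface sketched earlier, recording a successful probe might look roughly like this (the tag names are made up):

<pre><code class="go">
// Sketch of recording a successful probe on the cloud node's tags.
package dispatch

import (
	"context"
	"time"
)

func recordProbe(ctx context.Context, provider VMProvider, inst Instance, dispatcherID string, lastProbe map[InstanceID]time.Time) {
	now := time.Now()
	// Remember the probe time in memory too, in case the tagging API call fails.
	lastProbe[inst.ID] = now
	_ = provider.SetTags(ctx, inst.ID, map[string]string{
		"dispatcher": dispatcherID,
		"probed_at":  now.Format(time.RFC3339),
	})
}
</code></pre>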

h3. Dead/lame nodes

If a node has been up for N seconds without a successful probe, despite at least M attempts, shut it down. (M handles the case where the dispatcher restarts and there is a time when the "update tags" operation isn't effective, e.g., the provider is rate-limiting API calls.)
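
Expressed as a sketch (the field and parameter names are assumed), the rule could look like:

<pre><code class="go">
// Sketch of the dead/lame-node rule: shut down a node that has been up for
// longer than the boot timeout without a successful probe, but only after at
// least the minimum number of probe attempts.
package dispatch

import "time"

type workerState struct {
	BootedAt      time.Time
	LastProbeOK   time.Time // zero if never probed successfully
	ProbeAttempts int
}

func shouldShutDown(w workerState, bootTimeout time.Duration, minAttempts int) bool {
	if !w.LastProbeOK.IsZero() {
		return false // probes are succeeding; normal idle-node policy applies
	}
	return time.Since(w.BootedAt) > bootTimeout && w.ProbeAttempts >= minAttempts
}
</code></pre>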

h3. Dead dispatchers

Every cluster should run multiple dispatchers. If one dies and stays down, the others must notice and take over (or shut down) its nodes -- otherwise those worker nodes will run forever.

Each dispatcher, when inspecting the list of cloud nodes, should try to claim (or, failing that, destroy) any node that belongs to a different dispatcher and hasn't completed a probe for N seconds. (This timeout might be longer than the "lame node" timeout.)

h1. "crunch-dispatch-cloud" (PA)

Node manager generates a wishlist based on the container queue. Compute nodes run crunch-dispatch-local (c-d-l) or a similar service, which asks the API server for work and then runs it.

Advantages:

* Complete control over scheduling decisions / priority

Disadvantages:

* Additional load on API server (but probably not that much)
* Need a new scheme for nodes to report their status so that node manager knows whether they are busy or idle, and to put nodes in an equivalent of a "draining" state so they don't get shut down while doing work. (We can use the "nodes" table for this.)
* Need to be able to detect node failure.

h3. Starting up

# Node manager looks at pending containers to get a "wishlist"
# Nodes spin up the way they do now. However, instead of registering with slurm, they start crunch-dispatch-local.
# Node ping token should have a corresponding API token to be used by c-d-l to talk to the API server
# C-d-l pings the API server to ask for work; the ping operation puts the node in either "busy" (if work is returned) or "idle" (see the sketch below)
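
A rough sketch of the c-d-l loop implied by these steps; the API client methods, command invocation, and timing constants here are assumptions, not the real Arvados API.

<pre><code class="go">
// Hypothetical crunch-dispatch-local loop: ask the API server for work, run
// the container with crunch-run, and keep pinging "busy" while it runs.
package dispatch

import (
	"os/exec"
	"time"
)

type workAPI interface {
	// "I am idle, give me work": locks and returns the next suitable container
	// UUID (or "" if none), recording the node as busy or idle accordingly.
	RequestWork() (containerUUID string, err error)
	PingBusy(containerUUID string) error
	PingIdle() error
}

func cdlLoop(api workAPI) {
	for range time.Tick(10 * time.Second) {
		uuid, err := api.RequestWork()
		if err != nil || uuid == "" {
			continue // nothing to do; the ping has recorded this node as "idle"
		}
		done := make(chan struct{})
		go func() {
			defer close(done)
			exec.Command("crunch-run", uuid).Run() // run the container
		}()
		for running := true; running; {
			select {
			case <-done:
				running = false
				api.PingIdle() // container finished
			case <-time.After(30 * time.Second):
				api.PingBusy(uuid) // keep reporting "busy" every X seconds
			}
		}
	}
}
</code></pre>
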
h3. Running containers

Assumption: nodes only run one container at once.

# Add an "I am idle, give me work" API which locks and returns the next container that is appropriate for the configured node, or marks the node as "idle" if no work is available
# Node record records which container it is supposed to be running (can be part of the "Lock" call based on the per-node API token)
# C-d-l makes an API call to the nodes table to say it is "busy"
# C-d-l calls crunch-run to run the container
# C-d-l must continue to ping that it is "busy" every X seconds
# When the container finishes, c-d-l pings that it is "idle"

h3. Shutting down

# When node manager decides a node is ready for shutdown, it makes an API call on the node record to indicate "draining".
# C-d-l pings "I am idle" on a "draining" record. This puts the state in "drained" and c-d-l does not get any new work.
# Node manager sees the node is "drained" and can proceed with destroying the cloud node.

h3. Handling failure

# If a node enters a failure state and there is a container associated with it, the container should either be unlocked (if the container is in Locked state) or cancelled (if in Running state).
# API server should have a background process which looks for nodes that haven't pinged recently and puts them into failed state (sketched below).
# Node can also put itself into failed state with an API call.
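
A sketch of that background sweep; the record shape and the releaseContainer callback are hypothetical stand-ins for the API server's data access and container handling.

<pre><code class="go">
// Sketch of the API-server-side sweep: nodes that have not pinged recently are
// moved to the "failed" state, and any associated container is released
// (unlocked if it was only locked, cancelled if it was running).
package dispatch

import "time"

type nodeRecord struct {
	UUID          string
	State         string // "busy", "idle", "drained", "failed", ...
	LastPing      time.Time
	ContainerUUID string
}

func sweepFailedNodes(nodes []nodeRecord, pingTimeout time.Duration, releaseContainer func(containerUUID string)) {
	for i := range nodes {
		n := &nodes[i]
		if n.State == "failed" || time.Since(n.LastPing) < pingTimeout {
			continue
		}
		n.State = "failed"
		if n.ContainerUUID != "" {
			releaseContainer(n.ContainerUUID)
		}
	}
}
</code></pre>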