
h1. Dispatching containers to cloud VMs 

 (Draft) 

 h2. Component name 

(TBD) crunch-dispatch-cloud, or arvados-dispatch-cloud, or arvados-dispatch (c-d-slurm and c-d-local could become arvados-dispatch modules, selected at runtime via config, rather than shipping as separate packages/programs).

h2. Background

This is about dispatching to on-demand cloud nodes like Amazon EC2 instances. Not to be confused with dispatching to a cloud-based container service like Amazon Elastic Container Service, Azure Batch, or Google Kubernetes Engine.

In crunch1, and in the early days of crunch2, we made something work with arvados-nodemanager and SLURM. One of the goals of crunch2 is eliminating all uses of SLURM, with the exception of crunch-dispatch-slurm, whose purpose is to dispatch arvados containers to a SLURM cluster that already exists for non-Arvados tasks.

This doc doesn't describe a sequence of development tasks or a migration plan. It describes the end state: how dispatch will work when all implementation tasks and migrations are complete.

h2. Overview

The dispatcher waits for containers to appear in the queue, and runs them on appropriately sized cloud VMs. When there are no idle cloud VMs of the desired size, the dispatcher brings up more VMs using the cloud provider's API. The dispatcher also shuts down idle VMs that exceed the configured idle timer, and sooner if the provider refuses to create new VMs.

h2. Interaction with other components

API server (backed by PostgreSQL) supplies the container queue: the source of truth about which containers the system should be trying to execute (or cancel) at any given time.

 The cloud provider's API supplies a list of VMs that exist (or are being created) at a given time and their network addresses, accepts orders to create new VMs, updates instance tags, and (optionally, depending on the driver) obtains the VMs' SSH server public keys. 

 The SSH server on each cloud VM allows the dispatcher to authenticate with a private key and execute shell commands as root. 

 h2. Configuration 

Arvados configuration (currently a file in /etc, in future possibly via consul/etcd or similar) is the source of truth for cloud provider credentials, allowed node types, spending limits/policies, etc. For example:

 <pre><code class="yaml"> 
     CloudVMs: 
       BootTimeout: 20m 
       Driver: Amazon 
       DriverParameters: 
         Region: us-east-1 
         APITimeout: 20s 
         EC2Key: abcdef 
         EC2Secret: abcdefghijklmnopqrstuvwxyz 
         StorageKey: abcdef 
         StorageSecret: abcdefghijklmnopqrstuvwxyz 
         ImageID: ami-0123456789abcdef0 
         SubnetID: subnet-01234567 
         SecurityGroups: sg-01234567 
 </code></pre> 
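For illustration, the YAML above might map onto Go configuration structs along these lines (the names, and the use of an uninterpreted DriverParameters map, are assumptions rather than a final schema; duration parsing and validation are elided):

<pre><code class="go">
// Illustrative only: one way the YAML above could map onto Go structs.
package config

import "time"

type CloudVMsConfig struct {
	BootTimeout      time.Duration          // e.g. 20m: give up on instances that don't come up in time
	Driver           string                 // e.g. "Amazon": selects a driver implementation at runtime
	DriverParameters map[string]interface{} // passed through, uninterpreted, to the selected driver
}
</code></pre>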

 h2. Scheduling policy 

 The container priority field determines the order in which resources are allocated. 
 * If container C1 has priority P1, 
 * ...and C2 has higher priority P2, 
* ...and there is no pending/booting/idle VM suitable for running C2,
 * ...then C1 will not be started. 

However, queued containers that run on different VM types don't necessarily start in priority order.
 * If container C1 has priority P1, 
 * ...and C2 has higher priority P2, 
* ...and there is no idle VM suitable for running C2,
 * ...and there is a pending/booting VM that will be suitable for running C2 when it comes up, 
 * ...and there is an idle VM suitable for running C1, 
* ...then C1 will start before C2.
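A rough sketch of this policy in Go (the Container and Pool types, and the pool methods, are hypothetical helpers invented for the sketch, not the real dispatcher API):

<pre><code class="go">
// Sketch only. Container, Pool, and the pool methods are hypothetical.
package scheduler

type Container struct {
	UUID         string
	Priority     int
	InstanceType string
}

type Pool interface {
	StartIfIdle(instanceType, containerUUID string) bool // start on an idle VM, if one exists
	BootingOrPending(instanceType string) bool           // a suitable VM is already being created/booted
	Create(instanceType string) bool                      // order a new VM; false if the provider refuses
}

// schedule walks the queue in descending-priority order, applying the
// two rules above.
func schedule(queue []Container, pool Pool) {
	for _, ctr := range queue {
		switch {
		case pool.StartIfIdle(ctr.InstanceType, ctr.UUID):
			// Started on an existing idle VM.
		case pool.BootingOrPending(ctr.InstanceType) || pool.Create(ctr.InstanceType):
			// A suitable VM will be available soon; this container waits,
			// but lower-priority containers may still start on idle VMs of
			// other instance types.
		default:
			// No suitable VM exists and none can be created (e.g. quota
			// reached): don't start anything with lower priority.
			return
		}
	}
}
</code></pre>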

h2. Synchronizing state

When first starting up, the dispatcher inspects the API server's container queue and the cloud provider's list of dispatcher-tagged cloud nodes, and restores its internal state accordingly.

Often, at startup there will be some containers with state=Locked. To avoid breaking priority order, the dispatcher won't schedule any new containers until all such locked containers are matched up with crunch-run processes on existing VMs (typically preparing a docker image), or all of the existing VMs have been probed successfully (meaning the locked containers aren't running anywhere and need to be rescheduled).

When a user cancels a container request with state=Locked or Running, the container priority changes to 0. On its next poll, the dispatcher notices this and kills any corresponding crunch-run processes (or, if there is no such process, just unlocks the container).

When a crunch-run process ends without finalizing its container's state, the dispatcher notices this and sets the container state to Cancelled.

Invariant: every dispatcher-tagged cloud node is either needed by this dispatcher, or should be shut down (so if there are multiple dispatchers, they must use different tags).
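A minimal sketch of the startup gate described above, using hypothetical bookkeeping types (not the real implementation):

<pre><code class="go">
// Sketch of the startup gate: don't schedule new work until every
// container that was Locked at startup is accounted for.
package dispatcher

type startupState struct {
	locked   map[string]bool // container UUIDs that were state=Locked at startup
	resolved map[string]bool // locked containers matched to a crunch-run process on some VM
}

// canSchedule reports whether new containers may be scheduled yet:
// either all existing VMs have been probed (so unmatched locked
// containers are known to be running nowhere), or every locked
// container has been matched to a crunch-run process.
func (s *startupState) canSchedule(allVMsProbed bool) bool {
	if allVMsProbed {
		return true
	}
	for uuid := range s.locked {
		if !s.resolved[uuid] {
			return false
		}
	}
	return true
}
</code></pre>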

h2. Operator view

 Management status endpoint provides: 
* list of cloud VMs, each with
 ** provider's cloud instance ID 
 ** hourly price (from configuration file) 
 ** instance type (from configuration file) 
 ** instance type (from provider's menu) 
** UUID of the current / most recent container it attempted (if known)
 ** time last container finished (or boot time, if nothing run yet) 
 * list of queued/running containers, each with 
 ** UUID 
 ** state (queued/locked/running/complete/cancelled) 
 ** desired instance type 
 ** time appeared in queue 
 ** time started (if started) 
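For illustration, the per-VM and per-container entries might be rendered as structs like these (field names are assumptions, not a committed API):

<pre><code class="go">
// Illustrative response shapes for the management status endpoint.
package status

import "time"

type InstanceStatus struct {
	InstanceID        string    // provider's cloud instance ID
	Price             float64   // hourly price, from configuration
	ConfiguredType    string    // instance type, from configuration file
	ProviderType      string    // instance type, from provider's menu
	LastContainerUUID string    // current / most recent container attempted, if known
	LastBusy          time.Time // time last container finished, or boot time
}

type ContainerStatus struct {
	UUID         string
	State        string    // queued/locked/running/complete/cancelled
	InstanceType string    // desired instance type
	QueuedAt     time.Time // time appeared in queue
	StartedAt    time.Time // zero if not started
}
</code></pre>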

 Metrics endpoint tracks: 
 * (each VM) time elapsed between VM creation and first successful SSH connection 
 * (each VM) time elapsed between first successful SSH connection and ready to run a container 
* total hourly price of all existing VMs
* total VCPUs and memory allocated to containers
 * number of containers running 
* number of containers allocated to VMs but not started yet (because VMs are pending/booting)
* number of containers not allocated to VMs (because provider quota is reached)
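Assuming a Prometheus-style metrics endpoint (the format isn't specified here), a few of these metrics might be declared as follows; metric names are illustrative only:

<pre><code class="go">
// Illustrative metric declarations; names are not final.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	bootToSSH = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "dispatchcloud_instance_boot_to_ssh_seconds",
		Help: "Time between VM creation and first successful SSH connection.",
	})
	instancesPrice = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dispatchcloud_instances_total_price",
		Help: "Total hourly price of all existing VMs.",
	})
	containersRunning = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dispatchcloud_containers_running",
		Help: "Number of containers currently running.",
	})
)

func init() {
	prometheus.MustRegister(bootToSSH, instancesPrice, containersRunning)
}
</code></pre>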

h2. Mechanisms

h3. Interface between dispatcher and cloud provider

 h2. SSH keys "VMProvider" interface has the few cloud instance operations we need (list instances+tags, create instance with tags, update instance tags, destroy instance). 

 h3. Interface between dispatcher and worker node 

 Each worker node has a public key in /root/.ssh/authorized_keys. Dispatcher has the corresponding private key. 

Dispatcher uses the Go SSH client library to connect to worker nodes.

(Future) Dispatcher generates its own keys and installs its public key on new VMs using cloud provider bootstrapping/metadata features.
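A sketch of connecting to a worker node as root with the dispatcher's private key, using the golang.org/x/crypto/ssh client library (host key verification is deliberately skipped here and would be needed in a real implementation):

<pre><code class="go">
// Sketch: run a command as root on a worker node over SSH.
package worker

import "golang.org/x/crypto/ssh"

func runAsRoot(addr string, privateKeyPEM []byte, cmd string) ([]byte, error) {
	signer, err := ssh.ParsePrivateKey(privateKeyPEM)
	if err != nil {
		return nil, err
	}
	client, err := ssh.Dial("tcp", addr, &ssh.ClientConfig{
		User: "root",
		Auth: []ssh.AuthMethod{ssh.PublicKeys(signer)},
		// A real implementation should verify the instance's host key
		// instead of skipping the check.
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	})
	if err != nil {
		return nil, err
	}
	defer client.Close()
	session, err := client.NewSession()
	if err != nil {
		return nil, err
	}
	defer session.Close()
	return session.CombinedOutput(cmd)
}
</code></pre>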

h3. Probes

 Sometimes (on the happy path) the dispatcher knows the state of each worker, whether it's idle, and which container it's running. In general, it's necessary to probe the worker node itself. 

 Probe: 
 * Check whether the SSH connection is alive; reopen if needed. 
 * Run the configured "ready?" command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting. 
 * Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID) 

 Dispatcher, after a successful probe, should tag the cloud node record with the dispatcher's ID and probe timestamp. (In case the tagging API fails, remember the probe time in memory too.) 
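A sketch of a single probe, assuming a run-command helper like the SSH example above and assuming "crunch-run --list" prints one "pid container-uuid" pair per line (the exact output format is an assumption here):

<pre><code class="go">
// Sketch of a single probe; the run callback executes a command on the
// node (e.g. over SSH) and the crunch-run output format is assumed.
package worker

import (
	"fmt"
	"strings"
)

func probe(run func(cmd string) ([]byte, error), readyCmd string) (running []string, booting bool, err error) {
	if _, err := run(readyCmd); err != nil {
		// The configured "ready?" command failed: the node is still booting.
		return nil, true, nil
	}
	out, err := run("crunch-run --list")
	if err != nil {
		return nil, false, fmt.Errorf("crunch-run --list: %w", err)
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if fields := strings.Fields(line); len(fields) >= 2 {
			running = append(running, fields[1]) // container UUID
		}
	}
	return running, false, nil
}
</code></pre>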

h3. Detecting dead/lame nodes

 If a node has been up for N seconds without a successful probe, despite at least M attempts, shut it down. (M handles the case where the dispatcher restarts during a time when the "update tags" operation isn't effective, e.g., provider is rate-limiting API calls.) 
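A small sketch of this rule, where timeout and minAttempts play the roles of N and M:

<pre><code class="go">
// Sketch of the dead/lame-node rule; timeout and minAttempts play the
// roles of N and M above.
package worker

import "time"

func shouldShutDown(booted, lastOKProbe time.Time, failedProbes int, timeout time.Duration, minAttempts int) bool {
	// Count from the last successful probe, or from boot time if no
	// probe has succeeded yet.
	since := booted
	if lastOKProbe.After(since) {
		since = lastOKProbe
	}
	return time.Since(since) > timeout && failedProbes >= minAttempts
}
</code></pre>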

h3. Multiple dispatchers

Not supported in the initial version. Eventually, every cluster should be able to run multiple dispatchers: if one dies and stays down, the others must notice and take over (or shut down) its nodes -- otherwise those worker nodes will run forever.

 Each dispatcher, when inspecting the list of cloud nodes, should try to claim (or, failing that, destroy) any node that belongs to a different dispatcher and hasn't completed a probe for N seconds. (This timeout might be longer than the "lame node" timeout.) 

 h2. Dispatcher internal flow 

 h3. Container lifecycle 

 # Dispatcher gets list of pending/running containers from API server 
 # Each container record either starts a new goroutine or sends the update on a channel to the existing goroutine 
# On start, the container goroutine calls "allocate node" on the scheduler with the container UUID, priority, instance type, and container state channel.
 # "Allocate node" blocks until it can return a node object allocated to that container 
 # The node must be either idle, or busy running that specific container 
 ## If the container is "locked" and the node is idle, call "crunch-run" to start the container 
 ## If the container is "locked" and the node is busy, then do nothing and continue monitoring 
 ## If the container is "running" and the node is busy, then do nothing and continue monitoring 
 ## If the container is "running" and the node is idle, cancel the container (this means the original container run has been lost) 
 ## If the container is "complete" or "cancelled" then tell the scheduler to release the node 
 # Done 
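A simplified sketch of the per-container goroutine pattern described above (types, channel payloads, and the allocate/release callbacks are illustrative, and the idle/busy distinction is collapsed for brevity):

<pre><code class="go">
// Sketch of one goroutine per container; the real flow also
// distinguishes idle vs. busy nodes as described in the list above.
package dispatcher

type containerUpdate struct {
	UUID     string
	State    string // Locked, Running, Complete, Cancelled
	Priority int
}

type Node interface {
	RunContainer(uuid string) error // invoke crunch-run for this container
}

func runContainer(updates <-chan containerUpdate, allocateNode func(containerUpdate) Node, releaseNode func(Node)) {
	first := <-updates
	// Blocks until a node is idle, or already busy running this container.
	node := allocateNode(first)
	defer releaseNode(node) // "release node" tells the scheduler the node is free again
	if first.State == "Locked" {
		node.RunContainer(first.UUID)
	}
	for upd := range updates {
		if upd.State == "Complete" || upd.State == "Cancelled" {
			return
		}
		// Otherwise (Locked/Running on a busy node): keep monitoring.
	}
}
</code></pre>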

 h3. Scheduler lifecycle 

# Periodically poll all cloud nodes to get their current state (one of: booting, idle, busy, shutdown), whether they are free or locked, and what container they are running
# When an allocation request comes in, add it to the request list/heap (sorted by priority) and monitor the container state channel for changes in priority
# If it is at the top of the request list, immediately try to schedule it. Otherwise, periodically run the scheduler on the request list starting from the top.
## Check if there is a busy node that is already running that container; if so, return that node
## Check for a free, idle node of the appropriate instance type; if found, lock it and return it
 ## If no node is found, start a "create node" goroutine and go on to the next request 
## Keep a count of nodes in "booting" state; create new nodes when the number of unallocated requests exceeds the number of nodes already booting
 # When the scheduler gets notice that a new idle node is available, check the request list for pending requests for that instance type, and allocate the node (if not, node remains free / idle) 
 ## Similarly when scheduler gets a "release node" call, mark node as free / idle, then check request list 
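A sketch of the booting-node accounting mentioned above, counting per instance type (names are illustrative):

<pre><code class="go">
// Sketch of the booting-node accounting: order new nodes, per instance
// type, only when unallocated requests outnumber nodes already booting.
package scheduler

func nodesToCreate(unallocatedRequests, bootingNodes map[string]int) map[string]int {
	create := map[string]int{}
	for instanceType, want := range unallocatedRequests {
		if n := want - bootingNodes[instanceType]; n > 0 {
			create[instanceType] = n
		}
	}
	return create
}
</code></pre>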

 h3. Node startup 

 # Generate a random node id 
 # Starts a "create node" request, tagged with custom node id / instance type 
 # Send create request to the cloud 
 # Done 

 h3. Node lifecycle 

 # Periodically poll the cloud node list 
# For each node in the cloud node list, start a monitor goroutine (if none already)
 # Get the IP address 
 # Establish ssh connection 
 # Once connected, run "status" script to determine: 
 ## node state (booting, idle, busy, shutdown) 
 ## if the node is running a container 
 ## timestamp of last state transition 
 ## node state stored in small files in /var/lib/crunch protected by flock (?) 
 # Once node is ready, send notice of new idle node to the scheduler 
 # Node goroutine continues to monitor remote node state, send state updates to scheduler 
 # If node is idle for several consecutive status polling / scheduling cycles, put node into "shutdown" state (can no longer accept work) 
 # Send destroy request 
 # Wait for node to disappear from the cloud node list 
 # Done
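A sketch of the per-node monitor goroutine described above (the probe callback and the scheduler notification channel are assumptions for the sketch):

<pre><code class="go">
// Sketch of a per-node monitor goroutine.
package worker

import "time"

type nodeState string

const (
	stateBooting  nodeState = "booting"
	stateIdle     nodeState = "idle"
	stateBusy     nodeState = "busy"
	stateShutdown nodeState = "shutdown"
)

// monitorNode polls the remote node's state and tells the scheduler
// (via notify) whenever the node becomes idle.
func monitorNode(nodeID string, probeState func() nodeState, notify chan<- string, interval time.Duration) {
	last := stateBooting
	for range time.Tick(interval) {
		state := probeState()
		if state == stateIdle && last != stateIdle {
			notify <- nodeID // a new idle node is available
		}
		last = state
		if state == stateShutdown {
			return
		}
	}
}
</code></pre>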