Project

General

Profile

Dispatching containers to cloud VMs » History » Revision 10

Revision 9 (Tom Clegg, 10/11/2018 08:52 PM) → Revision 10/82 (Tom Clegg, 10/11/2018 09:27 PM)

h1. Dispatching containers to cloud VMs 

 (Draft) 

 h2. Component name / purpose 

 crunch-dispatch-cloud runs Arvados user containers on generic public cloud infrastructure by automatically creating and destroying VMs of various sizes according to demand, preparing the VMs' runtime environments, and running containers on them. 

 h2. Overview of operation 

 The dispatcher waits for containers to appear in the queue, and runs them on appropriately sized cloud VMs. When there are no idle cloud VMs with the desired size, the dispatcher brings up more VMs using the cloud provider's API. The dispatcher also shuts down idle VMs that exceed the configured idle timer -- and sooner if the provider starts refusing to create new VMs. 

 h2. Interaction with other components 

 Controller (backed by RailsAPI and PostgreSQL) supplies the container queue: which containers the system should be trying to execute (or cancel) at any given time. 

 The cloud provider's API supplies a list of VMs that exist (or are being created) at a given time and their network addresses, accepts orders to create new VMs, updates instance tags, and (optionally, depending on the driver) obtains the VMs' SSH server public keys. 

 The SSH server on each cloud VM allows the dispatcher to authenticate with a private key and execute shell commands as root. 

 h2. Configuration 

 Arvados configuration (currently a file in /etc) supplies cloud provider credentials, allowed node types, spending limits/policies, etc. 

 <pre><code class="yaml"> 
     CloudVMs: 
       BootProbeCommand: "docker ps -q" 
       SyncInterval: 1m      # get list of  
       TimeoutIdle: 1m       # shutdown if idle longer than this 
       TimeoutBooting: 10m # shutdown if exists longer than this without running BootProbeCommand successfully 
       TimeoutProbe: 2m      # shutdown if (after booting) communication fails longer than this, even if ctrs are running 
       TimeoutShutdown: 1m # shutdown again if node still exists this long after shutdown 
       Driver: Amazon 
       DriverParameters:     # following configs are driver dependent 
         Region: us-east-1 
         APITimeout: 20s 
         EC2Key: abcdef 
         EC2Secret: abcdefghijklmnopqrstuvwxyz 
         StorageKey: abcdef 
         StorageSecret: abcdefghijklmnopqrstuvwxyz 
         ImageID: ami-0123456789abcdef0 
         SubnetID: subnet-01234567 
         SecurityGroups: sg-01234567 
     Dispatch: 
       StaleLockTimeout: 1m       # after restart, time to wait for workers to come up before abandoning locks from previous run 
       PollInterval: 1m           # how often to get latest queue from arvados controller 
       ProbeInterval: 10s         # how often to probe each instance for current status/vital signs 
       MaxProbesPerSecond: 1000 # limit total probe rate for dispatch process (across all instances) 
       PrivateKey: |              # SSH key able to log in as root@ worker VMs 
         -----BEGIN RSA PRIVATE KEY----- 
         MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ 
         0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E 
         GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV 
         mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7 
         LtarBCFaK/pD7uWll/Uj7h7D8K48nIZUrvBJJjXL8Sm4LxCNoz3Z83k8J5ZzuDRD 
         gRiQe/C085mhO6VL+2fypDLwcKt1tOL8fI81MwIDAQABAoIBACR3tEnmHsDbNOav 
         Oxq8cwRQh9K2yDHg8BMJgz/TZa4FIx2HEbxVIw0/iLADtJ+Z/XzGJQCIiWQuvtg6 
         exoFQESt7JUWRWkSkj9JCQJUoTY9Vl7APtBpqG7rIEQzd3TvzQcagZNRQZQO6rR7 
         p8sBdBSZ72lK8cJ9tM3G7Kor/VNK7KgRZFNhEWnmvEa3qMd4hzDcQ4faOn7C9NZK 
         dwJAuJVVfwOLlOORYcyEkvksLaDOK2DsB/p0AaCpfSmThRbBKN5fPXYaKgUdfp3w 
         70Hpp27WWymb1cgjyqSH3DY+V/kvid+5QxgxCBRq865jPLn3FFT9bWEVS/0wvJRj 
         iMIRrjECgYEA4Ffv9rBJXqVXonNQbbstd2PaprJDXMUy9/UmfHL6pkq1xdBeuM7v 
         yf2ocXheA8AahHtIOhtgKqwv/aRhVK0ErYtiSvIk+tXG+dAtj/1ZAKbKiFyxjkZV 
         X72BH7cTlR6As5SRRfWM/HaBGEgED391gKsI5PyMdqWWdczT5KfxAksCgYEAwXYE 
         ewPmV1GaR5fbh2RupoPnUJPMj36gJCnwls7sGaXDQIpdlq56zfKgrLocGXGgj+8f 
         QH7FHTJQO15YCYebtsXWwB3++iG43gVlJlecPAydsap2CCshqNWC5JU5pan0QzsP 
         exzNzWqfUPSbTkR2SRaN+MenZo2Y/WqScOAth7kCgYBgVoLujW9EXH5QfXJpXLq+ 
         jTvE38I7oVcs0bJwOLPYGzcJtlwmwn6IYAwohgbhV2pLv+EZSs42JPEK278MLKxY 
         lgVkp60npgunFTWroqDIvdc1TZDVxvA8h9VeODEJlSqxczgbMcIUXBM9yRctTI+5 
         7DiKlMUA4kTFW2sWwuOlFwKBgGXvrYS0FVbFJKm8lmvMu5D5x5RpjEu/yNnFT4Pn 
         G/iXoz4Kqi2PWh3STl804UF24cd1k94D7hDoReZCW9kJnz67F+C67XMW+bXi2d1O 
         JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN 
         ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI 
         pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9 
         1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ 
         -----END RSA PRIVATE KEY----- 
     InstanceTypes: 
     - Name: m4.large 
       VCPUs: 2 
       RAM: 7782000000 
       Scratch: 32000000000 
       Price: 0.1 
     - Name: m4.large.spot 
       Preemptible: true 
       VCPUs: 2 
       RAM: 7782000000 
       Scratch: 32000000000 
       Price: 0.1 
     - Name: m4.xlarge 
       VCPUs: 4 
       RAM: 15564000000 
       Scratch: 80000000000 
       Price: 0.2 
     - Name: m4.xlarge.spot 
       Preemptible: true 
       VCPUs: 4 
       RAM: 15564000000 
       Scratch: 80000000000 
       Price: 0.2 
     - Name: m4.2xlarge 
       VCPUs: 8 
       RAM: 31129000000 
       Scratch: 160000000000 
       Price: 0.4 
     - Name: m4.2xlarge.spot 
       Preemptible: true 
       VCPUs: 8 
       RAM: 31129000000 
       Scratch: 160000000000 
       Price: 0.4 
 </code></pre> 

 h2. Management API 

 APIs for monitoring/diagnostics/control are available via HTTP on a configurable address/port. Request headers must include "Authorization: Bearer {management token}". 

 Responses are JSON-encoded and resemble other Arvados APIs: 
 <pre><code class="json"> 
 { 
   "Items": [ 
     { 
       "Name": "...", 
       ... 
     }, 
     ... 
   ] 
 } 
 </code></pre> 

 @GET /arvados/v1/dispatch/instances@ lists cloud VMs. Each returned item includes: 
 * provider's instance ID 
 * hourly price (from configuration file) 
 * instance type (from configuration file) 
 * instance type (from provider's menu) 
 * UUID of the current / most recent container attempted (if known) 
 * time last container finished (or boot time, if nothing run yet) 

 @GET /arvados/v1/dispatch/containers@ lists queued/locked/running containers. Each returned item includes: 
 * container UUID 
 * container state (Queued/Locked/Running/Complete/Cancelled) 
 * desired instance type 
 * time appeared in queue 
 * time started (if started) 

 @POST /arvados/v1/dispatch/instances/:instance_id/drain@ puts an instance in "drain" state. 
 * if the instance is currently running a container, it is allowed to continue 
 * no further containers will be scheduled on the instance 
 * (TBD) the instance will not be shut down automatically 

 @POST /arvados/v1/dispatch/instances/:instance_id/shutdown@ puts an instance in "shutdown" state. 
 * if the instance is currently running a container, the instance is shut down when the container finishes 
 * otherwise, the instance is shut down immediately 

 h2. Metrics 

 (Future) Metrics are available via HTTP on a configurable address/port. Request headers must include "Authorization: Bearer {management token}". 

 Metrics include: 
 * (summary) time elapsed between VM creation and first successful SSH connection to that VM 
 * (summary) time elapsed between first successful SSH connection on a VM and ready to run a container on that VM 
 * (gauge) total hourly price of all existing VMs 
 * (gauge) total VCPUs and memory allocated to containers 
 * (gauge) number of containers running 
 * (gauge) number of containers allocated to VMs but not started yet (because VMs are pending/booting) 
 * (gauge) number of containers not allocated to VMs (because provider quota is reached) 

 h2. Internal details 

 h3. Scheduling policy 

 The container priority field determines the order in which resources are allocated. 
 * If container C1 has priority P1, 
 * ...and C2 has higher priority P2, 
 * ...and there is no pending/booting/idle VM suitable for running C2, 
 * ...then C1 will not be started. 

 However, containers that run on different VM types don't necessarily start in priority order. 
 * If container C1 has priority P1, 
 * ...and C2 has higher priority P2, 
 * ...and there is no idle VM suitable for running C2, 
 * ...and there is a pending/booting VM that will be suitable for running C2 when it comes up, 
 * ...and there is an idle VM suitable for running C1, 
 * ...then C1 will start before C2. 

 h3. Special cases / synchronizing h2. Synchronizing state 

 When first starting up, dispatcher inspects API server’s container queue and the cloud provider’s list of dispatcher-tagged cloud nodes, and restores internal state accordingly. 

 Some containers might have state=Locked Often, at startup. The dispatcher can't startup there will be sure these have no corresponding crunch-run process anywhere until it establishes communication some containers with all running instances. state=Locked. To avoid breaking priority order by guessing wrong, order, the dispatcher avoids scheduling won't schedule any new containers until all such "stale-locked" locked containers are matched up with crunch-run processes on existing VMs (typically preparing a docker image) or all of the existing VMs have been probed successfully (meaning the locked containers aren't running anywhere and need to be rescheduled). 

 When a user cancels a container request with state=Locked or Running, the container priority changes to 0. On its next poll, the dispatcher notices this and kills any corresponding crunch-run processes (or, if there is no such process, just unlocks the container). 

 When a crunch-run process ends without finalizing its container's state, the dispatcher notices this and sets state to Cancelled. 

 h3. h2. Operator view 

 Management status endpoint provides: 
 * list of cloud VMs, each with 
 ** provider's instance ID 
 ** hourly price (from configuration file) 
 ** instance type (from configuration file) 
 ** instance type (from provider's menu) 
 ** UUID of the current / most recent container attempted (if known) 
 ** time last container finished (or boot time, if nothing run yet) 
 * list of queued/running containers, each with 
 ** UUID 
 ** state (queued/locked/running/complete/cancelled) 
 ** desired instance type 
 ** time appeared in queue 
 ** time started (if started) 

 Metrics endpoint tracks: 
 * (each VM) time elapsed between VM creation and first successful SSH connection 
 * (each VM) time elapsed between first successful SSH connection and ready to run a container 
 * total hourly price of all existing VMs 
 * total VCPUs and memory allocated to containers 
 * number of containers running 
 * number of containers allocated to VMs but not started yet (because VMs are pending/booting) 
 * number of containers not allocated to VMs (because provider quota is reached) 

 h2. SSH keys 

 The operator must install Each worker node has a public key in /root/.ssh/authorized_keys on each worker node. /root/.ssh/authorized_keys. Dispatcher has the corresponding private key. 

 (Future) Dispatcher generates its own keys and installs its public key on new VMs using cloud provider bootstrapping/metadata features. 

 h3. Probes 

 Sometimes (on the happy path) the dispatcher knows the state of each worker, whether it's idle, and which container it's running. In general, it's necessary to probe the worker node itself. 

 Probe: 
 * Check whether the SSH connection is alive; reopen if needed. 
 * Run the configured "ready?" command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting. 
 * Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID) 

 Dispatcher, after a successful probe, should tag the cloud node record with the dispatcher's ID and probe timestamp. (In case the tagging API fails, remember the probe time in memory too.) 

 h3. Detecting dead/lame nodes 

 If a node has been up for N seconds without a successful probe, despite at least M attempts, it is shut down, even if it was running down. (M handles the case where the dispatcher restarts during a container last time it was contacted successfully. when the "update tags" operation isn't effective, e.g., provider is rate-limiting API calls.) 

 h3. Multiple dispatchers 

 Not supported in initial version.