Dispatching containers to cloud VMs » History » Revision 29
« Previous |
Revision 29/82
(diff)
| Next »
Tom Clegg, 11/09/2018 09:22 PM
Dispatching containers to cloud VMs¶
(Draft)
- Table of contents
- Dispatching containers to cloud VMs
- Future plans / features
Component name / purpose¶
crunch-dispatch-cloud runs Arvados user containers on generic public cloud infrastructure by automatically creating and destroying VMs of various sizes according to demand, preparing the VMs' runtime environments, and running containers on them.
Deployment¶
Where to install: The crunch-dispatch-cloud process can run anywhere, as long as it has network access to the Arvados controller, the cloud provider's API, and the worker VMs. Each Arvados cluster should run only one crunch-dispatch-cloud process (future versions will support multiple dispatchers).
Dispatcher's SSH key: The operator must generate an SSH key pair for the dispatcher to use when connecting to cloud VMs. The private key is stored (without a passphrase) in the cluster configuration file. It does not need to be saved in ~/.ssh/
.
/root/.ssh/authorized_keys
. The image should also include suitable versions of docker and crunch-run. (Alternatively, it is possible to install these using a custom boot probe command, but pre-installing is more efficient.)
- (Future) Configurable SSH port number.
- (Future) Automatically sync the crunch-run binary from the dispatcher host to each worker node.
Cloud provider account: The dispatcher uses cloud provider credentials to create and delete VMs and other cloud resources. An Arvados user can create an arbitrary number of long-running containers, and the dispatcher will try to run all of them. Currently the dispatcher does not enforce any resource limits of its own, so the installer must ensure the cloud provider itself is enforcing a suitable quota.
Migrating from nodemanager/SLURM: When VM images, SSH keys, and configuration files are ready, disable nodemanager and crunch-dispatch-slurm. Install crunch-dispatch-cloud deb/rpm package. Confirm success with systemctl status crunch-dispatch-cloud
and journalctl -fu crunch-dispatch-cloud
.
Overview of operation¶
The dispatcher waits for containers to appear in the queue, and runs them on appropriately sized cloud VMs. When there are no idle cloud VMs with the desired size, the dispatcher brings up more VMs using the cloud provider's API. The dispatcher also shuts down idle VMs that exceed the configured idle timer -- and sooner if the provider starts refusing to create new VMs.
Interaction with other components¶
Controller (backed by RailsAPI and PostgreSQL) supplies the container queue: which containers the system should be trying to execute (or cancel) at any given time.
The cloud provider's API supplies a list of VMs that exist (or are being created) at a given time and their network addresses, accepts orders to create new VMs, updates instance tags, and (optionally, depending on the driver) obtains the VMs' SSH server public keys.
The SSH server on each cloud VM allows the dispatcher to authenticate with a private key and execute shell commands as root.
Configuration¶
Arvados configuration (currently a file in /etc) supplies cloud provider credentials, allowed node types, spending limits/policies, etc.
CloudVMs:
BootProbeCommand: "docker ps -q"
SyncInterval: 1m # how often to get list of active instances from cloud provider
TimeoutIdle: 1m # shutdown if idle longer than this
TimeoutBooting: 10m # shutdown if exists longer than this without running BootProbeCommand successfully
TimeoutProbe: 2m # shutdown if (after booting) communication fails longer than this, even if ctrs are running
TimeoutShutdown: 1m # shutdown again if node still exists this long after shutdown
Driver: Amazon
DriverParameters: # following configs are driver dependent
Region: us-east-1
APITimeout: 20s
EC2Key: abcdef
EC2Secret: abcdefghijklmnopqrstuvwxyz
StorageKey: abcdef
StorageSecret: abcdefghijklmnopqrstuvwxyz
ImageID: ami-0123456789abcdef0
SubnetID: subnet-01234567
SecurityGroups: sg-01234567
Dispatch:
StaleLockTimeout: 1m # after restart, time to wait for workers to come up before abandoning locks from previous run
PollInterval: 1m # how often to get latest queue from arvados controller
ProbeInterval: 10s # how often to probe each instance for current status/vital signs
MaxProbesPerSecond: 1000 # limit total probe rate for dispatch process (across all instances)
PrivateKey: | # SSH key able to log in as root@ worker VMs
-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEAqYm4XsQHm8sBSZFwUX5VeW1OkGsfoNzcGPG2nzzYRhNhClYZ
0ABHhUk82HkaC/8l6d/jpYTf42HrK42nNQ0r0Yzs7qw8yZMQioK4Yk+kFyVLF78E
GRG4pGAWXFs6pUchs/lm8fo9zcda4R3XeqgI+NO+nEERXmdRJa1FhI+Za3/S/+CV
mg+6O00wZz2+vKmDPptGN4MCKmQOCKsMJts7wSZGyVcTtdNv7jjfr6yPAIOIL8X7
LtarBCFaK/pD7uWll/Uj7h7D8K48nIZUrvBJJjXL8Sm4LxCNoz3Z83k8J5ZzuDRD
gRiQe/C085mhO6VL+2fypDLwcKt1tOL8fI81MwIDAQABAoIBACR3tEnmHsDbNOav
Oxq8cwRQh9K2yDHg8BMJgz/TZa4FIx2HEbxVIw0/iLADtJ+Z/XzGJQCIiWQuvtg6
exoFQESt7JUWRWkSkj9JCQJUoTY9Vl7APtBpqG7rIEQzd3TvzQcagZNRQZQO6rR7
p8sBdBSZ72lK8cJ9tM3G7Kor/VNK7KgRZFNhEWnmvEa3qMd4hzDcQ4faOn7C9NZK
dwJAuJVVfwOLlOORYcyEkvksLaDOK2DsB/p0AaCpfSmThRbBKN5fPXYaKgUdfp3w
70Hpp27WWymb1cgjyqSH3DY+V/kvid+5QxgxCBRq865jPLn3FFT9bWEVS/0wvJRj
iMIRrjECgYEA4Ffv9rBJXqVXonNQbbstd2PaprJDXMUy9/UmfHL6pkq1xdBeuM7v
yf2ocXheA8AahHtIOhtgKqwv/aRhVK0ErYtiSvIk+tXG+dAtj/1ZAKbKiFyxjkZV
X72BH7cTlR6As5SRRfWM/HaBGEgED391gKsI5PyMdqWWdczT5KfxAksCgYEAwXYE
ewPmV1GaR5fbh2RupoPnUJPMj36gJCnwls7sGaXDQIpdlq56zfKgrLocGXGgj+8f
QH7FHTJQO15YCYebtsXWwB3++iG43gVlJlecPAydsap2CCshqNWC5JU5pan0QzsP
exzNzWqfUPSbTkR2SRaN+MenZo2Y/WqScOAth7kCgYBgVoLujW9EXH5QfXJpXLq+
jTvE38I7oVcs0bJwOLPYGzcJtlwmwn6IYAwohgbhV2pLv+EZSs42JPEK278MLKxY
lgVkp60npgunFTWroqDIvdc1TZDVxvA8h9VeODEJlSqxczgbMcIUXBM9yRctTI+5
7DiKlMUA4kTFW2sWwuOlFwKBgGXvrYS0FVbFJKm8lmvMu5D5x5RpjEu/yNnFT4Pn
G/iXoz4Kqi2PWh3STl804UF24cd1k94D7hDoReZCW9kJnz67F+C67XMW+bXi2d1O
JIBvlVfcHb1IHMA9YG7ZQjrMRmx2Xj3ce4RVPgUGHh8ra7gvLjd72/Tpf0doNClN
ti/hAoGBAMW5D3LhU05LXWmOqpeT4VDgqk4MrTBcstVe7KdVjwzHrVHCAmI927vI
pjpphWzpC9m3x4OsTNf8m+g6H7f3IiQS0aiFNtduXYlcuT5FHS2fSATTzg5PBon9
1E6BudOve+WyFyBs7hFWAqWFBdWujAl4Qk5Ek09U2ilFEPE7RTgJ
-----END RSA PRIVATE KEY-----
InstanceTypes:
- Name: m4.large
VCPUs: 2
RAM: 7782000000
Scratch: 32000000000
Price: 0.1
- Name: m4.large.spot
Preemptible: true
VCPUs: 2
RAM: 7782000000
Scratch: 32000000000
Price: 0.1
- Name: m4.xlarge
VCPUs: 4
RAM: 15564000000
Scratch: 80000000000
Price: 0.2
- Name: m4.xlarge.spot
Preemptible: true
VCPUs: 4
RAM: 15564000000
Scratch: 80000000000
Price: 0.2
- Name: m4.2xlarge
VCPUs: 8
RAM: 31129000000
Scratch: 160000000000
Price: 0.4
- Name: m4.2xlarge.spot
Preemptible: true
VCPUs: 8
RAM: 31129000000
Scratch: 160000000000
Price: 0.4
Management API¶
APIs for monitoring/diagnostics/control are available via HTTP on a configurable address/port. Request headers must include "Authorization: Bearer {management token}".
Responses are JSON-encoded and resemble other Arvados APIs:
{
"Items": [
{
"Name": "...",
...
},
...
]
}
GET /arvados/v1/dispatch/instances
lists cloud VMs. Each returned item includes:
- provider's instance ID
- hourly price (from configuration file)
- instance type (from configuration file)
- instance type (from provider's menu)
- UUID of the current / most recent container attempted (if known)
- time last container finished (or boot time, if nothing run yet)
GET /arvados/v1/dispatch/containers
lists queued/locked/running containers. Each returned item includes:
- container UUID
- container state (Queued/Locked/Running/Complete/Cancelled)
- desired instance type
- time appeared in queue
- time started (if started)
POST /arvados/v1/dispatch/instances/:instance_id/drain
puts an instance in "drain" state.
- if the instance is currently running a container, it is allowed to continue
- no further containers will be scheduled on the instance
- (TBD) the instance will not be shut down automatically
POST /arvados/v1/dispatch/instances/:instance_id/shutdown
puts an instance in "shutdown" state.
- if the instance is currently running a container, the instance is shut down when the container finishes
- otherwise, the instance is shut down immediately
Metrics¶
Metrics are available via HTTP on a configurable address/port (conventionally :9005). Request headers must include "Authorization: Bearer {management token}".
Metrics include:- [future] (summary) time elapsed between VM creation and first successful SSH connection to that VM
- [future] (summary) time elapsed between first successful SSH connection on a VM and ready to run a container on that VM
- (gauge) total hourly price of all existing VMs
- (gauge) total VCPUs and memory allocated to containers
- (gauge) number of containers running
- (gauge) number of containers allocated to VMs but not started yet (because VMs are pending/booting)
- (gauge) number of containers not allocated to VMs (because provider quota is reached)
Logs¶
For purposes of troubleshooting, a JSON-formatted log entry is printed on stderr when...
..including timestamp and... | |
a new instance is created/ordered | instance type name |
an instance appears on the provider's list of instances | instance ID |
an instance's boot probe succeeds | instance ID |
an instance is shut down after boot timeout | instance ID, stdout/stderr/error from last boot probe attempt |
an instance shutdown is requested | instance ID |
an instance disappears from the provider's list of instances | instance ID and previous state (booting/idle/shutdown) |
a cloud provider API or driver error occurs | provider/driver's error message |
a new container appears in the Arvados queue | container UUID, desired instance type name |
a container is locked by the dispatcher | container UUID |
a crunch-run process is started on an instance | container UUID, instance ID, crunch-run PID |
a crunch-run process fails to start on an instance | container UUID, instance ID, stdout/stderr/exitcode |
a crunch-run process ends | container UUID, instance ID |
an active container's state changes to Complete or Cancelled | container UUID, new state |
an active container is requeued after being locked | container UUID |
an Arvados API error occurs | error message |
(Example log entries should be shown here)
If the dispatcher starts with a non-empty ARVADOS_DEBUG environment variable, it also prints more detailed logs about other internal state changes, using level=debug.
Internal details¶
Scheduling policy¶
The container priority field determines the order in which resources are allocated.- If container C1 has priority P1,
- ...and C2 has higher priority P2,
- ...and there is no pending/booting/idle VM suitable for running C2,
- ...then C1 will not be started.
- If container C1 has priority P1,
- ...and C2 has higher priority P2,
- ...and there is no idle VM suitable for running C2,
- ...and there is a pending/booting VM that will be suitable for running C2 when it comes up,
- ...and there is an idle VM suitable for running C1,
- ...then C1 will start before C2.
Special cases / synchronizing state¶
When first starting up, dispatcher inspects API server’s container queue and the cloud provider’s list of dispatcher-tagged cloud nodes, and restores internal state accordingly.
Some containers might have state=Locked at startup. The dispatcher can't be sure these have no corresponding crunch-run process anywhere until it establishes communication with all running instances. To avoid breaking priority order by guessing wrong, the dispatcher avoids scheduling any new containers until all such "stale-locked" containers are matched up with crunch-run processes on existing VMs (typically preparing a docker image) or all of the existing VMs have been probed successfully (meaning the locked containers aren't running anywhere and need to be rescheduled).
When a user cancels a container request with state=Locked or Running, the container priority changes to 0. On its next poll, the dispatcher notices this and kills any corresponding crunch-run processes (or, if there is no such process, just unlocks the container).
When a crunch-run process ends without finalizing its container's state, the dispatcher notices this and sets state to Cancelled.
Probes¶
Sometimes (on the happy path) the dispatcher knows the state of each worker, whether it's idle, and which container it's running. In general, it's necessary to probe the worker node itself.
Probe:- Check whether the SSH connection is alive; reopen if needed.
- Run the configured "ready?" command (e.g., "grep /encrypted-tmp /etc/mtab"); if this fails, conclude the node is still booting.
- Run "crunch-run --list" to get a list of crunch-run supervisors (pid + container UUID)
Detecting dead/lame nodes¶
If a node has been up for N seconds without a successful probe, despite at least M attempts, it is shut down, even if it was running a container last time it was contacted successfully.
Future plans / features¶
Per-instance-type VM images: It can be useful to run differently configured/tuned kernels/systems on different instance types, use different ops/monitoring systems on preemptible instances, etc. In addition to a system-wide default, each instance type could optionally specify an image.
Selectable VM images: When upgrading a production system, it can be useful to run a few trial containers on a new VM image before making it the default.
Updated by Tom Clegg about 6 years ago · 82 revisions