Feature #14325

Updated by Tom Clegg over 2 years ago

This issue covers the smallest version of [[Dispatching containers to cloud VMs]] that can be deployed on a dev cluster.

Background -- already done in #14360:
* Bring up nodes and run containers on them
* Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
* HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see [[Dispatching containers to cloud VMs#Operator view]]
* Shutdown idle nodes automatically
* Handle cloud API quota errors
* Package-building changes are in place, but commented out

Requirements covered here:
* Ops mechanism for draining a node (e.g., curl command using a management token) -- see [[Dispatching containers to cloud VMs#Management API]]
* Resource consumption metrics (number of instances, number of containers running, total hourly price of all existing VMs) -- see [[Dispatching containers to cloud VMs#Metrics]]
* Drain (rather than kill) instances that exist at startup and fail the boot probe but are already running containers -- see [[Dispatching containers to cloud VMs#Special cases / synchronizing state]]
* Configurable port number for connecting to VM SSH servers
* Pass API host and dispatcher's token to crunch-run command via @ARVADOS_API_*@ environment variables
* Test SSH host key verification (dispatcher's token is not sent to a remote host unless the host's SSH key passes the VerifyHostKey() method provided by the cloud driver)
* Test container.Queue using real railsAPI/controller
* Test resuming state after restart (some instances are booting, some idle, some running containers, some draining, some on admin-hold)
* Cancel container after some number of start/requeue cycles (i.e., @crunch-run --detach@ succeeded, but child exited without moving container past Locked state)
* Cancel container with no suitable instance type
* Enable package build
* Handle cloud API rate-limit errors (obey the holdoff time returned by the driver; include a test)
* Update management API response format (lowercase keys)
* Ensure all probe failures are logged once instance is booted (see #14360#note-38)

Requirements covered elsewhere:
* One cloud vendor driver (Azure = #14324)

Non-requirements (can wait until after first dev deploy):
* Update runtime_status field when cancelling containers
* Ops mechanism for hold/release (add tags so hold state survives dispatcher restart)
* Test activity/resource usage metrics
* crunch-run --detach: retrieve stdout/stderr during probe, and show it in dispatcher logs
* crunch-run --detach: cleanup old stdout/stderr
* Clean up testing code -- eliminate LameInstanceSet in favor of test.StubDriver
* Send SIGKILL if container process still running after several SIGTERM attempts / N seconds after first SIGTERM
* Shutdown node if container process still running after several SIGKILL attempts
* "Broken node" hook
* Multiple cloud drivers
* Test suite that uses a real cloud provider
* Prometheus metrics (containers in queue, time container queued before starting, workers in each state, etc)
* Periodic status reports in logs
* Optimize worker VM deployment (for now, we still expect the operator to provide an image with a suitable version of crunch-run)
* Configurable spending limits
* Generic driver test suite
* Cancel containers that don't get scheduled after some time limit (no nodes ever come up?)
* Rate-limit or serialize creation of new instance types (when multiple instance types are created concurrently, the cloud might create the lower-priority types but then reach quota before creating the higher-priority types; see #14360#note-36)

Related:
* [[Dispatching containers to cloud VMs]]
* #13964 spike