Feature #14325

Updated by Tom Clegg over 2 years ago

This issue covers the smallest version that can be deployed on a dev cluster.

Background -- already done in #14360:
* Bring up nodes and run containers on them
* Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
* HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see [[Dispatching containers to cloud VMs#Operator view|Operator view]]
* Shut down idle nodes automatically
* Handle cloud API quota/ratelimit errors
* Package-building changes are in place, but commented out

* One cloud vendor driver (Azure = #14324)
* Ops mechanism for draining a node (e.g., curl command using a management token)
* Resource consumption metrics (instances running/allocated, hourly cost)
* Go from unknown/booting to drain state automatically if boot probe fails + containers are running
* Configurable port number for connecting to VM SSH servers
* Pass API host and token to crunch-run command
* Test SSH host key verification
* Test container.Queue using real railsAPI/controller
* Test resuming state after restart (some instances are booting, some idle, some running containers, some on admin-hold)
* Cancel containers that can't be scheduled
* Cancel container after some number of start/requeue cycles
* Cancel container with no suitable instance type
* Enable package build

Undecided (might not be blockers for first dev deploy):
* Update runtime_status field when cancelling containers
* Ops mechanism for hold/release (add tags so hold state survives dispatcher restart)
* Test activity/resource usage metrics
* "Broken node" hook
* crunch-run --detach: retrieve stdout/stderr during probe, and show it in dispatcher logs
* crunch-run --detach: cleanup old stdout/stderr
* Handle cloud API ratelimit errors
* Clean up testing code -- eliminate LameInstanceSet in favor of test.StubDriver
* Eliminate races in lockfile code (currently a probe can interfere with a start -- can fix by locking a second lockfile without LOCK_NB, and using that second lockfile to check for liveness during probe)
* Send SIGKILL if container process still running after several SIGTERM attempts / N seconds after first SIGTERM
* Shut down node if container process still running after several SIGKILL attempts

* Multiple cloud drivers
* Test suite that uses a real cloud provider
* Prometheus metrics (containers in queue, time container queued before starting, workers in each state, etc)
* Periodic status reports in logs
* Optimize worker VM deployment (for now, we still expect the operator to provide an image with a suitable version of crunch-run)
* Configurable spending limits
* Generic driver test suite

* [[Dispatching containers to cloud VMs]]
* #13964 spike