Feature #14325

Updated by Tom Clegg about 2 years ago

This issue covers the smallest version that can be deployed on a dev cluster.

Background -- already done in #14360:
* Bring up nodes and run containers on them
* HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see [[Dispatching containers to cloud VMs#Operator view]] "Operator view"
* Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
* Shutdown idle nodes automatically
* Handle cloud API quota/ratelimit errors
* Package-building is

Requirements:
* One cloud vendor driver (Azure = #14324)
* Ops mechanism for draining a node (e.g., curl command using a management token)
* Resource consumption metrics (instances running/allocated, hourly cost)
* Go from unknown/booting to drain state automatically if boot probe fails + containers are running
* Configurable port number for connecting to VM SSH servers
* Pass API host and token to crunch-run command
* Test SSH host key verification
* Test container.Queue using real railsAPI/controller
* Test resuming state after restart (some instances are booting, some idle, some running containers, some on admin-hold)
* Cancel containers that can't be scheduled
* Cancel container after some number of start/requeue cycles
* Cancel container with no suitable instance type
* Enable package build
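
As an illustration of the "unknown/booting to drain" requirement above, the following is a minimal sketch, not the dispatcher's actual code. The @State@ type, its constants, and @afterBootProbe@ are hypothetical names chosen for this example. The behaviour on a successful probe (move to running or idle) is an assumption; only the drain rule comes from the requirement list.

```go
package main

import "fmt"

// State is a hypothetical worker lifecycle state. The names follow the
// wording of the requirement, not the dispatcher's real types.
type State string

const (
	StateUnknown  State = "unknown"
	StateBooting  State = "booting"
	StateIdle     State = "idle"
	StateRunning  State = "running"
	StateDraining State = "drain"
)

// afterBootProbe returns the state a worker should be in after one boot probe.
// Rule taken from the requirement: a worker in unknown/booting state whose
// boot probe fails while containers are running on it moves to drain.
// Assumption (not from the requirement): a successful probe moves the worker
// to running or idle depending on whether it has containers.
func afterBootProbe(cur State, probeOK bool, runningContainers int) State {
	if cur != StateUnknown && cur != StateBooting {
		// The rule only applies to workers that have not finished booting.
		return cur
	}
	if probeOK {
		if runningContainers > 0 {
			return StateRunning
		}
		return StateIdle
	}
	if runningContainers > 0 {
		// Probe failed but containers are still running: stop scheduling
		// new work here and let the existing containers finish.
		return StateDraining
	}
	// Probe failed and nothing is running: keep waiting for boot.
	return cur
}

func main() {
	// A booting worker found with one running container and a failed probe.
	fmt.Println(afterBootProbe(StateBooting, false, 1)) // prints "drain"
}
```

Keeping the rule in a pure function like this would also make it easy to cover in the "resuming state after restart" tests listed above, since each combination of state, probe result, and container count can be checked without a real VM.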


Undecided: (might not be blockers for first dev deploy)
* Update runtime_status field when cancelling containers
* Ops mechanism for hold/release (add tags so hold state survives dispatcher restart)
* Test activity/resource usage metrics
* "Broken node" hook
* crunch-run --detach: retrieve stdout/stderr during probe, and show it in dispatcher logs
* crunch-run --detach: cleanup old stdout/stderr
* Handle cloud API ratelimit errors
* Clean up testing code -- eliminate LameInstanceSet in favor of test.StubDriver

Non-requirements:
* Multiple cloud drivers
* Test suite that uses a real cloud provider
* Performance metrics
* Periodic status reports in logs
* Optimize worker VM deployment (for now, we still expect the operator to provide an image with a suitable version of crunch-run)
* Configurable spending limits
* Generic driver test suite


Refs
* [[Dispatching containers to cloud VMs]]
* #13964 spike
