Feature #14325

Updated by Tom Clegg about 2 years ago

This issue covers the smallest version that can be deployed on a dev cluster.

Requirements:
* One cloud vendor driver (Azure = #14324)
* Bring up nodes and run containers on them
* Ops mechanism for draining a node (e.g., curl command using a management token)
* HTTP status report with current set of containers (queued/running) and VMs (busy/idle) -- see [[Dispatching containers to cloud VMs#Operator view]]
* Structured logs for diagnostics+statistics: cloud API errors, node lifecycle, container lifecycle
* Resource consumption metrics (instances running/allocated, hourly cost)
* Shut down idle nodes automatically
* Handle cloud API quota/ratelimit errors
* Cancel containers that can't be scheduled
* Move a node from unknown/booting to drain state automatically if the boot probe fails while containers are running on it


Non-requirements:
* Multiple cloud drivers
* Test suite that uses a real cloud provider
* Performance metrics
* Periodic status reports in logs
* Optimize worker VM deployment (for now, we still expect the operator to provide an image with a suitable version of crunch-run)
* Configurable spending limits

Refs:
* [[Dispatching containers to cloud VMs]]
* #13964 spike
