Story #13908

Updated by Tom Clegg over 1 year ago

See [[Dispatching containers to cloud VMs]]

Outstanding TODOs not covered by a linked ticket:
* Metrics that indicate cloud failure (time the dispatcher has spent trying but failing to create a new instance)
* Integration test that uses a loopback driver to execute crunch-run on localhost (this verifies the interface between dispatcher and crunch-run)
* Add tests for activity/resource usage metrics
* Performance metrics for dispatching (e.g., time between seeing a container in the queue and starting its crunch-run process on a worker), see [[Dispatching containers to cloud VMs#Metrics]]
* Cloud behavior metrics: count unexpected shutdowns, split by instance type
* Configurable spending limits
* Update runtime_status field when cancelling containers after crunch-run crashes or the cloud VM dies without finalizing the container (already done for the “no suitable instance type” case)
* (API) Allow admin users to specify image ID in runtime_constraints; (dispatcher) if present, use runtime_constraints image ID instead of image ID from cluster config file
* Atomically install the correct version of crunch-run (perhaps /proc/self/exe) on the worker VM as part of the boot probe
* Run crunch-run as a non-root user
* Don't require root at all on the cloud instance
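
As a rough illustration of the two cloud-related metrics items above (time spent failing to create instances, and unexpected shutdowns split by instance type), the bookkeeping could look something like the following. This is only a sketch: @cloudMetrics@ and all of its methods are hypothetical names, not part of the dispatcher, and a real implementation would presumably publish these values through the existing metrics endpoint rather than keep them in a private struct.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cloudMetrics is a hypothetical in-memory collector used only for
// illustration; none of these names exist in the dispatcher.
type cloudMetrics struct {
	mtx                 sync.Mutex
	unexpectedShutdowns map[string]int // keyed by instance type name
	failingSince        time.Time      // zero unless a failure streak is in progress
	timeSpentFailing    time.Duration  // total length of completed failure streaks
}

func newCloudMetrics() *cloudMetrics {
	return &cloudMetrics{unexpectedShutdowns: map[string]int{}}
}

// UnexpectedShutdown counts an instance of the given type that went
// away without the dispatcher asking for it.
func (m *cloudMetrics) UnexpectedShutdown(instanceType string) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.unexpectedShutdowns[instanceType]++
}

// Shutdowns returns the unexpected shutdown count for one instance type.
func (m *cloudMetrics) Shutdowns(instanceType string) int {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	return m.unexpectedShutdowns[instanceType]
}

// CreateFailed notes a failed create-instance attempt. The first
// failure after a success starts a new failure streak.
func (m *cloudMetrics) CreateFailed(now time.Time) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	if m.failingSince.IsZero() {
		m.failingSince = now
	}
}

// CreateSucceeded ends the current failure streak, if any.
func (m *cloudMetrics) CreateSucceeded(now time.Time) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	if !m.failingSince.IsZero() {
		m.timeSpentFailing += now.Sub(m.failingSince)
		m.failingSince = time.Time{}
	}
}

// TimeSpentFailing returns the total time spent in failure streaks,
// including the streak still in progress at time now.
func (m *cloudMetrics) TimeSpentFailing(now time.Time) time.Duration {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	d := m.timeSpentFailing
	if !m.failingSince.IsZero() {
		d += now.Sub(m.failingSince)
	}
	return d
}

func main() {
	m := newCloudMetrics()
	start := time.Now()
	m.CreateFailed(start)
	m.CreateSucceeded(start.Add(90 * time.Second))
	m.UnexpectedShutdown("example-type") // placeholder instance type name
	fmt.Println(m.TimeSpentFailing(start.Add(2*time.Minute)), m.Shutdowns("example-type"))
}
```

Passing @now@ explicitly instead of calling @time.Now()@ inside the methods is a choice made here so the logic can be tested without sleeping.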

Outstanding TODO-or-maybe-not-TODOs not covered by a linked ticket:
* Periodic status reports in logs. This kind of logging should normally (always?) be handled by an external monitoring system that connects to the existing metrics endpoint.
* Cancel containers that take longer than a configurable time limit to schedule (e.g., no nodes ever come up). Unsure whether this is useful: maybe containers should just stay queued until the problem is fixed.
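
If the scheduling time limit turns out to be worth having, the selection step might be as small as the sketch below. Everything here is an assumption made for illustration: @queuedContainer@, its @FirstSeen@ field, and @overdue@ are hypothetical names, and a limit of zero is taken to mean "feature disabled" so that containers stay queued indefinitely by default.

```go
package main

import (
	"fmt"
	"time"
)

// queuedContainer is a hypothetical record of a container waiting in
// the queue; FirstSeen is when the dispatcher first noticed it.
type queuedContainer struct {
	UUID      string
	FirstSeen time.Time
}

// overdue returns the UUIDs of containers that have been queued longer
// than limit. A limit of zero (or less) disables the check entirely.
func overdue(queue []queuedContainer, now time.Time, limit time.Duration) []string {
	if limit <= 0 {
		return nil
	}
	var uuids []string
	for _, c := range queue {
		if now.Sub(c.FirstSeen) > limit {
			uuids = append(uuids, c.UUID)
		}
	}
	return uuids
}

func main() {
	now := time.Now()
	queue := []queuedContainer{
		{UUID: "container-a", FirstSeen: now.Add(-2 * time.Hour)},
		{UUID: "container-b", FirstSeen: now.Add(-5 * time.Minute)},
	}
	// With a one-hour limit, only the container queued two hours ago
	// is selected for cancellation.
	fmt.Println(overdue(queue, now, time.Hour))
}
```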
