Replace SLURM for cloud job scheduling/dispatching
- Metrics that indicate cloud failure (e.g., time spent trying but failing to create a new instance)
- Integration test that uses a loopback driver to execute crunch-run on localhost (this verifies the interface between dispatcher and crunch-run)
- Add tests for activity/resource usage metrics
- Performance metrics for dispatching (e.g., time between seeing a container in the queue and starting its crunch-run process on a worker); see Dispatching containers to cloud VMs
- Cloud behavior metrics: count unexpected shutdowns, split by instance type
- Configurable spending limits
- Update runtime_status field when cancelling containers after crunch-run crashes or the cloud VM dies without finalizing the container (already done for the “no suitable instance type” case)
- (API) Allow admin users to specify image ID in runtime_constraints; (dispatcher) if present, use runtime_constraints image ID instead of image ID from cluster config file
- Atomically install the correct version of crunch-run (perhaps /proc/self/exe) on the worker VM as part of the boot probe
- Run crunch-run as a non-root user
- Don't require root at all on the cloud instance
- Periodic status reports in logs. This kind of logging should normally (always?) be handled by an external monitoring system that connects to the existing metrics endpoint.
- Cancel containers that take longer than a configurable time limit to schedule (e.g., no nodes ever come up). Unsure whether this is useful: maybe containers should just stay queued until the problem is fixed.