Idea #13908

Updated by Tom Clegg almost 5 years ago

See [[Dispatching containers to cloud VMs]] 

 Outstanding TODOs not covered by a linked ticket: 
 * Metrics that indicate cloud failure (e.g., time we’ve spent trying but failing to create a new instance) 
 * Integration test that uses a loopback driver to execute crunch-run on localhost (this verifies the interface between dispatcher and crunch-run) 
 * Add tests for activity/resource usage metrics 
 * Performance metrics for dispatching (e.g., time between seeing a container in the queue and starting its crunch-run process on a worker), see [[Dispatching containers to cloud VMs#Metrics]] 
 * Cloud behavior metrics: count unexpected shutdowns, split by instance type 
 * Configurable spending limits 
 * Update runtime_status field when cancelling containers after crunch-run crashes or the cloud VM dies without finalizing the container (already done for the “no suitable instance type” case) 
 * (API) Allow admin users to specify image ID in runtime_constraints; (dispatcher) if present, use runtime_constraints image ID instead of image ID from cluster config file 
 * Atomically install correct version of crunch-run (perhaps /proc/self/exe) to worker VM as part of boot probe 
 * Run crunch-run as a non-root user 
 * Don't require root at all on the cloud instance 

 Outstanding TODO-or-maybe-not-TODOs not covered by a linked ticket: 
 * Periodic status reports in logs. This kind of logging should normally (always?) be handled by an external monitoring system that connects to the existing metrics endpoint. 
 * Cancel containers that take longer than a configurable time limit to schedule (e.g., no nodes ever come up). Unsure whether this is useful: maybe containers should just stay queued until the problem is fixed. 
