Idea #14807

Updated by Tom Clegg about 5 years ago

Improvements necessary to run in production: 
 * Send SIGKILL if the container process is still running after several SIGTERM attempts, or N seconds after the first SIGTERM 
 * Shut down the node if the container process is still running after several SIGKILL attempts 
 * Propagate configured "check for broken node" script name to crunch-run 
 * Send detached crunch-run stdout+stderr to systemd journal so sysadmin can make subsequent arrangements if needed 
 * Configurable rate limit for Create and Destroy calls to cloud API (background: reaching API call rate limits can cause penalties; also, when multiple instance types are created concurrently, the cloud might create the lower-priority types but then reach quota before creating the higher-priority types; see #14360#note-36) 
 * Metrics: total cost of nodes in idle or booting state 
 * Metrics: total cost of nodes with admin-hold flag set 
 * Log when any instance goes down unexpectedly (i.e., state != Shutdown when deleted from list) 

 Improvements that are desired, but not necessary to run in production (noted here for clarity until they move to their own tickets): 
 * crunch-run --detach: retrieve stdout/stderr during probe, and show it in dispatcher logs 
 * crunch-run --detach: cleanup old stdout/stderr 
 * Metrics that indicate cloud failure (time we’ve spent trying but failing to create a new instance) 
 * Test suite that uses a real cloud provider 
 * Test activity/resource usage metrics 
 * Multiple cloud drivers 
 * Generic driver test suite 
 * Performance metrics for dispatching (e.g., time between seeing a container in the queue and starting its crunch-run process on a worker) 
 * Optimize worker VM deployment (e.g., automatically install a matching version of crunch-run on each worker) 
 * Configurable spending limits 
 * Update runtime_status field when cancelling containers after crunch-run crashes or the cloud VM dies without finalizing the container (already done for the “no suitable instance type” case) 
 * If present, use VM image ID given in runtime_constraints instead of image ID from cluster config file 
 * (API) Allow admin users to specify image ID in runtime_constraints 
 * Metrics: count unexpected shutdowns, split by instance type 
 * Don't add "crunch" user in Azure driver (either add the key to root's authorized_keys if that's confirmed not to delay the boot process, or don't do it at all) 
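
As a rough sketch of the "unexpected shutdown" items (log when an instance disappears with state != Shutdown, and count such events per instance type), the following compares successive instance lists. The types `Instance` and `Tracker`, the state names, and the `Update` method are assumptions made for illustration and do not reflect the dispatcher's real data structures.

```go
package main

import "sort"

// State names are illustrative only.
type State string

const (
	StateIdle     State = "idle"
	StateRunning  State = "running"
	StateShutdown State = "shutdown"
)

// Instance is a hypothetical view of one cloud VM as the dispatcher
// last saw it.
type Instance struct {
	ID    string
	Type  string
	State State
}

// Tracker remembers the last known instances and counts unexpected
// disappearances, keyed by instance type.
type Tracker struct {
	known      map[string]Instance
	Unexpected map[string]int
}

func NewTracker() *Tracker {
	return &Tracker{
		known:      map[string]Instance{},
		Unexpected: map[string]int{},
	}
}

// Update records the latest instance list and returns the sorted IDs of
// instances that vanished while their last known state was not
// StateShutdown. A caller would log these and export the counters.
func (t *Tracker) Update(current []Instance) []string {
	seen := make(map[string]bool, len(current))
	for _, inst := range current {
		seen[inst.ID] = true
		t.known[inst.ID] = inst
	}
	var gone []string
	for id, inst := range t.known {
		if seen[id] {
			continue
		}
		if inst.State != StateShutdown {
			t.Unexpected[inst.Type]++
			gone = append(gone, id)
		}
		// Forget the instance either way; it is no longer in the list.
		delete(t.known, id)
	}
	sort.Strings(gone)
	return gone
}
```

The per-type counter map could back a metric split by instance type, while the returned IDs give the dispatcher something concrete to write to its log.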

 Improvements that might never be implemented at all (noted here for clarity): 
 * Periodic status reports in logs. This kind of logging should normally (always?) be handled by an external monitoring system that connects to the existing metrics endpoint. 
 * Cancel containers that take longer than a configurable time limit to schedule (e.g., no nodes ever come up). Unsure whether this is useful: maybe containers should just stay queued until the problem is fixed. 
