Better spot instance support
Due 12/31/2023
- Spot instances are currently a site-wide on/off choice; they can't be chosen per workflow
- Have to duplicate instance types in the config (obnoxious) (see #18596)
- Records the wrong price (uses the price from the instance type config, not the actual price reported by the cloud)
- Scheduling choices are too narrow; it should be possible to request a different node type when the one you want isn't available
- Could we query spot prices on the fly to make scheduling decisions?
- Try bigger instance types but only bid the spot price for the smallest node type
- Should eventually escalate to an on-demand instance if no spot instance is available
- User should be able to communicate cost tolerance
- Want to try other availability zones, but that requires the feature of Keepstore running on compute nodes (#16516)
- Need a better way to handle spot instance shutdown
- Maybe just always retry on a regular-cost node
- Consider shutting down spot instances after a job because there is a timer?
- Need to research this more
- Can the VM be frozen / restored?
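
The scheduling ideas above (try bigger instance types, bid only the smallest type's spot price, escalate to on-demand, respect a user cost tolerance) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the existing dispatcher logic: `InstanceType`, `Placement`, `plan_placements`, and the `spot_prices` mapping (standing in for prices queried from the cloud on the fly) are all hypothetical names, and `allow_spot` stands in for a per-workflow spot setting.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class InstanceType:
    # Hypothetical, simplified view of an instance type config entry.
    name: str
    vcpus: int
    ram_gib: float
    on_demand_price: float  # USD per hour


@dataclass(frozen=True)
class Placement:
    # One attempt the scheduler would make, in order.
    instance_type: InstanceType
    spot: bool
    max_price: float  # USD per hour we are willing to pay


def plan_placements(
    types: List[InstanceType],
    spot_prices: Dict[str, float],  # assumed: current spot price by type name
    min_vcpus: int,
    min_ram_gib: float,
    max_hourly_cost: float,  # the user's cost tolerance
    allow_spot: bool = True,  # assumed per-workflow choice
) -> List[Placement]:
    """Return the placements to attempt, cheapest option first."""
    # Keep only types that satisfy the request, smallest (cheapest) first.
    candidates = sorted(
        (t for t in types if t.vcpus >= min_vcpus and t.ram_gib >= min_ram_gib),
        key=lambda t: t.on_demand_price,
    )
    if not candidates:
        return []

    plan: List[Placement] = []
    smallest = candidates[0]

    # Spot attempts: try every fitting type, but bid only the current
    # spot price of the smallest one.
    bid = spot_prices.get(smallest.name)
    if allow_spot and bid is not None and bid <= max_hourly_cost:
        plan.extend(Placement(t, spot=True, max_price=bid) for t in candidates)

    # Escalation: fall back to on-demand for the smallest fitting type,
    # but only if the user's cost tolerance covers it.
    if smallest.on_demand_price <= max_hourly_cost:
        plan.append(
            Placement(smallest, spot=False, max_price=smallest.on_demand_price)
        )
    return plan


if __name__ == "__main__":
    # Made-up types and prices, purely for illustration.
    demo_types = [
        InstanceType("small", 2, 4.0, 0.10),
        InstanceType("medium", 4, 8.0, 0.20),
        InstanceType("large", 8, 16.0, 0.40),
    ]
    demo_spot = {"small": 0.03, "medium": 0.07, "large": 0.15}
    for p in plan_placements(demo_types, demo_spot, 2, 4.0, max_hourly_cost=0.25):
        kind = "spot" if p.spot else "on-demand"
        print(f"{p.instance_type.name}: {kind}, max ${p.max_price:.2f}/h")
```

In this sketch the cost tolerance does double duty: it gates whether spot is attempted at all and whether the on-demand escalation is allowed. Whether those should be separate settings is an open design question.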