[Node Manager] Booting nodes shouldn't satisfy min_nodes
- Imagine Node Manager is configured with min_nodes = 1, running on a cluster where a compute node is sitting idle.
- Two jobs are submitted simultaneously.
- Node Manager checks the queue, and starts booting a new node.
- The two jobs both run very quickly on the idle node.
- The next time Node Manager polls the job queue, it's empty.
Currently, in this situation, Node Manager will shut down the idle node. It wants to shut down a node because it's managing more nodes than it needs (2 > 1), and it can't shut down the booting node because that node's shutdown window isn't open yet. This surprises users: from the Dashboard, the number of running nodes appears to drop below min_nodes, since the booting node hasn't pinged Arvados yet.
Node Manager should decline to shut down a node if doing so would cause the number of paired nodes to fall below min_nodes.
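A minimal sketch of that rule, with hypothetical names (`can_shut_down`, `paired_nodes`) that don't correspond to Node Manager's actual internals: only a paired node's shutdown reduces the paired count, so shutting down an unpaired booting node is always safe with respect to min_nodes.

```python
def can_shut_down(node, paired_nodes, min_nodes):
    """Return True if shutting down `node` would not drop the number
    of nodes paired with Arvados below min_nodes.

    Hypothetical helper: names and signature are illustrative only.
    """
    paired_count = len(paired_nodes)
    # Only decrement the count if the candidate itself is paired;
    # an unpaired (still-booting) node doesn't contribute to it.
    if node in paired_nodes:
        paired_count -= 1
    return paired_count >= min_nodes
```

With min_nodes = 1 and a single paired node, this declines to shut it down even while a second (unpaired) node is booting, which matches what the Dashboard leads users to expect.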
Possible extension: Node Manager should boot a node if fewer than min_nodes nodes are paired with Arvados, unless that would cause the number of cloud nodes to exceed max_nodes. Assume that fresh nodes will eventually pair with Arvados; they just haven't yet.
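The extension could be sketched as a small sizing calculation (again with hypothetical names, not Node Manager's real API): boot enough nodes to bring the paired count up to min_nodes, capped by the max_nodes limit on total cloud nodes. Unpaired booting nodes count toward the cap but not toward min_nodes, on the assumption that they will pair eventually.

```python
def nodes_to_boot(paired_count, cloud_count, min_nodes, max_nodes):
    """Number of additional nodes to boot so that paired nodes can
    reach min_nodes, without total cloud nodes exceeding max_nodes.

    Hypothetical helper for illustration; paired_count <= cloud_count,
    since booting-but-unpaired nodes are cloud nodes too.
    """
    wanted = max(min_nodes - paired_count, 0)      # shortfall vs. min_nodes
    headroom = max(max_nodes - cloud_count, 0)     # room under max_nodes
    return min(wanted, headroom)
```

In the scenario above (one unpaired booting node, zero paired, min_nodes = 1), this would boot one more node only if max_nodes allows it; otherwise it waits for the booting node to pair.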
This is just one idea. Other solutions may be better.