Story #6313

[Node Manager] Booting nodes shouldn't satisfy min_nodes

Added by Brett Smith over 4 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: Node Manager
Target version:
Start date: 06/11/2015
Due date:
% Done: 0%
Estimated time:
Story points: 1.0

Description

The scenario

  • Imagine Node Manager is configured with min_nodes = 1, running on a cluster where a compute node is sitting idle.
  • Two jobs are submitted simultaneously.
  • Node Manager checks the queue, and starts booting a new node.
  • The two jobs both run very quickly on the idle node.
  • The next time Node Manager polls the job queue, it's empty.

Currently, in this situation, Node Manager shuts down the idle node. It wants to shut down something because it's managing more nodes than it needs (2 > 1), and it can't shut down the booting node because that node's shutdown window isn't open yet. This surprises users: on the Dashboard, the number of running nodes appears to drop below min_nodes, because the booting node hasn't pinged Arvados yet.
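
To make the arithmetic concrete, here is a toy model of the behavior described above. It is not Node Manager's actual code: the Node class, its fields, and pick_shutdown_current are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for a managed cloud node.
    name: str
    paired: bool                # True once the node has pinged Arvados
    shutdown_window_open: bool  # True when the node is eligible for shutdown


def pick_shutdown_current(nodes, min_nodes):
    """Model of the current behavior: every managed node counts toward min_nodes."""
    if len(nodes) <= min_nodes:
        return None  # not managing more nodes than needed
    for node in nodes:
        if node.shutdown_window_open:
            return node  # first node whose shutdown window is open
    return None


# The scenario: one idle paired node, one node still booting.
nodes = [
    Node("idle", paired=True, shutdown_window_open=True),
    Node("booting", paired=False, shutdown_window_open=False),
]
victim = pick_shutdown_current(nodes, min_nodes=1)
remaining_paired = sum(n.paired for n in nodes if n is not victim)
```

In this model the idle node is chosen, and the count of paired nodes left over is 0, which is below min_nodes = 1.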

Proposed fix

Node Manager should decline to shut down a node if doing so would cause the number of paired nodes to fall below min_nodes.
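
A minimal sketch of that check, assuming Node Manager can count how many of its nodes are currently paired. The function name and parameters are hypothetical and do not come from the Node Manager code base.

```python
def may_shut_down(paired_count: int, node_is_paired: bool, min_nodes: int) -> bool:
    """Return True if shutting down one node keeps paired nodes >= min_nodes.

    paired_count   -- nodes currently paired with Arvados (assumed to be known)
    node_is_paired -- whether the shutdown candidate is one of those paired nodes
    """
    paired_after = paired_count - (1 if node_is_paired else 0)
    return paired_after >= min_nodes


# Scenario from this ticket: one paired idle node, one booting node, min_nodes = 1.
idle_ok = may_shut_down(paired_count=1, node_is_paired=True, min_nodes=1)

# Once the booting node has paired, there are two paired nodes and one can go.
later_ok = may_shut_down(paired_count=2, node_is_paired=True, min_nodes=1)
```

With this guard the idle node is kept until the booting node has paired, after which either node can be shut down without dropping below min_nodes.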

Possible extension: Node Manager should boot a node if fewer than min_nodes nodes are paired with Arvados, unless that would cause the number of cloud nodes to exceed max_nodes. Assume that fresh nodes will eventually pair with Arvados; they just haven't yet.
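
The extension could be sketched as follows. The names are hypothetical, and the sketch follows the rule literally: unpaired (fresh) nodes count toward the cloud total but not toward the paired total. Whether a node that is already booting should suppress a further boot is left open here, as it is in the description.

```python
def should_boot_for_min_nodes(paired: int, cloud_total: int,
                              min_nodes: int, max_nodes: int) -> bool:
    """Return True if a node should be booted to satisfy min_nodes.

    paired      -- nodes currently paired with Arvados
    cloud_total -- all cloud nodes, including fresh ones that have not paired yet
    """
    if paired >= min_nodes:
        return False  # enough paired nodes already
    # Booting one more node must not push the cloud total past max_nodes.
    return cloud_total + 1 <= max_nodes


# Hypothetical numbers: no paired nodes, one fresh node, room under max_nodes.
boot_with_room = should_boot_for_min_nodes(paired=0, cloud_total=1,
                                           min_nodes=1, max_nodes=8)

# Same shortage of paired nodes, but the cloud is already at max_nodes.
boot_at_cap = should_boot_for_min_nodes(paired=0, cloud_total=8,
                                        min_nodes=1, max_nodes=8)
```

At the cap, the function declines to boot and relies on the assumption that the existing fresh nodes will pair eventually.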

This is just one idea. Other solutions may be better.
