[Node Manager] Should recognize and shut down broken nodes
"Broken" means the cloud provider asserts that it is broken, and one of the following is true:
- The cloud node is unpaired, and at least boot_fail_after seconds old.
- The cloud node is paired, and the associated Arvados record has status "missing".
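The two conditions above might be sketched as follows. All names here are illustrative (a create_time attribute on the cloud record, a dict-like Arvados record, a boot_fail_after value from configuration), not the actual Node Manager API:

```python
import time

BOOT_FAIL_AFTER = 1800  # stand-in for the boot_fail_after config value, in seconds

def meets_broken_conditions(cloud_node, arvados_node, now=None):
    """Return True if a cloud-reported-broken node also satisfies one of
    the two pairing conditions above.  Field names are illustrative."""
    now = now if now is not None else time.time()
    if arvados_node is None:
        # Unpaired: count it only once it is at least boot_fail_after old,
        # so a node that is still booting isn't flagged prematurely.
        return (now - cloud_node.create_time) >= BOOT_FAIL_AFTER
    # Paired: count it only if the Arvados record says the node is missing.
    return arvados_node.get('status') == 'missing'
```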
- Add a method to node drivers that takes a cloud node record and returns True if the record indicates the node is broken, False otherwise.
- ComputeNodeMonitorActor suggests its node for shutdown if this new method returns True, and one of the conditions above is true.
- Remove the shutdown_unpaired_node logic from NodeManagerDaemonActor, since the point above effectively moves it to the ComputeNodeMonitorActor.
- Update the daemon's _nodes_wanted/_nodes_excess math so that we boot replacements for nodes in this failed state, unless we're already at max_nodes. One option is to simply count them in _nodes_busy, but be careful not to accidentally revert e81a3e9b: the daemon should only treat a node this way when it affirmatively knows the node has completely failed.
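One way to think about the sizing math in the last point, as a standalone sketch with hypothetical counters rather than the actual NodeManagerDaemonActor fields: broken nodes contribute no usable capacity (so they trigger replacements), but they still count against max_nodes until they actually shut down.

```python
def nodes_wanted(jobs_queued, nodes_up, nodes_broken, max_nodes):
    """Illustrative sizing calculation.

    jobs_queued  -- jobs waiting for a node
    nodes_up     -- all booted cloud nodes, including broken ones
    nodes_broken -- nodes the daemon affirmatively knows are failed
    max_nodes    -- hard cap on total booted nodes
    """
    usable = nodes_up - nodes_broken      # broken nodes can't run work
    wanted = jobs_queued - usable         # shortfall to cover
    headroom = max_nodes - nodes_up       # broken nodes still occupy the cap
    return max(0, min(wanted, headroom))
```

With 5 jobs, 3 nodes up of which 1 is broken, and max_nodes=10, this boots 3 replacements; at the cap (nodes_up == max_nodes) it boots none until the broken nodes are shut down and leave the count.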