Bug #7286
Updated by Brett Smith about 9 years ago
"Broken" means it is no longer pinging Arvados, _and_ the cloud provider asserts that it is broken, _and_ one broken. (If it only stopped pinging, that may be a network hiccup that may not affect a running compute job, so we should continue to err on the side of the following is true: not interrupting it.) * The cloud node is unpaired, and at least boot_fail_after seconds old. * The cloud node is paired, and the associated Arvados record has status "missing". Steps: * Add a method to node drivers that takes a cloud node record as an argument. It returns True if the record indicates the node is broken, False otherwise. * ComputeNodeMonitorActor suggests its node for shutdown if this it has not pinged Arvados for a while, and the new method returns True, and one of the conditions above is true. * Remove the shutdown_unpaired_node logic from NodeManagerDaemonActor, since the point above effectively moves it to the ComputeNodeMonitorActor. * Update the daemon's _nodes_wanted/_nodes_excess math so that we boot replacements for nodes in this failed state unless we're at max_nodes. You could do this by simply counting them in _nodes_busy, but please be careful: we don't want to accidentally revert commit:e81a3e9b. The daemon should be able to affirmatively know that the node is completely failed. True.