Bug #7286

Updated by Brett Smith about 9 years ago

"Broken" means it is no longer pinging Arvados, _and_ the cloud provider asserts that it is broken. (If it only stopped pinging, that may be a network hiccup that does not affect a running compute job, so we should continue to err on the side of not interrupting it.) In addition, one of the following must be true:

 * The cloud node is unpaired, and at least boot_fail_after seconds old. 
 * The cloud node is paired, and the associated Arvados record has status "missing". 
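The definition above can be sketched as a single predicate. This is only an illustration under stated assumptions: the @CloudNode@ and @ArvadosNode@ classes, their field names, and the @is_broken@ function are hypothetical stand-ins, not the actual Node Manager or driver API.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CloudNode:
    # Hypothetical stand-in for a cloud node record.
    created_at: float           # Unix timestamp of node creation
    provider_says_broken: bool  # state reported by the cloud provider


@dataclass
class ArvadosNode:
    # Hypothetical stand-in for the paired Arvados node record.
    status: str


def is_broken(cloud_node: CloudNode,
              arvados_node: Optional[ArvadosNode],
              pinging: bool,
              boot_fail_after: float,
              now: Optional[float] = None) -> bool:
    """Return True only if every part of the "broken" definition holds."""
    now = time.time() if now is None else now
    # A node that still pings, or that the provider considers healthy,
    # is never treated as broken.
    if pinging or not cloud_node.provider_says_broken:
        return False
    if arvados_node is None:
        # Unpaired: broken once it is at least boot_fail_after seconds old.
        return (now - cloud_node.created_at) >= boot_fail_after
    # Paired: broken only if the Arvados record has status "missing".
    return arvados_node.status == "missing"
```

Passing @now@ explicitly keeps the check deterministic in tests; callers that omit it get the current time.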

 Steps: 

 * Add a method to node drivers that takes a cloud node record as an argument.    It returns True if the record indicates the node is broken, False otherwise. 
 * ComputeNodeMonitorActor suggests its node for shutdown if it has not pinged Arvados for a while, and the new method returns True, and one of the conditions above is true. 
 * Remove the shutdown_unpaired_node logic from NodeManagerDaemonActor, since the point above effectively moves it to the ComputeNodeMonitorActor. 
 * Update the daemon's _nodes_wanted/_nodes_excess math so that we boot replacements for nodes in this failed state unless we're at max_nodes. You could do this by simply counting them in _nodes_busy, but please be careful: we don't want to accidentally revert commit:e81a3e9b. The daemon should be able to affirmatively know that the node is completely failed.
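The replacement arithmetic in the last step might look roughly like the following. This is a simplified sketch, not the daemon's real _nodes_wanted code: the @nodes_wanted@ function and its parameters are hypothetical, and it assumes broken nodes still exist in the cloud and therefore still count against max_nodes.

```python
def nodes_wanted(queue_size: int,
                 total_count: int,
                 broken_count: int,
                 max_nodes: int) -> int:
    """How many new nodes to boot (hypothetical, simplified arithmetic).

    total_count  -- all cloud nodes that exist, including broken ones
    broken_count -- nodes affirmatively known to be completely failed
    """
    # Broken nodes cannot run jobs, so they do not satisfy demand.
    usable = total_count - broken_count
    shortfall = queue_size - usable
    # Broken nodes still occupy cloud slots until they are shut down,
    # so the room left under max_nodes is based on the full count.
    room = max_nodes - total_count
    return max(0, min(shortfall, room))
```

For example, with three queued jobs, three nodes of which one is broken, and @max_nodes@ of five, the sketch asks for one replacement; with the cloud already at @max_nodes@ it asks for none.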
