[Node Manager] Reuse node records after shutting down the cloud node set up with them
When Node Manager uses an Arvados node record to set up a new compute node, it records the time that setup starts. It won't reuse this record for another setup for node_stale_time seconds, even if the setup is aborted because the cloud node is no longer needed, or fails to pair with its Arvados node. It would help if Node Manager could reuse those records sooner; we just ran into an issue where there was slot_number exhaustion on a cluster because it wasn't reusing records for this reason.
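To make the current rule concrete, here is a minimal sketch of the staleness check described above. It is not Node Manager's actual code: the function name and the setup_started_at argument are hypothetical, and only node_stale_time comes from the description.

```python
import time


def record_is_reusable(setup_started_at, node_stale_time, now=None):
    """Return True when a node record may be used for a new setup.

    setup_started_at is a hypothetical field: the epoch time (seconds) at
    which the record was last handed to a setup, or None if it never was.
    node_stale_time is the configured waiting period in seconds.
    """
    if setup_started_at is None:
        # Never used for a setup, so nothing to wait for.
        return True
    if now is None:
        now = time.time()
    # The record stays reserved until node_stale_time seconds have passed,
    # whether or not the setup it was used for is still running.
    return (now - setup_started_at) >= node_stale_time
```

Under this rule, a record whose setup was aborted after a few seconds still stays unavailable for the full node_stale_time, which is how a cluster can run out of slot numbers.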
This is non-trivial, because Node Manager retains no memory of which Arvados node record was used to set up a compute node. ComputeNodeMonitorActor can be initialized with an arvados_node, but the daemon relies on that starting out as None and being set during the pairing process, so that it can detect unpaired nodes. Implementing this will require more involved state tracking.
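One possible shape for that state tracking, as a rough sketch only: keep the setup-time record in a separate map keyed by cloud node ID, so the monitor's arvados_node can still start as None until pairing. The class and method names below are hypothetical and do not exist in Node Manager.

```python
class SetupRecordTracker:
    """Remember which Arvados node record each cloud node was set up with.

    Hypothetical helper: it is kept apart from pairing state, so unpaired
    nodes can still be detected by a monitor whose arvados_node is None.
    """

    def __init__(self):
        self._record_by_cloud_id = {}

    def setup_started(self, cloud_node_id, arvados_node):
        # Called when a setup begins using this Arvados node record.
        self._record_by_cloud_id[cloud_node_id] = arvados_node

    def node_shut_down(self, cloud_node_id):
        # Return the record that can be released for reuse, or None if
        # this cloud node was not set up through the tracker.
        return self._record_by_cloud_id.pop(cloud_node_id, None)
```

The daemon could consult node_shut_down() when a cloud node goes away and make the returned record eligible for reuse immediately, instead of waiting out node_stale_time.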
Conflicts with #4129.
#1 Updated by Brett Smith about 4 years ago
If it helps, here's one daemon test I wrote for the behavior.
    def test_old_arvados_node_reused_again_after_shutdown(self):
        arv_node = testutil.arvados_node_mock(4, age=9000)
        size = testutil.MockSize(4)
        self.make_daemon(arvados_nodes=[arv_node])
        self.daemon.update_server_wishlist([size]).get(self.TIMEOUT)
        self.last_setup.cloud_node.get.return_value = (
            testutil.cloud_node_mock(5))
        self.daemon.node_up(self.last_setup).get(self.TIMEOUT)
        # Now deliver the "shutdown unpaired node" message.
        self.timer.deliver()
        self.daemon.update_server_wishlist([size]).get(self.TIMEOUT)
        self.stop_proxy(self.daemon)
        # Each entry in call_args_list is an (args, kwargs) pair.
        used_nodes = [call[1].get('arvados_node')
                      for call in self.node_setup.start.call_args_list]
        self.assertEqual(2, len(used_nodes))
        # Both setups should have used the same Arvados node record.
        self.assertIs(arv_node, used_nodes[0])
        self.assertIs(arv_node, used_nodes[1])
As of this writing, the last assertion fails.
#3 Updated by Brett Smith about 4 years ago
5736-node-manager-easy-slot-cleanup-wip is up for review. It teaches Node Manager to clean up node records after shutting down paired nodes. This case is easy to address, and should help prevent slot number exhaustion under normal operation.
It does not clean up records when nodes fail to bootstrap, so it does not completely solve the problem, or close this ticket. But it gets us most of the benefit and is much easier to do, so Ward asked for it to be addressed separately sooner, and that seemed sensible to me.