Bug #5736
closed
[Node Manager] Reuse node records after shutting down the cloud node set up with them
Description
When Node Manager uses an Arvados node record to set up a new compute node, it records the time that setup starts. It won't reuse this record for another setup for node_stale_time seconds, even if the setup is aborted because the cloud node is no longer needed, or fails to pair with its Arvados node. It would help if Node Manager could reuse those records faster; we just ran into an issue where there was slot_number exhaustion on a cluster because it wasn't reusing records for this reason.
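To make the behavior above concrete, here is a minimal sketch of the reuse check as described. It is not Node Manager's actual code: `NodeRecord`, `setup_started_at`, and `is_reusable` are hypothetical names chosen for illustration, and only `node_stale_time` and `slot_number` come from this ticket.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeRecord:
    """Hypothetical stand-in for an Arvados node record."""
    slot_number: int
    # Time (seconds since the epoch) when a setup last claimed this record,
    # or None if no setup has ever used it.
    setup_started_at: Optional[float] = None


def is_reusable(record, node_stale_time, now=None):
    """Return True if the record may seed another setup.

    Mirrors the behavior described in this ticket: a record that was
    claimed by a setup stays off-limits until node_stale_time seconds
    have passed, regardless of whether that setup is still running.
    """
    if now is None:
        now = time.time()
    if record.setup_started_at is None:
        return True
    return (now - record.setup_started_at) >= node_stale_time
```

Under this check, a record claimed by a setup that was aborted a few seconds later still counts as in use until the stale window expires, which is how slot numbers can run out.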
This is non-trivial, because Node Manager retains no memory of what Arvados node record was used to set up a compute node. ComputeNodeMonitorActor can be initialized with an arvados_node, but the daemon is relying on that being initialized as None, and set during the pairing process, to detect unpaired nodes. Implementing this will require more involved state tracking.
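One possible shape for that extra state tracking, sketched under assumptions: keep a separate map from cloud node ID to the Arvados record that seeded it, so the monitor actor can still start with arvados_node set to None. `SetupRecordTracker` and its methods are hypothetical and do not exist in Node Manager.

```python
class SetupRecordTracker:
    """Hypothetical helper: remembers which Arvados node record was used
    to set up each cloud node, independently of the pairing state."""

    def __init__(self):
        # cloud node ID -> Arvados node record used to seed that node
        self._seeded = {}

    def remember(self, cloud_node_id, arvados_record):
        """Note that arvados_record was used to set up cloud_node_id."""
        self._seeded[cloud_node_id] = arvados_record

    def release(self, cloud_node_id):
        """Forget the cloud node and return its seed record, or None.

        The caller could then mark the returned record as immediately
        reusable instead of waiting for node_stale_time to elapse.
        """
        return self._seeded.pop(cloud_node_id, None)
```

The daemon would call `remember` when a setup starts and `release` when the cloud node is shut down, whether or not the node ever paired. How the released record gets marked as reusable is left open here.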
Conflicts with #4129.
Updated by Brett Smith almost 10 years ago
If it helps, here's one daemon test I wrote for the behavior.
def test_old_arvados_node_reused_again_after_shutdown(self):
    arv_node = testutil.arvados_node_mock(4, age=9000)
    size = testutil.MockSize(4)
    self.make_daemon(arvados_nodes=[arv_node])
    self.daemon.update_server_wishlist([size]).get(self.TIMEOUT)
    self.last_setup.cloud_node.get.return_value = (
        testutil.cloud_node_mock(5))
    self.daemon.node_up(self.last_setup).get(self.TIMEOUT)
    # Now deliver the "shutdown unpaired node" message.
    self.timer.deliver()
    self.daemon.update_server_wishlist([size]).get(self.TIMEOUT)
    self.stop_proxy(self.daemon)
    used_nodes = [call[1].get('arvados_node')
                  for call in self.node_setup.start.call_args_list]
    self.assertEqual(2, len(used_nodes))
    self.assertIs(arv_node, used_nodes[0])
    self.assertIs(arv_node, used_nodes[1])
As of this writing, the last assertion fails.
Updated by Brett Smith almost 10 years ago
- Assigned To set to Brett Smith
- Target version changed from Arvados Future Sprints to 2015-05-20 sprint
Updated by Brett Smith almost 10 years ago
5736-node-manager-easy-slot-cleanup-wip is up for review. It teaches Node Manager to clean up node records after shutting down paired nodes. This case is easy to address, and should help prevent slot number exhaustion under normal operation.
It does not clean up records when nodes fail to bootstrap, so it does not completely solve the problem, or close this ticket. But it gets us most of the benefit and is much easier to do, so Ward asked for it to be addressed separately sooner, and that seemed sensible to me.
Updated by Brett Smith almost 10 years ago
- Target version changed from 2015-05-20 sprint to Arvados Future Sprints
Addressing the harder case (where we boot a node, it fails to bootstrap, and we need to recycle the node record we seeded it with) can come later.
Updated by Tom Morris over 8 years ago
- Assigned To changed from Brett Smith to Tom Morris
Updated by Ward Vandewege over 4 years ago
- Target version deleted (Arvados Future Sprints)