Bug #5736


[Node Manager] Reuse node records after shutting down the cloud node set up with them

Added by Brett Smith about 9 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Node Manager
Target version: -
Story points: 1.0

Description

When Node Manager uses an Arvados node record to set up a new compute node, it records the time that setup starts. It won't reuse that record for another setup for node_stale_time seconds, even if the setup is aborted because the cloud node is no longer needed, or the new node fails to pair with its Arvados node record. It would help if Node Manager could reuse those records sooner; we just ran into slot_number exhaustion on a cluster because records were not being reused for this reason.

This is non-trivial, because Node Manager retains no memory of what Arvados node record was used to set up a compute node. ComputeNodeMonitorActor can be initialized with an arvados_node, but the daemon is relying on that being initialized as None, and set during the pairing process, to detect unpaired nodes. Implementing this will require more involved state tracking.
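
For reference, the reuse rule described above amounts to a simple time comparison, roughly as in the sketch below. The helper name and the setup_started_at parameter are placeholders for illustration, not the actual Node Manager internals.

    import time

    def arvados_record_reusable(setup_started_at, node_stale_time, now=None):
        """Sketch of the reuse rule: a record used to seed a setup is held
        back for node_stale_time seconds, even if that setup is aborted or
        the cloud node never pairs with its Arvados record."""
        if setup_started_at is None:
            return True  # never used to seed a setup
        if now is None:
            now = time.time()
        return (now - setup_started_at) > node_stale_time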

Conflicts with #4129.


Subtasks 1 (0 open, 1 closed)

Task #5995: Review 5736-node-manager-easy-slot-cleanup-wip (Resolved, Ward Vandewege, 05/11/2015)

Related issues

Related to Arvados - Bug #4129: [Node Manager] Don't reuse node table records (Closed)
#1

Updated by Brett Smith about 9 years ago

If it helps, here's one daemon test I wrote for the behavior.

    def test_old_arvados_node_reused_again_after_shutdown(self):
        # An old Arvados node record the daemon can use to seed node setups.
        arv_node = testutil.arvados_node_mock(4, age=9000)
        size = testutil.MockSize(4)
        self.make_daemon(arvados_nodes=[arv_node])
        # First wishlist update starts a setup seeded with arv_node.
        self.daemon.update_server_wishlist([size]).get(self.TIMEOUT)
        self.last_setup.cloud_node.get.return_value = (
            testutil.cloud_node_mock(5))
        self.daemon.node_up(self.last_setup).get(self.TIMEOUT)
        # Now deliver the "shutdown unpaired node" message.
        self.timer.deliver()
        # After the shutdown, the second setup should reuse the same record.
        self.daemon.update_server_wishlist([size]).get(self.TIMEOUT)
        self.stop_proxy(self.daemon)
        used_nodes = [call[1].get('arvados_node')
                      for call in self.node_setup.start.call_args_list]
        self.assertEqual(2, len(used_nodes))
        self.assertIs(arv_node, used_nodes[0])
        self.assertIs(arv_node, used_nodes[1])

As of this writing, the last assertion fails.

#2

Updated by Brett Smith almost 9 years ago

  • Assigned To set to Brett Smith
  • Target version changed from Arvados Future Sprints to 2015-05-20 sprint
#3

Updated by Brett Smith almost 9 years ago

5736-node-manager-easy-slot-cleanup-wip is up for review. It teaches Node Manager to clean up node records after shutting down paired nodes. This case is easy to address, and should help prevent slot number exhaustion under normal operation.

It does not clean up records when nodes fail to bootstrap, so it does not completely solve the problem or close this ticket. But it delivers most of the benefit and is much easier to do, so Ward asked for it to be addressed separately and sooner, which seemed sensible to me.
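
For context, the cleanup amounts to resetting the fields on the Arvados node record that mark it as in use, so the record (and its slot_number) looks free for the next setup. The sketch below is illustrative only, not the branch code; the specific fields and the nodes().update() call pattern are assumptions.

    import arvados

    def clean_up_node_record(node_uuid):
        """Illustrative sketch: mark an Arvados node record as unused after
        its cloud node has been shut down. Field choices are assumptions."""
        api = arvados.api('v1')
        api.nodes().update(uuid=node_uuid, body={
            'hostname': None,      # assumed: release the hostname
            'ip_address': None,    # assumed: no cloud node behind it now
            'first_ping_at': None,
            'last_ping_at': None,
        }).execute()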

#4

Updated by Ward Vandewege almost 9 years ago

Test run clean, LGTM.

#5

Updated by Brett Smith almost 9 years ago

  • Target version changed from 2015-05-20 sprint to Arvados Future Sprints

Addressing the harder case (where we boot a node, it fails to bootstrap, and we need to recycle the node record we seeded it with) can come later.

#6

Updated by Tom Morris over 7 years ago

  • Assigned To changed from Brett Smith to Tom Morris
#7

Updated by Tom Morris about 7 years ago

  • Assigned To deleted (Tom Morris)
#8

Updated by Ward Vandewege over 3 years ago

  • Status changed from New to Closed
#9

Updated by Ward Vandewege over 3 years ago

  • Target version deleted (Arvados Future Sprints)
