Bug #5736

[Node Manager] Reuse node records after shutting down the cloud node set up with them

Added by Brett Smith over 4 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: Node Manager
Target version:
Start date: 05/11/2015
Due date:
% Done: 100%
Estimated time: (Total: 1.00 h)
Story points: 1.0

Description

When Node Manager uses an Arvados node record to set up a new compute node, it records the time setup starts. It won't reuse that record for another setup for node_stale_time seconds, even if the setup is aborted because the cloud node is no longer needed, or the node fails to pair with its Arvados record. It would help if Node Manager could reuse those records sooner; we just hit slot_number exhaustion on a cluster because records were not being reused for this reason.

This is non-trivial, because Node Manager retains no memory of what Arvados node record was used to set up a compute node. ComputeNodeMonitorActor can be initialized with an arvados_node, but the daemon is relying on that being initialized as None, and set during the pairing process, to detect unpaired nodes. Implementing this will require more involved state tracking.
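For illustration, the reuse gate described above might look like the following sketch. The function name and the assignment_time field are assumptions for the example, not the actual Node Manager code; the point is only that a record stays off-limits for node_stale_time seconds after it last seeded a setup, regardless of whether that setup succeeded.

```python
import time

NODE_STALE_TIME = 14400  # seconds; the node_stale_time setting (value illustrative)

def find_reusable_node_record(arvados_node_records, now=None):
    """Return the first Arvados node record eligible to seed a new setup.

    A record is skipped if it was handed to a setup within the last
    NODE_STALE_TIME seconds, even when that setup was aborted or the
    cloud node never paired -- the behavior this ticket asks to improve.
    """
    if now is None:
        now = time.time()
    for record in arvados_node_records:
        # 'assignment_time' is a hypothetical field standing in for
        # whatever timestamp Node Manager records at setup start.
        last_setup = record.get('assignment_time') or 0
        if (now - last_setup) > NODE_STALE_TIME:
            return record
    return None
```

Under this model, an aborted setup leaves assignment_time set, so the record is unavailable for up to four hours even though no cloud node is using it.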

Conflicts with #4129.


Subtasks

Task #5995: Review 5736-node-manager-easy-slot-cleanup-wip (Resolved, Ward Vandewege)


Related issues

Related to Arvados - Bug #4129: [Node Manager] Don't reuse node table records (New)

Associated revisions

Revision f116df2e
Added by Brett Smith over 4 years ago

Merge branch '5736-node-manager-easy-slot-cleanup-wip'

Refs #5736. Closes #5995.

History

#1 Updated by Brett Smith over 4 years ago

If it helps, here's one daemon test I wrote for the behavior.

    def test_old_arvados_node_reused_again_after_shutdown(self):
        arv_node = testutil.arvados_node_mock(4, age=9000)
        size = testutil.MockSize(4)
        self.make_daemon(arvados_nodes=[arv_node])
        self.daemon.update_server_wishlist([size]).get(self.TIMEOUT)
        self.last_setup.cloud_node.get.return_value = (
            testutil.cloud_node_mock(5))
        self.daemon.node_up(self.last_setup).get(self.TIMEOUT)
        # Now deliver the "shutdown unpaired node" message.
        self.timer.deliver()
        self.daemon.update_server_wishlist([size]).get(self.TIMEOUT)
        self.stop_proxy(self.daemon)
        used_nodes = [call[1].get('arvados_node')
                      for call in self.node_setup.start.call_args_list]
        self.assertEqual(2, len(used_nodes))
        self.assertIs(arv_node, used_nodes[0])
        self.assertIs(arv_node, used_nodes[1])

As of this writing, the last assertion fails.

#2 Updated by Brett Smith over 4 years ago

  • Assigned To set to Brett Smith
  • Target version changed from Arvados Future Sprints to 2015-05-20 sprint

#3 Updated by Brett Smith over 4 years ago

5736-node-manager-easy-slot-cleanup-wip is up for review. It teaches Node Manager to clean up node records after shutting down paired nodes. This case is easy to address, and should help prevent slot number exhaustion under normal operation.

It does not clean up records when nodes fail to bootstrap, so it does not completely solve the problem, or close this ticket. But it gets us most of the benefit and is much easier to do, so Ward asked for it to be addressed separately sooner, and that seemed sensible to me.

#4 Updated by Ward Vandewege over 4 years ago

Test run clean, LGTM.

#5 Updated by Brett Smith over 4 years ago

  • Target version changed from 2015-05-20 sprint to Arvados Future Sprints

Addressing the harder case (where we boot a node, it fails to bootstrap, and we need to recycle the node record we seeded it with) can come later.

#6 Updated by Tom Morris almost 3 years ago

  • Assigned To changed from Brett Smith to Tom Morris

#7 Updated by Tom Morris over 2 years ago

  • Assigned To deleted (Tom Morris)
