Story #8561

[Node Manager] Pair nodes based on ec2_instance_id rather than IP address

Added by Brett Smith over 5 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Radhika Chippada
Category:
Node Manager
Target version:
Start date:
02/26/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Our ping script in cloud nodes ensure that compute nodes set their "ec2_instance_id" when they ping Arvados. This is a unique identifier for the node in the cloud, although exactly where it comes from in the API varies across this cloud. It is stored in the info field of the Arvados node record.

Node Manager currently compares the IP address of the cloud node record and the Arvados node record to pair nodes. Determining a node's IP address requires an extra API call in Azure, and is relatively expensive. Instead, Node Manager should compare the cloud node's unique identifier with the Arvados node's ec2_instance_id. This may require a new method on cloud node drivers to account for differences across clouds (e.g., on one cloud the identifier is "id" and on another it's "name").

  • Add a new class method node_id(cls, node) to the cloud node drivers in arvnodeman/computenode/driver.
    • In the base driver in __init__.py, this should raise NotImplementedError.
    • In the EC2 and GCE drivers, this returns node.id.
    • In the Azure driver, this returns node.name.
  • In arvnodeman/computenode/dispatch/__init__.py, in the offer_arvados_pair method, replace the condition (arvados_node['ip_address'] in self.cloud_node.private_ips) with (arvados_node['info'].get('ec2_instance_id') == self._cloud.node_id(self.cloud_node)).

Set the flag for the Azure libcloud driver to tell it that it no longer needs to query node IP addresses, to improve the performance of Node Manager on Azure.

  • In Node Manager's Azure compute node driver's list_nodes method, when we call super(ComputeNodeDriver, self).list_nodes(), add the argument ex_fetch_nic=False to list_nodes().

Subtasks

Task #8606: Review branch: 8561-node-pairingResolvedPeter Amstutz


Related issues

Related to Arvados - Bug #8385: [NodeManager] Getting node list on Azure takes a long timeDuplicate

Associated revisions

Revision 5768e5a7
Added by Radhika Chippada over 5 years ago

closes #8561
Merge branch '8561-node-pairing'

History

#1 Updated by Brett Smith over 5 years ago

  • Target version set to Arvados Future Sprints

#2 Updated by Brett Smith over 5 years ago

  • Target version changed from Arvados Future Sprints to 2016-03-16 sprint

#3 Updated by Brett Smith over 5 years ago

  • Story points set to 1.0

#4 Updated by Brett Smith over 5 years ago

  • Assigned To set to Radhika Chippada

Brett to detail the necessary changes.

#5 Updated by Brett Smith over 5 years ago

  • Description updated (diff)

#6 Updated by Radhika Chippada over 5 years ago

Brett: d43df73d4429ddd4c00c7ff47e10f9f2595d30b0 implements the steps in the story description.

However, 9 tests are failing. They are failing because arvados_node['info'].get('ec2_instance_id') is None in the test. Do we need to update the testutil.py to set this to node.id? I played with setting it in cloud_node_mock, but this is not the right place / manner. Thanks.

#7 Updated by Brett Smith over 5 years ago

Radhika Chippada wrote:

However, 9 tests are failing. They are failing because arvados_node['info'].get('ec2_instance_id') is None in the test. Do we need to update the testutil.py to set this to node.id? I played with setting it in cloud_node_mock, but this is not the right place / manner. Thanks.

You'll need to set it in arvados_node_mock, inside the node['info'] dictionary. The same place where ping_secret is currently set.

#8 Updated by Radhika Chippada over 5 years ago

You'll need to set it in arvados_node_mock, inside the node['info'] dictionary. The same place where ping_secret is currently set.

i had already tried this, but the issue I am having is what do I set 'ec2_instance_id' as? If I use node id (node_num), I am getting '2' for this where as node.id is '140130973094224'

How do I determine the right value for ec2_instance_id? Thanks.

#9 Updated by Brett Smith over 5 years ago

First, a bit of background: the Node Manager code is pretty consistent about using "arvados_node" and "cloud_node" names. arvados_node refers to a Node record that came from the Arvados API server, while cloud node refers to information about a node that came from the local cloud API.

In each pairing test, you'll need to ensure that ec2_instance_id matches the right attribute on the mock cloud node object that the test creates. On EC2 and GCE, the right attribute is the cloud node's id attribute. On Azure, it's the cloud node's name attribute. (Because of this split, you'll probably need to write a separate test for each cloud driver.)

As long as you make sure those fields line up in the mock objects before the tests start calling actual Node Manager code, you should be good to go.

#10 Updated by Radhika Chippada over 5 years ago

  • Status changed from New to In Progress

(After much hair pulling) I learned that the issue is with our mock test setup. We need to add "self.cloud_factory().node_id.return_value" to test_daemon.py

Making progress ...

#11 Updated by Radhika Chippada over 5 years ago

Branch 8561-node-pairing b720184c408dee0e3e9bbafb424bd736703be172

  • Made the code updates suggested in the description.
  • There were 9 test failures that needed some test setup updates.
    • Needed to add "self.cloud_factory().node_id.return_value" to test_daemon.py. I had hardcoded '2' for this (for now). Does it need to be dynamic using node_num or something? I could not tell how to get "node_num" in this context. Please clarify. Down to 2 test failures with this.
  • Please shed light on how to get the other two failing tests to pass: test_node_pairing_after_arvados_update from test_daemon and test_arvados_node_match from test_computenode_dispatch

Thanks.

#12 Updated by Radhika Chippada over 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:5768e5a7559c22f86d89be65245b0d269f06cc6c.

Also available in: Atom PDF