Bug #4891
closed[Node Manager] Should not associate node with incorrect arvados node object
Description
When a stale node record exists that happens to match the ip address of a newly spun up node, node manager immediately associates that stale record with the newly spun up node. This is because it only tests for a matching ip_address field on line 327 in services/arvnodeman/computenode/dispatch/__init__.py
Here's a log example of the incorrect behavior:
2014-12-30_17:56:50.46464 2014-12-30 17:56:50 arvnodeman.nodeup[15616] INFO: Cloud node i-bf491f53 created. 2014-12-30_17:56:50.53332 2014-12-30 17:56:50 arvnodeman.nodeup[15616] WARNING: Client error: InvalidInstanceID.NotFound: The instance ID 'i-bf491f53' does not exist - waiting 1 seconds 2014-12-30_17:56:55.35531 2014-12-30 17:56:55 arvnodeman.nodeup[15616] INFO: i-bf491f53 post-create work done. 2014-12-30_17:56:55.35761 2014-12-30 17:56:55 arvnodeman.computenode[15616] DEBUG: Node i-bf491f53 shutdown window closed. Next at Tue Dec 30 18:50:50 2014. 2014-12-30_17:56:55.35878 2014-12-30 17:56:55 arvnodeman.cloud_nodes[15616] DEBUG: <pykka.proxy._CallableProxy object at 0x3a7b0a10> subscribed to events for 'i-bf491f53' 2014-12-30_17:58:49.35145 2014-12-30 17:58:49 arvnodeman.daemon[15616] INFO: Registering new cloud node i-bf491f53 2014-12-30_17:58:49.35365 2014-12-30 17:58:49 arvnodeman.daemon[15616] INFO: Cloud node i-bf491f53 has associated with Arvados node 4xphq-7ekkf-2lb4wfedjanwbnr
Note how it immediately associates with 4xphq-7ekkf-2lb4wfedjanwbnr (compute0) after registering the new cloud node. The correct node object it should have associated with was 4xphq-7ekkf-48hebl9u9bc12fv (compute1), but that can't happen until after the first node ping.
Here's a normal example for when there is no IP address match, which means that the association doesn't happen until after the first ping of the new node:
@4000000054a244792924b124.s:2014-12-29_23:52:37.27175 2014-12-29 23:52:37 arvnodeman.nodeup[27560] INFO: Cloud node i-b4eebb58 created. @4000000054a244792924b124.s:2014-12-29_23:52:37.44820 2014-12-29 23:52:37 arvnodeman.nodeup[27560] INFO: i-b4eebb58 post-create work done. @4000000054a244792924b124.s:2014-12-29_23:52:37.45057 2014-12-29 23:52:37 arvnodeman.computenode[27560] DEBUG: Node i-b4eebb58 shutdown window closed. Next at Tue Dec 30 00:46:36 2014. @4000000054a244792924b124.s:2014-12-29_23:52:37.45147 2014-12-29 23:52:37 arvnodeman.cloud_nodes[27560] DEBUG: <pykka.proxy._CallableProxy object at 0x32c5dd50> subscribed to events for 'i-b4eebb58' @4000000054a244792924b124.s:2014-12-29_23:53:35.72121 2014-12-29 23:53:35 arvnodeman.daemon[27560] INFO: Registering new cloud node i-b4eebb58 @4000000054a244792924b124.s:2014-12-29_23:55:35.42040 2014-12-29 23:55:35 arvnodeman.daemon[27560] INFO: Cloud node i-b4eebb58 has associated with Arvados node qr1hi-7ekkf-09hjulgcrpxp1iw
Related: in issue #4887, some code is added to remove duplicate IPs from old node records when a new node pings. But that's too late to help with this problem.
Updated by Ward Vandewege almost 10 years ago
- Subject changed from [Node Manager] Should not associated with incorrect arvados node object to [Node Manager] Should not associate node with incorrect arvados node object