Project

General

Profile

Bug #4891

Updated by Ward Vandewege over 9 years ago

 
 When a stale node record exists that happens to match the ip address of a newly spun up node, node manager immediately associates that stale record with the newly spun up node. This is because it only tests for a matching ip_address field on line 327 in services/arvnodeman/computenode/dispatch/__init__.py 

 Here's a log example of the incorrect behavior: 

 <pre> 
 2014-12-30_17:56:50.46464 2014-12-30 17:56:50 arvnodeman.nodeup[15616] INFO: Cloud node i-bf491f53 created. 
 2014-12-30_17:56:50.53332 2014-12-30 17:56:50 arvnodeman.nodeup[15616] WARNING: Client error: InvalidInstanceID.NotFound: The instance ID 'i-bf491f53' does not exist - waiting 1 seconds 
 2014-12-30_17:56:55.35531 2014-12-30 17:56:55 arvnodeman.nodeup[15616] INFO: i-bf491f53 post-create work done. 
 2014-12-30_17:56:55.35761 2014-12-30 17:56:55 arvnodeman.computenode[15616] DEBUG: Node i-bf491f53 shutdown window closed.    Next at Tue Dec 30 18:50:50 2014. 
 2014-12-30_17:56:55.35878 2014-12-30 17:56:55 arvnodeman.cloud_nodes[15616] DEBUG: <pykka.proxy._CallableProxy object at 0x3a7b0a10> subscribed to events for 'i-bf491f53' 
 2014-12-30_17:58:49.35145 2014-12-30 17:58:49 arvnodeman.daemon[15616] INFO: Registering new cloud node i-bf491f53 
 2014-12-30_17:58:49.35365 2014-12-30 17:58:49 arvnodeman.daemon[15616] INFO: Cloud node i-bf491f53 has associated with Arvados node 4xphq-7ekkf-2lb4wfedjanwbnr 
 </pre> 

 Note how it immediately associates with 4xphq-7ekkf-2lb4wfedjanwbnr (compute0) after registering the new cloud node. The correct node object it should have associated with was 4xphq-7ekkf-48hebl9u9bc12fv (compute1), but that can't happen until after the first node ping. 

 Here's a normal example for when there is no IP address match, which means that the association doesn't happen until after the first ping of the new node: 

 <pre> 
 @4000000054a244792924b124.s:2014-12-29_23:52:37.27175 2014-12-29 23:52:37 arvnodeman.nodeup[27560] INFO: Cloud node i-b4eebb58 created. 
 @4000000054a244792924b124.s:2014-12-29_23:52:37.44820 2014-12-29 23:52:37 arvnodeman.nodeup[27560] INFO: i-b4eebb58 post-create work done. 
 @4000000054a244792924b124.s:2014-12-29_23:52:37.45057 2014-12-29 23:52:37 arvnodeman.computenode[27560] DEBUG: Node i-b4eebb58 shutdown window closed.    Next at Tue Dec 30 00:46:36 2014. 
 @4000000054a244792924b124.s:2014-12-29_23:52:37.45147 2014-12-29 23:52:37 arvnodeman.cloud_nodes[27560] DEBUG: <pykka.proxy._CallableProxy object at 0x32c5dd50> subscribed to events for 'i-b4eebb58' 
 @4000000054a244792924b124.s:2014-12-29_23:53:35.72121 2014-12-29 23:53:35 arvnodeman.daemon[27560] INFO: Registering new cloud node i-b4eebb58 
 @4000000054a244792924b124.s:2014-12-29_23:55:35.42040 2014-12-29 23:55:35 arvnodeman.daemon[27560] INFO: Cloud node i-b4eebb58 has associated with Arvados node qr1hi-7ekkf-09hjulgcrpxp1iw 
 </pre> 

 Related: in issue #4887, some code is added to remove duplicate IPs from old node records when a new node pings. But that's too late to help with this problem.

Back