Bug #4891

Updated by Ward Vandewege almost 7 years ago

When a stale node record exists that happens to match the ip address of a newly spun up node, node manager immediately associates that stale record with the newly spun up node. This is because it only tests for a matching ip_address field on line 327 in services/arvnodeman/computenode/dispatch/__init__.py

Here's a log example of the incorrect behavior:

2014-12-30_17:56:50.46464 2014-12-30 17:56:50 arvnodeman.nodeup[15616] INFO: Cloud node i-bf491f53 created.
2014-12-30_17:56:50.53332 2014-12-30 17:56:50 arvnodeman.nodeup[15616] WARNING: Client error: InvalidInstanceID.NotFound: The instance ID 'i-bf491f53' does not exist - waiting 1 seconds
2014-12-30_17:56:55.35531 2014-12-30 17:56:55 arvnodeman.nodeup[15616] INFO: i-bf491f53 post-create work done.
2014-12-30_17:56:55.35761 2014-12-30 17:56:55 arvnodeman.computenode[15616] DEBUG: Node i-bf491f53 shutdown window closed. Next at Tue Dec 30 18:50:50 2014.
2014-12-30_17:56:55.35878 2014-12-30 17:56:55 arvnodeman.cloud_nodes[15616] DEBUG: <pykka.proxy._CallableProxy object at 0x3a7b0a10> subscribed to events for 'i-bf491f53'
2014-12-30_17:58:49.35145 2014-12-30 17:58:49 arvnodeman.daemon[15616] INFO: Registering new cloud node i-bf491f53
2014-12-30_17:58:49.35365 2014-12-30 17:58:49 arvnodeman.daemon[15616] INFO: Cloud node i-bf491f53 has associated with Arvados node 4xphq-7ekkf-2lb4wfedjanwbnr

Note how it immediately associates with 4xphq-7ekkf-2lb4wfedjanwbnr (compute0) after registering the new cloud node. The correct node object it should have associated with was 4xphq-7ekkf-48hebl9u9bc12fv (compute1), but that can't happen until after the first node ping.

Here's a normal example for when there is no IP address match, which means that the association doesn't happen until after the first ping of the new node:

@4000000054a244792924b124.s:2014-12-29_23:52:37.27175 2014-12-29 23:52:37 arvnodeman.nodeup[27560] INFO: Cloud node i-b4eebb58 created.
@4000000054a244792924b124.s:2014-12-29_23:52:37.44820 2014-12-29 23:52:37 arvnodeman.nodeup[27560] INFO: i-b4eebb58 post-create work done.
@4000000054a244792924b124.s:2014-12-29_23:52:37.45057 2014-12-29 23:52:37 arvnodeman.computenode[27560] DEBUG: Node i-b4eebb58 shutdown window closed. Next at Tue Dec 30 00:46:36 2014.
@4000000054a244792924b124.s:2014-12-29_23:52:37.45147 2014-12-29 23:52:37 arvnodeman.cloud_nodes[27560] DEBUG: <pykka.proxy._CallableProxy object at 0x32c5dd50> subscribed to events for 'i-b4eebb58'
@4000000054a244792924b124.s:2014-12-29_23:53:35.72121 2014-12-29 23:53:35 arvnodeman.daemon[27560] INFO: Registering new cloud node i-b4eebb58
@4000000054a244792924b124.s:2014-12-29_23:55:35.42040 2014-12-29 23:55:35 arvnodeman.daemon[27560] INFO: Cloud node i-b4eebb58 has associated with Arvados node qr1hi-7ekkf-09hjulgcrpxp1iw