Bug #4891

[Node Manager] Should not associate node with incorrect arvados node object

Added by Ward Vandewege almost 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

When a stale node record happens to match the IP address of a newly spun-up node, Node Manager immediately associates that stale record with the new node. This happens because it only tests for a matching ip_address field, on line 327 of services/arvnodeman/computenode/dispatch/__init__.py.

Here's a log example of the incorrect behavior:

2014-12-30_17:56:50.46464 2014-12-30 17:56:50 arvnodeman.nodeup[15616] INFO: Cloud node i-bf491f53 created.
2014-12-30_17:56:50.53332 2014-12-30 17:56:50 arvnodeman.nodeup[15616] WARNING: Client error: InvalidInstanceID.NotFound: The instance ID 'i-bf491f53' does not exist - waiting 1 seconds
2014-12-30_17:56:55.35531 2014-12-30 17:56:55 arvnodeman.nodeup[15616] INFO: i-bf491f53 post-create work done.
2014-12-30_17:56:55.35761 2014-12-30 17:56:55 arvnodeman.computenode[15616] DEBUG: Node i-bf491f53 shutdown window closed.  Next at Tue Dec 30 18:50:50 2014.
2014-12-30_17:56:55.35878 2014-12-30 17:56:55 arvnodeman.cloud_nodes[15616] DEBUG: <pykka.proxy._CallableProxy object at 0x3a7b0a10> subscribed to events for 'i-bf491f53'
2014-12-30_17:58:49.35145 2014-12-30 17:58:49 arvnodeman.daemon[15616] INFO: Registering new cloud node i-bf491f53
2014-12-30_17:58:49.35365 2014-12-30 17:58:49 arvnodeman.daemon[15616] INFO: Cloud node i-bf491f53 has associated with Arvados node 4xphq-7ekkf-2lb4wfedjanwbnr

Note how it associates with 4xphq-7ekkf-2lb4wfedjanwbnr (compute0) immediately after registering the new cloud node. The node object it should have associated with is 4xphq-7ekkf-48hebl9u9bc12fv (compute1), but that association can't happen until after the new node's first ping.
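To make the distinction concrete, here is a minimal sketch of a stricter pairing test. It is not the Node Manager code: the function name `should_pair`, its parameters, and the use of a `first_ping_at` timestamp on the node record are assumptions made for illustration. Only the `ip_address` field comes from the description above. The idea is that an IP match alone is not enough; the record must also have been pinged after the cloud node was created.

```python
from datetime import datetime, timezone


def should_pair(cloud_node_ips, cloud_node_created_at, arvados_node):
    """Decide whether an Arvados node record may be paired with a cloud node.

    Hypothetical helper: `arvados_node` is a plain dict, and the
    `first_ping_at` key is an assumed field holding a datetime (or None).
    """
    # Same test the description mentions: the IP address has to match.
    if arvados_node.get("ip_address") not in cloud_node_ips:
        return False
    # Extra test: a record that has never pinged, or whose first ping
    # predates this cloud node, is treated as stale and is not paired.
    first_ping = arvados_node.get("first_ping_at")
    if first_ping is None:
        return False
    return first_ping >= cloud_node_created_at
```

With a check along these lines, a stale record that shares an IP address would be skipped, and pairing would wait for the record created or refreshed by the new node's first ping.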

Here's a normal example, where there is no IP address match, so the association doesn't happen until after the new node's first ping:

@4000000054a244792924b124.s:2014-12-29_23:52:37.27175 2014-12-29 23:52:37 arvnodeman.nodeup[27560] INFO: Cloud node i-b4eebb58 created.
@4000000054a244792924b124.s:2014-12-29_23:52:37.44820 2014-12-29 23:52:37 arvnodeman.nodeup[27560] INFO: i-b4eebb58 post-create work done.
@4000000054a244792924b124.s:2014-12-29_23:52:37.45057 2014-12-29 23:52:37 arvnodeman.computenode[27560] DEBUG: Node i-b4eebb58 shutdown window closed.  Next at Tue Dec 30 00:46:36 2014.
@4000000054a244792924b124.s:2014-12-29_23:52:37.45147 2014-12-29 23:52:37 arvnodeman.cloud_nodes[27560] DEBUG: <pykka.proxy._CallableProxy object at 0x32c5dd50> subscribed to events for 'i-b4eebb58'
@4000000054a244792924b124.s:2014-12-29_23:53:35.72121 2014-12-29 23:53:35 arvnodeman.daemon[27560] INFO: Registering new cloud node i-b4eebb58
@4000000054a244792924b124.s:2014-12-29_23:55:35.42040 2014-12-29 23:55:35 arvnodeman.daemon[27560] INFO: Cloud node i-b4eebb58 has associated with Arvados node qr1hi-7ekkf-09hjulgcrpxp1iw
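The difference between the two logs is the gap between the "Registering new cloud node" line and the "has associated with Arvados node" line. The sketch below measures that gap from log lines like the ones quoted here. The function name `association_delay` and the regular expression are illustrative, not part of Node Manager, and the parser assumes the `YYYY-MM-DD_HH:MM:SS.fffff` stamp format seen in these excerpts.

```python
import re
from datetime import datetime

# Matches the leading "2014-12-30_17:58:49.35145 " style stamp in each line.
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2})_(\d{2}:\d{2}:\d{2}\.\d+) ")


def association_delay(lines, cloud_node_id):
    """Seconds between registering a cloud node and pairing it, or None."""
    registered = associated = None
    for line in lines:
        match = STAMP.search(line)
        if match is None or cloud_node_id not in line:
            continue
        stamp = datetime.strptime(
            match.group(1) + " " + match.group(2), "%Y-%m-%d %H:%M:%S.%f"
        )
        if "Registering new cloud node" in line:
            registered = stamp
        elif "has associated with Arvados node" in line:
            associated = stamp
    if registered is None or associated is None:
        return None
    return (associated - registered).total_seconds()
```

Applied to the excerpts in this report, the gap for i-bf491f53 is a couple of milliseconds, while for i-b4eebb58 it is roughly two minutes, which is consistent with waiting for the first ping.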

Related: issue #4887 adds code to remove duplicate IPs from old node records when a new node pings, but that happens too late to help with this problem.


Related issues

Related to Arvados - Bug #4887: [API] Make sure the local dns records for old compute nodes are updated if there's a conflict with a new compute node (Resolved, 12/30/2014)

Is duplicate of Arvados - Bug #4751: [Node Manager] Can erroneously pair cloud nodes with stale Arvados node records (Resolved, 03/02/2015)

History

#1 Updated by Ward Vandewege almost 7 years ago

  • Description updated (diff)

#2 Updated by Ward Vandewege almost 7 years ago

  • Description updated (diff)

#3 Updated by Ward Vandewege almost 7 years ago

  • Subject changed from [Node Manager] Should not associated with incorrect arvados node object to [Node Manager] Should not associate node with incorrect arvados node object

#4 Updated by Brett Smith almost 7 years ago

  • Status changed from New to Closed

Duplicates #4751.

#5 Updated by Brett Smith almost 7 years ago

  • Target version deleted (Bug Triage)
