Bug #4891

closed

[Node Manager] Should not associate node with incorrect arvados node object

Added by Ward Vandewege almost 10 years ago. Updated almost 10 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

When a stale node record happens to have the same IP address as a newly spun-up node, Node Manager immediately associates that stale record with the new node. This happens because the pairing check only tests for a matching ip_address field, on line 327 of services/arvnodeman/computenode/dispatch/__init__.py.

Here's a log example of the incorrect behavior:

2014-12-30_17:56:50.46464 2014-12-30 17:56:50 arvnodeman.nodeup[15616] INFO: Cloud node i-bf491f53 created.
2014-12-30_17:56:50.53332 2014-12-30 17:56:50 arvnodeman.nodeup[15616] WARNING: Client error: InvalidInstanceID.NotFound: The instance ID 'i-bf491f53' does not exist - waiting 1 seconds
2014-12-30_17:56:55.35531 2014-12-30 17:56:55 arvnodeman.nodeup[15616] INFO: i-bf491f53 post-create work done.
2014-12-30_17:56:55.35761 2014-12-30 17:56:55 arvnodeman.computenode[15616] DEBUG: Node i-bf491f53 shutdown window closed.  Next at Tue Dec 30 18:50:50 2014.
2014-12-30_17:56:55.35878 2014-12-30 17:56:55 arvnodeman.cloud_nodes[15616] DEBUG: <pykka.proxy._CallableProxy object at 0x3a7b0a10> subscribed to events for 'i-bf491f53'
2014-12-30_17:58:49.35145 2014-12-30 17:58:49 arvnodeman.daemon[15616] INFO: Registering new cloud node i-bf491f53
2014-12-30_17:58:49.35365 2014-12-30 17:58:49 arvnodeman.daemon[15616] INFO: Cloud node i-bf491f53 has associated with Arvados node 4xphq-7ekkf-2lb4wfedjanwbnr

Note how the cloud node associates with 4xphq-7ekkf-2lb4wfedjanwbnr (compute0) immediately after it is registered. It should have associated with 4xphq-7ekkf-48hebl9u9bc12fv (compute1), but that can't happen until after the new node's first ping.

Here's a log example of normal behavior, where no stale record matches the new node's IP address, so the association doesn't happen until after the new node's first ping:

@4000000054a244792924b124.s:2014-12-29_23:52:37.27175 2014-12-29 23:52:37 arvnodeman.nodeup[27560] INFO: Cloud node i-b4eebb58 created.
@4000000054a244792924b124.s:2014-12-29_23:52:37.44820 2014-12-29 23:52:37 arvnodeman.nodeup[27560] INFO: i-b4eebb58 post-create work done.
@4000000054a244792924b124.s:2014-12-29_23:52:37.45057 2014-12-29 23:52:37 arvnodeman.computenode[27560] DEBUG: Node i-b4eebb58 shutdown window closed.  Next at Tue Dec 30 00:46:36 2014.
@4000000054a244792924b124.s:2014-12-29_23:52:37.45147 2014-12-29 23:52:37 arvnodeman.cloud_nodes[27560] DEBUG: <pykka.proxy._CallableProxy object at 0x32c5dd50> subscribed to events for 'i-b4eebb58'
@4000000054a244792924b124.s:2014-12-29_23:53:35.72121 2014-12-29 23:53:35 arvnodeman.daemon[27560] INFO: Registering new cloud node i-b4eebb58
@4000000054a244792924b124.s:2014-12-29_23:55:35.42040 2014-12-29 23:55:35 arvnodeman.daemon[27560] INFO: Cloud node i-b4eebb58 has associated with Arvados node qr1hi-7ekkf-09hjulgcrpxp1iw

Related: in issue #4887, some code is added to remove duplicate IPs from old node records when a new node pings. But that's too late to help with this problem.
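
To make the failure mode concrete, here is a minimal sketch of the IP-only check described above, next to one possible stricter check. It is not the Node Manager code: the function names, the record layout, the first_ping_at comparison, and all sample values are assumptions made for illustration. Only the ip_address field comes from this report.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse a UTC timestamp such as '2014-12-30T17:58:49Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def ip_only_match(node_record, cloud_ips):
    """Behavior described in this report: pair on ip_address alone."""
    return node_record.get("ip_address") in cloud_ips


def stricter_match(node_record, cloud_ips, cloud_started_at):
    """Hypothetical guard: the record must also have pinged after the
    cloud node was started, so a stale record cannot match."""
    first_ping = node_record.get("first_ping_at")  # assumed field name
    if not first_ping:
        return False
    return (node_record.get("ip_address") in cloud_ips
            and parse_ts(first_ping) >= cloud_started_at)


# Made-up sample data: the IP address and ping times are illustrative only.
cloud_ips = ["10.0.0.5"]
cloud_started_at = datetime(2014, 12, 30, 17, 56, 50, tzinfo=timezone.utc)

stale = {"ip_address": "10.0.0.5", "first_ping_at": "2014-12-29T08:00:00Z"}
fresh = {"ip_address": "10.0.0.5", "first_ping_at": "2014-12-30T17:58:00Z"}
unpinged = {"ip_address": "10.0.0.5", "first_ping_at": None}

print(ip_only_match(stale, cloud_ips))                      # stale record pairs
print(stricter_match(stale, cloud_ips, cloud_started_at))   # stale record rejected
print(stricter_match(fresh, cloud_ips, cloud_started_at))   # pinged record pairs
```

With a guard of this kind, a stale record that shares the new node's IP address is skipped, and pairing waits for a record whose first ping is newer than the cloud node.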


Related issues

Related to Arvados - Bug #4887: [API] Make sure the local dns records for old compute nodes are updated if there's a conflict with a new compute node (Resolved, Ward Vandewege, 12/30/2014)
Is duplicate of Arvados - Bug #4751: [Node Manager] Can erroneously pair cloud nodes with stale Arvados node records (Resolved, Brett Smith, 03/02/2015)
#1

Updated by Ward Vandewege almost 10 years ago

  • Description updated (diff)
#2

Updated by Ward Vandewege almost 10 years ago

  • Description updated (diff)
#3

Updated by Ward Vandewege almost 10 years ago

  • Subject changed from [Node Manager] Should not associated with incorrect arvados node object to [Node Manager] Should not associate node with incorrect arvados node object
#4

Updated by Brett Smith almost 10 years ago

  • Status changed from New to Closed

Duplicates #4751.

#5

Updated by Brett Smith almost 10 years ago

  • Target version deleted (Bug Triage)