Bug #4751

closed

[Node Manager] Can erroneously pair cloud nodes with stale Arvados node records

Added by Brett Smith over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Story points: 0.5

Description

Node Manager pairs cloud nodes with Arvados node records based solely on an IP address match. See arvnodeman.computenode.dispatch.ComputeNodeMonitorActor.offer_arvados_pair.

It can happen that a cloud node comes up with an IP address that happens to match a stale Arvados node record. Make the testing stricter so there's no pairing in this case.
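To illustrate the failure mode, here is a minimal sketch. It is not the actual Node Manager code: the dictionaries, their keys, and the `ip_only_match` helper are hypothetical stand-ins, and only the idea of matching on IP address alone comes from the description above.

```python
def ip_only_match(cloud_node, arvados_node):
    """Pair purely on IP address (hypothetical stand-in for the existing check)."""
    return cloud_node["private_ip"] == arvados_node["ip_address"]


# A stale record left behind by a node that was shut down earlier.
stale_record = {"uuid": "stale-record", "ip_address": "10.0.0.5"}

# A freshly booted cloud node that happens to be handed the same address.
new_cloud_node = {"id": "new-node", "private_ip": "10.0.0.5"}

# The match succeeds even though the record describes a different machine.
paired = ip_only_match(new_cloud_node, stale_record)
print(paired)  # True
```

Because nothing other than the address is compared, the new node inherits whatever state the stale record carries.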


Subtasks 1 (0 open, 1 closed)

Task #5351: Review 4751-node-manager-stricter-node-pairing-wip | Resolved | Peter Amstutz | 03/02/2015

Related issues

Related to Arvados - Support #5251: [Support] Fix bugs and write tests (first half) | Resolved | Brett Smith
Has duplicate Arvados - Bug #4891: [Node Manager] Should not associate node with incorrect arvados node object | Closed
Has duplicate Arvados - Bug #5292: [Node Manager] Failed to recognize busy node on qr1hi | Closed | 02/23/2015
Copied from Arvados - Idea #4293: [Node Manager] Write off cloud nodes that spend too long in booted state | Resolved | Brett Smith | 10/27/2014
Actions #1

Updated by Brett Smith over 9 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg over 9 years ago

  • Story points set to 0.5
Actions #3

Updated by Tom Clegg over 9 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints
Actions #4

Updated by Brett Smith over 9 years ago

I think there are basically two possible approaches:

  • EC2 compute nodes, at least, put their EC2 ID in the Arvados node record's info. If we check against that, we can't go wrong, but the downside is that we would have to reimplement this check for every cloud driver.
  • Check that the Arvados node's first_ping_at is later than the cloud node's boot time before accepting a pairing. This is completely generic and very safe, although it could still go wrong if total garbage data is getting into the node records.

I think I prefer the second approach, but I wanted to note the alternatives at least.
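The timestamp approach could be sketched roughly as follows. This is an illustration under stated assumptions, not the change that was eventually committed: the dictionary layout, the `private_ip`, `ip_address`, and `boot_time` keys, the timestamp format, and the `parse_timestamp` and `can_pair` helpers are all hypothetical. Only the `first_ping_at` comparison against the cloud node's boot time comes from the comment above.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a UTC timestamp like '2015-03-02T12:00:00Z' (format is an assumption)."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def can_pair(cloud_node, arvados_node):
    """Accept a pairing only if the IPs match and the record was first pinged after boot."""
    if cloud_node["private_ip"] != arvados_node["ip_address"]:
        return False
    first_ping_at = arvados_node.get("first_ping_at")
    if first_ping_at is None:
        # A record that has never been pinged cannot belong to a running node.
        return False
    return parse_timestamp(first_ping_at) > cloud_node["boot_time"]


boot = datetime(2015, 3, 2, 12, 0, 0, tzinfo=timezone.utc)
cloud = {"private_ip": "10.0.0.5", "boot_time": boot}

# Same IP address, but the record was first pinged before this node booted.
stale = {"ip_address": "10.0.0.5", "first_ping_at": "2015-02-20T08:00:00Z"}
# Same IP address, first pinged shortly after this node booted.
fresh = {"ip_address": "10.0.0.5", "first_ping_at": "2015-03-02T12:01:30Z"}

print(can_pair(cloud, stale))  # False
print(can_pair(cloud, fresh))  # True
```

The check stays independent of any cloud driver, which is the main advantage noted for this approach. It still relies on the boot time and first_ping_at values being sane.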

Actions #5

Updated by Brett Smith about 9 years ago

  • Target version changed from Arvados Future Sprints to 2015-03-11 sprint

Moving to current sprint because it came up again during science support, and it's likely to become more pressing now that we've increased our max_nodes setting.

Actions #6

Updated by Peter Amstutz about 9 years ago

I feel like this came up in an earlier code review (discussing the pitfalls of reusing compute node records generally), so it's good to tighten up the check.

4751-node-manager-stricter-node-pairing-wip LGTM

Actions #7

Updated by Brett Smith about 9 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:6be95f5c3a2fcbe6321bba52c20393060e33e637.

Actions #8

Updated by Brett Smith about 9 years ago

  • Assigned To set to Brett Smith