Story #4293

[Node Manager] Write off cloud nodes that spend too long in booted state

Added by Brett Smith about 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: Node Manager
Target version:
Start date: 10/27/2014
Due date:
% Done: 100%
Estimated time: (Total: 2.00 h)
Story points: 1.5

Description

If the cloud has an internal error starting a node, Node Manager won't shut it down until the normal shutdown window opens. There should be a separate timer for this case: if a cloud node doesn't appear in the node listing within N minutes (configurable), assume it failed to start, and shut it down.
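A minimal sketch of the timer check described above, not the code that was merged. The function name, the `boot_fail_after` parameter (in seconds), and the data shapes are assumptions made for illustration; the description only specifies a configurable N-minute limit.

```python
import time


def nodes_to_write_off(boot_times, listed_ids, boot_fail_after, now=None):
    """Return IDs of nodes that were started but never showed up in time.

    boot_times: dict mapping node ID -> time the boot was requested (epoch seconds).
    listed_ids: set of node IDs that have appeared in the node listing.
    boot_fail_after: hypothetical configurable limit, in seconds.
    """
    now = time.time() if now is None else now
    return sorted(
        node_id
        for node_id, started_at in boot_times.items()
        if node_id not in listed_ids and now - started_at >= boot_fail_after
    )
```

A caller would run this on a timer and start a shutdown for each returned ID, independently of the normal shutdown window.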


Subtasks

Task #4322: [Node Manager] Should not pair cloud and Arvados nodes immediately after booting (Resolved, Brett Smith)

Task #4732: Review 4293-node-manager-timed-bootstrap-wip (Resolved, Peter Amstutz)


Related issues

Copied to Arvados - Bug #4751: [Node Manager] Can erroneously pair cloud nodes with stale Arvados node records (Resolved, 03/02/2015)

Associated revisions

Revision 8141501a
Added by Brett Smith almost 6 years ago

Merge branch '4293-node-manager-timed-bootstrap-wip'

Closes #4293, #4732. Refs #4380.

History

#1 Updated by Brett Smith about 6 years ago

  • Story points set to 1.0

#2 Updated by Brett Smith about 6 years ago

  • Target version set to Bug Triage

#3 Updated by Brett Smith about 6 years ago

  • Tracker changed from Bug to Story

#4 Updated by Ward Vandewege about 6 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints

#5 Updated by Ward Vandewege about 6 years ago

  • Target version changed from Arvados Future Sprints to 2014-11-19 sprint

#6 Updated by Brett Smith about 6 years ago

  • Assigned To set to Brett Smith

#7 Updated by Brett Smith about 6 years ago

  • Story points changed from 1.0 to 1.5

Fixing this effectively requires fixing #4322, which was originally filed separately. Adjusting this story to account for this.

#8 Updated by Brett Smith about 6 years ago

  • Target version changed from 2014-11-19 sprint to 2014-12-10 sprint

#9 Updated by Brett Smith almost 6 years ago

  • Status changed from New to In Progress

#10 Updated by Peter Amstutz almost 6 years ago

A few comments:

  • It's a little confusing to have "cloud_nodes", "booting", "booted", and "shutdowns", where the state of the node depends on which collection it is in. (From the comments I see that the different dicts don't quite hold the same thing, so maybe it's justified, but it seems more complex than representing a node as a single record whose state changes over time.) In particular, this is confusing, as it is not obvious why the "cloud_nodes" and "booted" sets should be disjoint:
    for record_dict in [self.cloud_nodes, self.booted]:
    
  • Arvados nodes get paired with cloud nodes based on IP address. Is it possible that a (reused) Arvados node record could have a stale IP address and get a bogus pairing because the compute node IP address gets reused?
  • Is there a race condition if the node starts talking to Arvados after the "node is taking too long" shutdown is initiated?

#11 Updated by Brett Smith almost 6 years ago

Peter Amstutz wrote:

A few comments:

  • It's a little confusing to have "cloud_nodes", "booting", "booted", and "shutdowns", where the state of the node depends on which collection it is in. (From the comments I see that the different dicts don't quite hold the same thing, so maybe it's justified, but it seems more complex than representing a node as a single record whose state changes over time.) In particular, this is confusing, as it is not obvious why the "cloud_nodes" and "booted" sets should be disjoint:
    [...]

"booted" nodes are ones that have finished going through the setup process, but haven't appeared in the listing of cloud nodes yet (i.e., we're waiting for eventual consistency). "cloud_nodes" have appeared in a listing. Since this method fires on a timer, we don't know how far in the boot process it's gotten, so we need to look for it in both places.

I agree that the daemon has gotten hairier than I'd really like, and I'd like to have an excuse to clean it up. But self.booted was added in an earlier branch, and there's no reason to deal with it in this one.
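The lookup across both collections can be sketched as follows. This is an illustration only: `find_record` is a hypothetical helper, and plain dicts keyed by node ID stand in for the daemon's `cloud_nodes` and `booted` attributes.

```python
def find_record(node_id, cloud_nodes, booted):
    """Look for a node in the listed nodes first, then in the booted-but-unlisted ones.

    Because the check fires on a timer, the node may be in either dict
    depending on how far its boot has progressed.
    """
    for record_dict in [cloud_nodes, booted]:
        if node_id in record_dict:
            return record_dict[node_id]
    return None  # not found in either collection
```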

  • Arvados nodes get paired with cloud nodes based on IP address. Is it possible that a (reused) Arvados node record could have a stale IP address and get a bogus pairing because the compute node IP address gets reused?

No, because Node Manager clears the IP address and other fields before reusing the record. See ComputeNodeSetupActor.prepare_arvados_node.
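As a rough illustration of clearing a record before reuse (not the actual body of `prepare_arvados_node`), assuming the record is a plain dict. The field names below are assumptions; the comment above only says "the IP address and other fields".

```python
# Hypothetical field names, chosen for illustration only.
FIELDS_TO_CLEAR = ("ip_address", "hostname", "first_ping_at", "last_ping_at")


def cleared_for_reuse(record, fields=FIELDS_TO_CLEAR):
    """Return a copy of record with node-specific fields reset to None."""
    cleared = dict(record)
    for name in fields:
        if name in cleared:
            cleared[name] = None
    return cleared
```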

  • Is there a race condition if the node starts talking to Arvados after the "node is taking too long" shutdown is initiated?

Within Node Manager itself, no. It will stop the corresponding ComputeNodeMonitorActor when the node disappears from the cloud listing, regardless of the node's state, and not before. So if the node pairs with an Arvados node in a later update, the ComputeNodeMonitorActor will successfully receive that update, then be shut down later when its shutdown registers in the cloud node listing.

In the larger Arvados context, it's possible that Arvados (e.g., Crunch) will try to start working with the node in between the time it pairs and the time it's shut down, but I think Crunch has to be responsible for dealing with that kind of failure. Node Manager can't tell Arvados anything about the shutdown, because what creates this situation is that there's no meaningful record of the node in Arvados to talk about.

#12 Updated by Peter Amstutz almost 6 years ago

Brett Smith wrote:

No, because Node Manager clears the IP address and other fields before reusing the record. See ComputeNodeSetupActor.prepare_arvados_node.

I was actually thinking of stale records in the nodes table, but on further thought presumably the arvados_nodes list in NodeManager only includes the records where the last ping time is up to date.

The rest of it looks good to me.

#13 Updated by Brett Smith almost 6 years ago

Peter Amstutz wrote:

Brett Smith wrote:

No, because Node Manager clears the IP address and other fields before reusing the record. See ComputeNodeSetupActor.prepare_arvados_node.

I was actually thinking of stale records in the nodes table, but on further thought presumably the arvados_nodes list in NodeManager only includes the records where the last ping time is up to date.

It doesn't. You're right, there is a bug here. But it predates this branch and can happen even when nodes come up from sources outside of Node Manager's control. Created #4751 to track this.
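For reference, the freshness check Peter presumed could look roughly like this. It is a sketch under assumptions, not necessarily the fix adopted in #4751: the `last_ping_at` key, the `max_age` parameter, and the dict-shaped records are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def fresh_records(records, max_age, now=None):
    """Keep only node records whose last ping is within max_age of now.

    Records with no recorded ping are treated as stale.
    """
    now = now or datetime.now(timezone.utc)
    return [
        record
        for record in records
        if record.get("last_ping_at") is not None
        and now - record["last_ping_at"] <= max_age
    ]
```

Pairing by IP address would then consider only the returned records, so a stale record that happens to share a reused IP address is ignored.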

#14 Updated by Brett Smith almost 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:8141501a6ef0a3cf4f40da14671c31c0257472e4.
