Bug #7286

[Node Manager] Should recognize and shut down broken nodes

Added by Brett Smith over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Node Manager
Target version:
Start date:
09/10/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

"Broken" means the cloud provider asserts that it is broken, and one of the following is true:

  • The cloud node is unpaired, and at least boot_fail_after seconds old.
  • The cloud node is paired, and the associated Arvados record has status "missing".

Steps:

  • Add a method to node drivers that takes a cloud node record as an argument. It returns True if the record indicates the node is broken, False otherwise.
  • ComputeNodeMonitorActor suggests its node for shutdown if this new method returns True, and one of the conditions above is true.
  • Remove the shutdown_unpaired_node logic from NodeManagerDaemonActor, since the point above effectively moves it to the ComputeNodeMonitorActor.
  • Update the daemon's _nodes_wanted/_nodes_excess math so that we boot replacements for nodes in this failed state unless we're at max_nodes. You could do this by simply counting them in _nodes_busy, but please be careful: we don't want to accidentally revert e81a3e9b. The daemon should be able to affirmatively know that the node is completely failed.

Subtasks

Task #7409: Review 7286-nodeman-destroy-broken-nodesResolvedTom Clegg

Task #7460: Deploy arvados-node-manager-0.1.20151006131743 and python-apache-libcloud-0.18.1dev4 on c97qkResolvedNico C├ęsar

Associated revisions

Revision 44b7b4e1 (diff)
Added by Peter Amstutz over 4 years ago

Update pin of libcloud fork to dev4 refs #7286

Revision 44b7b4e1 (diff)
Added by Peter Amstutz over 4 years ago

Update pin of libcloud fork to dev4 refs #7286

Revision 05b52b29
Added by Peter Amstutz over 4 years ago

Merge branch '7286-nodeman-destroy-broken-nodes' closes #7286

Revision fb0f0a8e (diff)
Added by Tom Clegg over 4 years ago

Warn about unhandled case if broken node has no ping time. refs #7286

History

#1 Updated by Peter Amstutz over 4 years ago

We should take into account the node state reported by the cloud provider, specifically the Node.state field which contains the NodeState enum. A node in the "ERROR" state should be destroyed.

#2 Updated by Brett Smith over 4 years ago

  • Target version set to Arvados Future Sprints

#3 Updated by Brett Smith over 4 years ago

  • Subject changed from [Node Manager] Should shut down a node when it appears broken: to [Node Manager] Should recognize and shut down broken nodes
  • Description updated (diff)

#4 Updated by Peter Amstutz over 4 years ago

Only nodes reported as PENDING/RUNNING/REBOOTING state should be considered usable, nodes in UNKNOWN/STOPPED/SUSPENDED/PAUSED are unlikely to do useful work for us, only usable nodes should be considered when deciding whether the node request is satisfied, however all nodes regardless of state are considered in the calculation for maximum number of nodes.

#5 Updated by Peter Amstutz over 4 years ago

  • Status changed from New to In Progress
  • Assigned To set to Peter Amstutz

#6 Updated by Peter Amstutz over 4 years ago

  • Is arvados_node["status"] == "missing" a sufficient signal for determining whether the node should be counted as "up" or not? Unlike the decision to shut down the node, this seems fine, because if a node is out of contact but is doing useful work, it won't be affected; if a node is out of contact but idle, it may still be the right thing to do to bring up a new node, since the missing node node can't do work for us.
  • If a node is otherwise running but has "crunch_worker_state" of "down" should it eligible for shutdown?

#7 Updated by Brett Smith over 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-10-14 sprint

#8 Updated by Brett Smith over 4 years ago

  • Story points set to 1.0

#9 Updated by Tom Clegg over 4 years ago

Currently arvados_node_missing returns None (falsy) if the node has never pinged. If we ever get into a state where we have an arvados_node associated with a node, but it has never pinged, we won't even ask the cloud driver whether it's broken and we'll skip both of the shutdown tests. Perhaps we should do something like:
  • Return True instead of None from arvados_node_missing if the node has never pinged, or
  • change self.arvados_node is Noneself.arvados_node is None or arvados_node_missing(self.arvados_node, self.node_stale_after) is None in the "Node is unpaired" condition

It seems counterintuitive that the Azure driver understands cloud_types.NodeState.UNKNOWN to mean a node is broken. Is this documented? At face value I would expect UNKNOWN to indicate "can't tell you right now because of transient API failure" or something like that -- in that case treating it as "broken" could give us behavior like "once a node passes a certain ping-interval threshold, an unlucky moment's API failure will cause us to kill it" which could be hard to track down. But perhaps we already know we'll see "broken" reported as UNKNOWN? In that case perhaps a log message would be a good idea, in case we start getting UNKNOWN for other reasons?

The rest LGTM.

#10 Updated by Peter Amstutz over 4 years ago

Tom Clegg wrote:

Currently arvados_node_missing returns None (falsy) if the node has never pinged. If we ever get into a state where we have an arvados_node associated with a node, but it has never pinged, we won't even ask the cloud driver whether it's broken and we'll skip both of the shutdown tests. Perhaps we should do something like:
  • Return True instead of None from arvados_node_missing if the node has never pinged, or
  • change self.arvados_node is Noneself.arvados_node is None or arvados_node_missing(self.arvados_node, self.node_stale_after) is None in the "Node is unpaired" condition

Under the current design and assuming there are no bugs that result in outright false information, I don't think it is possible for an arvados node to be associated with a cloud node without a ping, because the ping is what actually establishes the association. Here is the logic that Brett and I agreed on for this ticket:

  1. If a cloud node hasn't pinged after some boot up period, destroy the node (the "missing" logic doesn't apply)
  2. If a cloud node is "missing" (pinged in the past but not recently), but cloud does not say it is "broken", leave it alone
  3. If a cloud node is "missing" and cloud does say it is "broken", destroy the node

If we want to further tweak this logic, we should take it up in another ticket.

It seems counterintuitive that the Azure driver understands cloud_types.NodeState.UNKNOWN to mean a node is broken. Is this documented? At face value I would expect UNKNOWN to indicate "can't tell you right now because of transient API failure" or something like that -- in that case treating it as "broken" could give us behavior like "once a node passes a certain ping-interval threshold, an unlucky moment's API failure will cause us to kill it" which could be hard to track down. But perhaps we already know we'll see "broken" reported as UNKNOWN? In that case perhaps a log message would be a good idea, in case we start getting UNKNOWN for other reasons?

libcloud doesn't provide much guidance on what any of the node states are supposed to mean, so the for the libcloud Azure ARM driver that I wrote, I used UNKNOWN to indicate that the node is in an unrecognized state (the actual Azure node state reported by the API is some undocumented combination of hardware, cloud, disk, agent and ??? provisioning status, so there an inherent problem is mapping an unknown matrix of finer grained states to a handful of coarsely defined libcloud states). Also, if there an API failure, it raises an API exception, it doesn't return a node record in "UNKNOWN" state.

I have updated the branch with comments relating to the above code sections.

#11 Updated by Tom Clegg over 4 years ago

Peter Amstutz wrote:

Under the current design and assuming there are no bugs that result in outright false information, I don't think it is possible for an arvados node to be associated with a cloud node without a ping, because the ping is what actually establishes the association.

OK. I asked because of these three lines, which seem to address specifically this possibility:

+    if arvados_node["last_ping_at"] is None:
+        return None
+    else:

This looks like "node was associated, but has never pinged" so I was surprised that we end up following the "node has pinged recently" path in such a case.

Here is the logic that Brett and I agreed on for this ticket:

  1. If a cloud node hasn't pinged after some boot up period, destroy the node (the "missing" logic doesn't apply)
  2. If a cloud node is "missing" (pinged in the past but not recently), but cloud does not say it is "broken", leave it alone
  3. If a cloud node is "missing" and cloud does say it is "broken", destroy the node

If we want to further tweak this logic, we should take it up in another ticket.

That all sounds fine to me.

Currently the "return None" path (which "shouldn't happen anyway" because currently we never associate a node without setting last_ping_at) exempts a node from all of these checks, right?

Fine if it's deliberate, I only asked because it looked like an oversight.

libcloud doesn't provide much guidance on what any of the node states are supposed to mean, so the for the libcloud Azure ARM driver that I wrote, I used UNKNOWN to indicate that the node is in an unrecognized state (the actual Azure node state reported by the API is some undocumented combination of hardware, cloud, disk, agent and ??? provisioning status, so there an inherent problem is mapping an unknown matrix of finer grained states to a handful of coarsely defined libcloud states). Also, if there an API failure, it raises an API exception, it doesn't return a node record in "UNKNOWN" state.

Indeed, I looked for hints about what states are supposed to mean, and what I found wasn't very illuminating. A quick look at other drivers suggests UNKNOWN is used for states like "crashed" and "shutting down", which seems compatible with the "not worth keeping alive" interpretation.

I have updated the branch with comments relating to the above code sections.

Thanks. The "impossible condition is handled as OK" question isn't a blocker, so this looks good to merge.

#12 Updated by Peter Amstutz over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:05b52b297b30d075ef2409a123f7d096c1156cf8.

Also available in: Atom PDF