Bug #7286

closed

[Node Manager] Should recognize and shut down broken nodes

Added by Brett Smith almost 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Start date: 09/10/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0

Description

"Broken" means the cloud provider asserts that it is broken, and one of the following is true:

  • The cloud node is unpaired, and at least boot_fail_after seconds old.
  • The cloud node is paired, and the associated Arvados record has status "missing".
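As a rough illustration of the definition above, the check could be written as a single predicate. This is only a sketch: the function name, its parameters, and the way the Arvados status is passed in are assumptions made for the example, not Node Manager's actual interface.

```python
def node_is_broken(provider_says_broken, paired, age_seconds,
                   boot_fail_after, arvados_status=None):
    """Return True if a node counts as "broken" under the definition above.

    All names here are hypothetical. `arvados_status` is the status of the
    paired Arvados record, or None when the cloud node is unpaired.
    """
    # The cloud provider must assert that the node is broken in every case.
    if not provider_says_broken:
        return False
    if not paired:
        # Unpaired: only broken once it is at least boot_fail_after seconds old.
        return age_seconds >= boot_fail_after
    # Paired: the associated Arvados record must have status "missing".
    return arvados_status == "missing"
```

Note that the provider's assertion alone is never sufficient: a young unpaired node, or a paired node whose Arvados record is not "missing", is left alone.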

Steps:

  • Add a method to node drivers that takes a cloud node record as an argument. It returns True if the record indicates the node is broken, False otherwise.
  • ComputeNodeMonitorActor suggests its node for shutdown if this new method returns True, and one of the conditions above is true.
  • Remove the shutdown_unpaired_node logic from NodeManagerDaemonActor, since the point above effectively moves it to the ComputeNodeMonitorActor.
  • Update the daemon's _nodes_wanted/_nodes_excess math so that we boot replacements for nodes in this failed state unless we're at max_nodes. You could do this by simply counting them in _nodes_busy, but please be careful: we don't want to accidentally revert e81a3e9b. The daemon should be able to affirmatively know that the node is completely failed.
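One possible shape for the first and last steps is sketched below. It is not the daemon's real _nodes_wanted/_nodes_excess code: CloudNode, ExampleDriver, BROKEN_STATES, and replacements_to_boot are hypothetical names, and the "error" state string is an assumption about what a provider might report. For brevity it counts every node the driver reports as broken, where the real daemon would also need to confirm one of the two conditions from the description.

```python
from dataclasses import dataclass


@dataclass
class CloudNode:
    # Hypothetical stand-in for a cloud node record.
    name: str
    state: str


class ExampleDriver:
    # Assumption: the provider reports broken nodes with an "error" state.
    BROKEN_STATES = frozenset(["error"])

    def broken(self, cloud_node):
        """Return True if the record indicates the node is broken."""
        return cloud_node.state in self.BROKEN_STATES


def replacements_to_boot(cloud_nodes, driver, nodes_needed, max_nodes):
    """Count how many new nodes to boot, treating broken nodes as unusable."""
    broken = sum(1 for node in cloud_nodes if driver.broken(node))
    usable = len(cloud_nodes) - broken
    shortfall = max(0, nodes_needed - usable)
    # Broken nodes still occupy slots until they are shut down, so the room
    # under max_nodes is computed from the total node count.
    room = max(0, max_nodes - len(cloud_nodes))
    return min(shortfall, room)
```

For example, with one running node and one broken node, a need for two nodes, and max_nodes of 4, this boots one replacement; with max_nodes of 2 it boots none until the broken node is gone.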

Subtasks 2 (0 open, 2 closed)

Task #7409: Review 7286-nodeman-destroy-broken-nodes (Resolved, Tom Clegg, 09/10/2015)

Task #7460: Deploy arvados-node-manager-0.1.20151006131743 and python-apache-libcloud-0.18.1dev4 on c97qk (Resolved, Nico César, 09/10/2015)
