https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422015-09-10T18:47:31ZArvadosArvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=300732015-09-10T18:47:31ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>We should take into account the node state reported by the cloud provider, specifically the Node.state field which contains the NodeState enum. A node in the "ERROR" state should be destroyed.</p> Arvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=302982015-09-16T15:30:43ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Target version</strong> set to <i>Arvados Future Sprints</i></li></ul> Arvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=305962015-09-25T19:52:02ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Subject</strong> changed from <i>[Node Manager] Should shut down a node when it appears broken:</i> to <i>[Node Manager] Should recognize and shut down broken nodes</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/30596/diff?detail_id=29957">diff</a>)</li></ul> Arvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=305972015-09-25T19:56:24ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Only nodes reported as PENDING/RUNNING/REBOOTING state should be considered usable, nodes in UNKNOWN/STOPPED/SUSPENDED/PAUSED are unlikely to do useful work for us, only usable nodes should be considered when deciding whether the node request is satisfied, however all nodes regardless of state are considered in the calculation for maximum number of nodes.</p> Arvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=306062015-09-28T14:14:26ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Assigned To</strong> set to <i>Peter Amstutz</i></li></ul> Arvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=306202015-09-29T14:15:18ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><ul>
<li>Is <code>arvados_node["status"] == "missing"</code> a sufficient signal for determining whether the node should be counted as "up" or not? Unlike the decision to shut down the node, this seems fine, because if a node is out of contact but is doing useful work, it won't be affected; if a node is out of contact but idle, it may still be the right thing to do to bring up a new node, since the missing node node can't do work for us.</li>
<li>If a node is otherwise running but has "crunch_worker_state" of "down" should it eligible for shutdown?</li>
</ul> Arvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=307172015-09-30T18:15:39ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Target version</strong> changed from <i>Arvados Future Sprints</i> to <i>2015-10-14 sprint</i></li></ul> Arvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=307362015-09-30T19:34:00ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Story points</strong> set to <i>1.0</i></li></ul> Arvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=310062015-10-05T20:40:35ZTom Cleggtom@curii.com
<ul></ul>Currently arvados_node_missing returns None (falsy) if the node has never pinged. If we ever get into a state where we have an arvados_node associated with a node, but it has never pinged, we won't even ask the cloud driver whether it's broken and we'll skip both of the shutdown tests. Perhaps we should do something like:
<ul>
<li>Return True instead of None from arvados_node_missing if the node has never pinged, or</li>
<li>change <code>self.arvados_node is None</code> → <code>self.arvados_node is None or arvados_node_missing(self.arvados_node, self.node_stale_after) is None</code> in the "Node is unpaired" condition</li>
</ul>
<p>It seems counterintuitive that the Azure driver understands <code>cloud_types.NodeState.UNKNOWN</code> to mean a node is broken. Is this documented? At face value I would expect UNKNOWN to indicate "can't tell you right now because of transient API failure" or something like that -- in that case treating it as "broken" could give us behavior like "once a node passes a certain ping-interval threshold, an unlucky moment's API failure will cause us to kill it" which could be hard to track down. But perhaps we already know we'll see "broken" reported as UNKNOWN? In that case perhaps a log message would be a good idea, in case we start getting UNKNOWN for other reasons?</p>
<p>The rest LGTM.</p> Arvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=310182015-10-06T01:53:31ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
Currently arvados_node_missing returns None (falsy) if the node has never pinged. If we ever get into a state where we have an arvados_node associated with a node, but it has never pinged, we won't even ask the cloud driver whether it's broken and we'll skip both of the shutdown tests. Perhaps we should do something like:
<ul>
<li>Return True instead of None from arvados_node_missing if the node has never pinged, or</li>
<li>change <code>self.arvados_node is None</code> → <code>self.arvados_node is None or arvados_node_missing(self.arvados_node, self.node_stale_after) is None</code> in the "Node is unpaired" condition</li>
</ul>
</blockquote>
<p>Under the current design and assuming there are no bugs that result in outright false information, I don't think it is possible for an arvados node to be associated with a cloud node without a ping, because the ping is what actually establishes the association. Here is the logic that Brett and I agreed on for this ticket:</p>
<ol>
<li>If a cloud node hasn't pinged after some boot up period, destroy the node (the "missing" logic doesn't apply)</li>
<li>If a cloud node is "missing" (pinged in the past but not recently), but cloud does not say it is "broken", leave it alone</li>
<li>If a cloud node is "missing" and cloud <strong>does</strong> say it is "broken", destroy the node</li>
</ol>
<p>If we want to further tweak this logic, we should take it up in another ticket.</p>
<blockquote>
<p>It seems counterintuitive that the Azure driver understands <code>cloud_types.NodeState.UNKNOWN</code> to mean a node is broken. Is this documented? At face value I would expect UNKNOWN to indicate "can't tell you right now because of transient API failure" or something like that -- in that case treating it as "broken" could give us behavior like "once a node passes a certain ping-interval threshold, an unlucky moment's API failure will cause us to kill it" which could be hard to track down. But perhaps we already know we'll see "broken" reported as UNKNOWN? In that case perhaps a log message would be a good idea, in case we start getting UNKNOWN for other reasons?</p>
</blockquote>
<p>libcloud doesn't provide much guidance on what any of the node states are supposed to mean, so the for the libcloud Azure ARM driver that I wrote, I used UNKNOWN to indicate that the node is in an unrecognized state (the actual Azure node state reported by the API is some undocumented combination of hardware, cloud, disk, agent and ??? provisioning status, so there an inherent problem is mapping an unknown matrix of finer grained states to a handful of coarsely defined libcloud states). Also, if there an API failure, it raises an API exception, it doesn't return a node record in "UNKNOWN" state.</p>
<p>I have updated the branch with comments relating to the above code sections.</p> Arvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=310242015-10-06T04:04:28ZTom Cleggtom@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<p>Under the current design and assuming there are no bugs that result in outright false information, I don't think it is possible for an arvados node to be associated with a cloud node without a ping, because the ping is what actually establishes the association.</p>
</blockquote>
<p>OK. I asked because of these three lines, which seem to address specifically this possibility:</p>
<pre><code class="diff syntaxhl"><span class="gi">+ if arvados_node["last_ping_at"] is None:
+ return None
+ else:
</span></code></pre>
<p>This looks like "node was associated, but has never pinged" so I was surprised that we end up following the "node has pinged recently" path in such a case.</p>
<blockquote>
<p>Here is the logic that Brett and I agreed on for this ticket:</p>
<ol>
<li>If a cloud node hasn't pinged after some boot up period, destroy the node (the "missing" logic doesn't apply)</li>
<li>If a cloud node is "missing" (pinged in the past but not recently), but cloud does not say it is "broken", leave it alone</li>
<li>If a cloud node is "missing" and cloud <strong>does</strong> say it is "broken", destroy the node</li>
</ol>
<p>If we want to further tweak this logic, we should take it up in another ticket.</p>
</blockquote>
<p>That all sounds fine to me.</p>
<p>Currently the "return None" path (which "shouldn't happen anyway" because currently we never associate a node without setting last_ping_at) exempts a node from all of these checks, right?</p>
<p>Fine if it's deliberate, I only asked because it looked like an oversight.</p>
<blockquote>
<p>libcloud doesn't provide much guidance on what any of the node states are supposed to mean, so the for the libcloud Azure ARM driver that I wrote, I used UNKNOWN to indicate that the node is in an unrecognized state (the actual Azure node state reported by the API is some undocumented combination of hardware, cloud, disk, agent and ??? provisioning status, so there an inherent problem is mapping an unknown matrix of finer grained states to a handful of coarsely defined libcloud states). Also, if there an API failure, it raises an API exception, it doesn't return a node record in "UNKNOWN" state.</p>
</blockquote>
<p>Indeed, I looked for hints about what states are supposed to mean, and what I found wasn't very illuminating. A quick look at other drivers suggests UNKNOWN is used for states like "crashed" and "shutting down", which seems compatible with the "not worth keeping alive" interpretation.</p>
<blockquote>
<p>I have updated the branch with comments relating to the above code sections.</p>
</blockquote>
<p>Thanks. The "impossible condition is handled as OK" question isn't a blocker, so this looks good to merge.</p> Arvados - Bug #7286: [Node Manager] Should recognize and shut down broken nodeshttps://dev.arvados.org/issues/7286?journal_id=310262015-10-06T13:20:10ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>Applied in changeset arvados|commit:05b52b297b30d075ef2409a123f7d096c1156cf8.</p>