Arvados - Idea #8000: [Node Manager] Shut down nodes in SLURM 'down' state
https://dev.arvados.org/issues/8000

Updated by Peter Amstutz (peter.amstutz@curii.com) - 2015-12-11 21:02 UTC
* Description updated

Updated by Brett Smith (brett.smith@curii.com) - 2015-12-11 22:52 UTC
* Subject changed from "[NodeManager] shuts down 'idle' nodes but not 'down' nodes" to "[Node Manager] Does not shut down nodes in SLURM 'down' state"
* Category set to "Node Manager"

This was discussed and was the desired behavior at the time the code was written. The thinking then was that a node being down in SLURM may just mean there's a network issue, and plenty of jobs can do their compute work just fine without network access, so it's better to leave the node up and try to let the work finish than to shut it down. An admin will intervene if necessary.
Since then:
* Now that we have Node Manager, admins want to intervene less.
* Nobody's said it in so many words, but I think we've shifted our philosophy about how to handle weird cases from "avoid doing anything that might interrupt compute work" to "get the cluster into a known-good state ASAP."
* Given what I know about SLURM now, it's not clear to me that compute work can continue successfully across transient network failures. It seems more likely that, in that case, SLURM will note the node failure and cancel the job allocation.
If all of that makes sense to everyone else, I agree we should change the behavior in this case.
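As a rough illustration, here is a minimal Python sketch of the proposed policy, assuming a sinfo-based state check; the names and state set are hypothetical, not Node Manager's actual code:

```python
import subprocess

# States in which a compute node could be considered safe to shut down.
# Including 'down' here is the behavior change proposed in this issue.
SHUTDOWN_OK_STATES = {'idle', 'down', 'drain', 'fail'}

def slurm_state(nodename):
    """Return SLURM's compact state for one node, e.g. 'idle', 'alloc', 'down*'."""
    out = subprocess.check_output(
        ['sinfo', '--noheader', '-o', '%t', '-n', nodename])
    return out.decode().strip()

def eligible_for_shutdown(nodename):
    # SLURM appends '*' to the state of a node that is not responding.
    return slurm_state(nodename).rstrip('*') in SHUTDOWN_OK_STATES
```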
Updated by Tom Clegg (tom@curii.com) - 2015-12-14 21:39 UTC

I'd say "slurm says node is down but everything will be fine if we're lucky" was somewhat true before we figured out that we needed to flatten the slurm node-communication tree.
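For reference, that flattening is controlled by SLURM's TreeWidth parameter in slurm.conf; a hypothetical helper, not from this codebase, that checks whether a given config flattens the tree:

```python
import re

def tree_is_flat(slurm_conf_path, node_count):
    """True if TreeWidth is at least the node count, i.e. slurmctld talks to
    every slurmd directly instead of relaying through a communication tree."""
    with open(slurm_conf_path) as conf:
        for line in conf:
            m = re.match(r'\s*TreeWidth\s*=\s*(\d+)', line, re.IGNORECASE)
            if m:
                return int(m.group(1)) >= node_count
    return node_count <= 50  # TreeWidth defaults to 50 when unset
```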
Updated by Brett Smith (brett.smith@curii.com) - 2016-01-05 14:23 UTC

* Target version set to "Arvados Future Sprints"

Updated by Brett Smith (brett.smith@curii.com) - 2016-01-05 14:25 UTC
* Tracker changed from "Bug" to "Idea"
* Subject changed from "[Node Manager] Does not shut down nodes in SLURM 'down' state" to "[Node Manager] Shut down nodes in SLURM 'down' state"

Updated by Peter Amstutz (peter.amstutz@curii.com) - 2017-05-05 12:24 UTC
* Status changed from "New" to "Resolved"

This was fixed in #8953 ("[Node manager] can not shut down nodes anymore", https://dev.arvados.org/issues/8953) with the addition of an explicit state transition table.
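As an illustration of the idea, a minimal sketch of such a table, with hypothetical state and action names rather than the code actually merged for #8953:

```python
# Map (cloud node state, SLURM node state) -> action the daemon should take.
# Making every combination explicit keeps 'down' nodes from falling through
# the cracks of ad-hoc if/else logic.
TRANSITIONS = {
    ('running', 'idle'):  'consider_shutdown',
    ('running', 'alloc'): 'keep',
    ('running', 'down'):  'shutdown',   # the case this issue asked for
    ('booting', 'down'):  'wait',       # node may not have joined SLURM yet
}

def next_action(cloud_state, slurm_state):
    # Unlisted combinations default to doing nothing destructive.
    return TRANSITIONS.get((cloud_state, slurm_state), 'keep')
```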
Updated by Tom Morris (tfmorris@veritasgenetics.com) - 2018-09-12 16:59 UTC

* Target version deleted ("Arvados Future Sprints")