Bug #6142

closed

[Node Manager] When canceling a SLURM shutdown, check state before resuming the node

Added by Brett Smith almost 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Story points: 0.5

Description

Node Manager's SLURM dispatcher always tries to resume the node in SLURM when a shutdown is canceled. However, this request is only valid if the node is drained or failed. In other cases (for example, if the node is idle or alloc because it was never drained to begin with), the request is invalid and scontrol exits with status 1. This causes ComputeNodeShutdownActor to enter an infinite loop, repeatedly trying to resume a node that will never resume.

Check the node's current state (you can refactor code from await_slurm_drain), and only issue the resume request if that state is drain or drng.
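
A minimal Python sketch of the proposed check, not the actual Node Manager code: the function names, the `check_output` parameter (added so the commands can be stubbed out), and the exact sinfo and scontrol invocations are assumptions made for illustration. It assumes `sinfo --noheader -o %t -n <node>` prints the node's compact state, possibly followed by a `*` flag when the node is not responding.

```python
import subprocess

# States in which a resume request is issued, per the description above.
RESUMABLE_STATES = frozenset(["drain", "drng"])


def get_slurm_state(nodename, check_output=subprocess.check_output):
    """Return the node's compact SLURM state string, e.g. 'idle' or 'drng'."""
    out = check_output(["sinfo", "--noheader", "-o", "%t", "-n", nodename])
    if isinstance(out, bytes):
        out = out.decode()
    # Drop the trailing newline and the '*' (not responding) flag, if any.
    return out.strip().rstrip("*")


def cancel_shutdown(nodename, check_output=subprocess.check_output):
    """Resume the node only if it is draining or drained.

    Returns True if a resume request was issued, False if it was skipped.
    """
    state = get_slurm_state(nodename, check_output)
    if state not in RESUMABLE_STATES:
        # Node was never drained (idle, alloc, ...): nothing to undo,
        # so skip the scontrol call that would exit nonzero.
        return False
    check_output(["scontrol", "update", "NodeName=" + nodename, "State=RESUME"])
    return True
```

Because the skip path returns without calling scontrol, the caller has no failed command to retry, which is what would break the loop described above.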


Subtasks: 1 (0 open, 1 closed)

Task #7415: Review 6142-cancel-slurm | Resolved | Peter Amstutz | 05/22/2015