[Node Manager] When canceling a SLURM shutdown, check state before resuming the node
Node Manager's SLURM dispatcher always tries to resume the node in SLURM when a shutdown is canceled. However, this request is only valid if the node is drained or failed. In other cases (for example, if the node is idle or alloc because it was never drained to begin with), issuing this request is invalid, and scontrol exits with status 1. This causes ComputeNodeShutdownActor to enter an infinite loop, trying repeatedly to resume a node that will never resume.
Check the node's current state (you can refactor code from await_slurm_drain), and only issue the resume request if that state is
#6 Updated by Brett Smith over 4 years ago
This state occurs if the node is allocated before the drain request goes through. This can happen if Node Manager simply loses the race with crunch-dispatch, or if something interferes with the drain request like #6321.
Once the node is allocated, it will no longer be eligible for shutdown, and Node Manager will try to cancel the pending node shutdown. The first step of that is resuming the node in SLURM—but that can't succeed if the node isn't already drained. So that request fails, and then Node Manager's state is bad.
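For illustration, here is a rough sketch of the check the description asks for. It is not the actual Node Manager code: the function names, the injectable `check_output` parameter, and the exact set of states treated as resumable are assumptions made for this example. It queries the node state with `sinfo` in compact form and only runs `scontrol update ... State=RESUME` when the state looks drained or failed.

```python
import re
import subprocess

# Assumption: compact sinfo state codes for which a resume request makes
# sense, based on the ticket's "drained or failed". Adjust as needed.
RESUMABLE_STATES = frozenset(['drain', 'drng', 'fail', 'failg'])

def get_slurm_state(nodename, check_output=subprocess.check_output):
    """Return the node's compact SLURM state, or '' if sinfo prints nothing."""
    out = check_output(['sinfo', '--noheader', '-o', '%t', '-n', nodename])
    if isinstance(out, bytes):
        out = out.decode()
    fields = out.split()
    if not fields:
        return ''
    # sinfo can append flag characters (such as '*') to the state; drop them.
    return re.sub(r'[^a-z]+$', '', fields[0])

def resume_if_needed(nodename, check_output=subprocess.check_output):
    """Resume the node only if its state is resumable. Return True if resumed."""
    if get_slurm_state(nodename, check_output) not in RESUMABLE_STATES:
        # Node is idle, alloc, etc.: a resume request would fail, so skip it.
        return False
    check_output(['scontrol', 'update', 'NodeName=' + nodename, 'State=RESUME'])
    return True
```

With something like this, canceling a shutdown for a node that was allocated before the drain took effect would skip the resume step instead of retrying a request that can never succeed.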