Bug #6142

[Node Manager] When canceling a SLURM shutdown, check state before resuming the node

Added by Brett Smith about 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Start date: 05/22/2015
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 0.5

Description

Node Manager's SLURM dispatcher always tries to resume the node in SLURM when a shutdown is canceled. However, this request is only valid if the node is drained or failed. In other cases—for example, if the node is idle or alloc because it was never drained to begin with—issuing this request is invalid, and scontrol exits 1. This causes ComputeNodeShutdownActor to enter an infinite loop, trying repeatedly to resume a node that will never resume.

Check the node's current state (you can refactor code from await_slurm_drain), and only issue the resume request if that state is drain or drng.
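
A minimal sketch of the check described above, not the actual Node Manager code. The function names, the RESUMABLE_STATES constant, and the injectable `run` parameter are hypothetical and exist only for illustration. It assumes the node state is read with `sinfo` in compact form and that the resume is issued with `scontrol update`.

```python
import subprocess

# States in which this sketch issues a resume, taken from the ticket text.
RESUMABLE_STATES = frozenset(['drain', 'drng'])


def get_slurm_state(nodename, run=subprocess.check_output):
    """Return the node's compact SLURM state as reported by sinfo.

    `run` is a hypothetical injection point so the logic can be tested
    without a SLURM installation. State flag suffixes such as '*' are
    not handled here.
    """
    out = run(['sinfo', '--noheader', '-o', '%t', '-n', nodename])
    if isinstance(out, bytes):
        out = out.decode()
    return out.strip()


def resume_if_drained(nodename, run=subprocess.check_output):
    """Resume the node only if it is draining or drained.

    Returns True if a resume request was issued, False if the node was
    in some other state (for example idle or alloc) and was left alone.
    """
    state = get_slurm_state(nodename, run)
    if state not in RESUMABLE_STATES:
        # Resuming from this state is invalid; scontrol would exit 1.
        return False
    run(['scontrol', 'update', 'NodeName=' + nodename, 'State=RESUME'])
    return True
```

With a check like this, canceling a shutdown for a node that was never drained becomes a no-op instead of a request that fails every time it is retried.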


Subtasks

Task #7415: Review 6142-cancel-slurm (Resolved, Peter Amstutz)

Associated revisions

Revision 7c97dd88
Added by Peter Amstutz over 4 years ago

Merge branch '6142-cancel-slurm' closes #6142

History

#1 Updated by Radhika Chippada almost 5 years ago

  • Target version changed from Arvados Future Sprints to 2015-09-02 sprint

#2 Updated by Brett Smith almost 5 years ago

  • Target version changed from 2015-09-02 sprint to Arvados Future Sprints

#3 Updated by Brett Smith over 4 years ago

  • Story points set to 0.5

#4 Updated by Brett Smith over 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-10-14 sprint

#5 Updated by Peter Amstutz over 4 years ago

  • Assigned To set to Peter Amstutz

#6 Updated by Brett Smith over 4 years ago

This state occurs if the node is allocated before the drain request goes through. This can happen if Node Manager simply loses the race with crunch-dispatch, or if something interferes with the drain request, like #6321.

Once the node is allocated, it will no longer be eligible for shutdown, and Node Manager will try to cancel the pending node shutdown. The first step of that is resuming the node in SLURM, but that can't succeed if the node isn't already drained. So that request fails, and ComputeNodeShutdownActor keeps retrying a resume that can never succeed.

#7 Updated by Nico César over 4 years ago

review @ 2a94b125b93a3aba204f55c37ecdc2876d81d642

I looked at the code and ran the tests (all passed). This is new code for me, but with my current knowledge I don't see any major problem here.

LGTM

#8 Updated by Peter Amstutz over 4 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:7c97dd88e541a0245272b8e93a33e4d2fe4e32cd.
