Bug #6142

closed

[Node Manager] When canceling a SLURM shutdown, check state before resuming the node

Added by Brett Smith almost 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Story points: 0.5

Description

Node Manager's SLURM dispatcher always tries to resume the node in SLURM when a shutdown is canceled. However, this request is only valid if the node is drained or failed. In other cases (for example, if the node is idle or alloc because it was never drained to begin with), issuing this request is invalid, and scontrol exits with status 1. This causes ComputeNodeShutdownActor to enter an infinite loop, repeatedly trying to resume a node that will never resume.

Check the node's current state (you can refactor code from await_slurm_drain), and only issue the resume request if that state is drain or drng.
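
A minimal sketch of the proposed check, not the code that was merged. The function names (get_slurm_state, resume_if_drained) and the injectable run parameter are hypothetical and exist only for illustration. It assumes the node state can be read with sinfo's compact state format (%t) and that the resume is issued with scontrol update ... State=RESUME.

```python
import subprocess

# States in which a resume request makes sense, per the description above.
RESUMABLE_STATES = frozenset(["drain", "drng"])


def get_slurm_state(nodename, run=subprocess.check_output):
    """Return the compact SLURM state string for one node (hypothetical helper)."""
    out = run(["sinfo", "--noheader", "-o", "%t", "-n", nodename])
    if isinstance(out, bytes):
        out = out.decode()
    return out.strip()


def resume_if_drained(nodename, run=subprocess.check_output):
    """Resume the node only if it is drained or draining.

    Returns True if a resume request was issued, False if it was skipped.
    """
    if get_slurm_state(nodename, run) not in RESUMABLE_STATES:
        # Node was never drained (e.g. idle or alloc): nothing to undo.
        return False
    run(["scontrol", "update", "NodeName=" + nodename, "State=RESUME"])
    return True
```

With this shape, canceling a shutdown on a node that is idle or alloc becomes a no-op instead of a failing scontrol call, so the cancel path has nothing to retry.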


Subtasks 1 (0 open, 1 closed)

Task #7415: Review 6142-cancel-slurm (Resolved, Peter Amstutz, 05/22/2015)
Actions #1

Updated by Radhika Chippada over 8 years ago

  • Target version changed from Arvados Future Sprints to 2015-09-02 sprint
Actions #2

Updated by Brett Smith over 8 years ago

  • Target version changed from 2015-09-02 sprint to Arvados Future Sprints
Actions #3

Updated by Brett Smith over 8 years ago

  • Story points set to 0.5
Actions #4

Updated by Brett Smith over 8 years ago

  • Target version changed from Arvados Future Sprints to 2015-10-14 sprint
Actions #5

Updated by Peter Amstutz over 8 years ago

  • Assigned To set to Peter Amstutz
Actions #6

Updated by Brett Smith over 8 years ago

This state occurs if the node is allocated before the drain request goes through. This can happen if Node Manager simply loses the race with crunch-dispatch, or if something interferes with the drain request, like #6321.

Once the node is allocated, it will no longer be eligible for shutdown, and Node Manager will try to cancel the pending node shutdown. The first step of that is resuming the node in SLURM, but that can't succeed if the node isn't already drained. So that request fails, and Node Manager is left in a bad state.

Actions #7

Updated by Nico César over 8 years ago

review @ 2a94b125b93a3aba204f55c37ecdc2876d81d642

I looked at the code and ran the tests (all passed). This is new code for me, but with my current knowledge I don't see any major problems here.

LGTM

Actions #8

Updated by Peter Amstutz over 8 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:7c97dd88e541a0245272b8e93a33e4d2fe4e32cd.
