Story #8000

[Node Manager] Shut down nodes in SLURM 'down' state

Added by Peter Amstutz almost 5 years ago. Updated about 2 years ago.

Category: Node Manager

Apparently Node Manager only shuts down nodes that are "idle" in SLURM; nodes that are "down" don't get shut down?

2015-12-11_20:41:05.08909 2015-12-11 20:41:05 arvnodeman.cloud_nodes[11545] DEBUG: CloudNodeListMonitorActor (at 140548410010704) got response with 1 items
2015-12-11_20:41:05.09007 2015-12-11 20:41:05 arvnodeman.daemon[11545] INFO: Registering new cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk
2015-12-11_20:41:05.09273 2015-12-11 20:41:05 pykka[11545] DEBUG: Registered ComputeNodeMonitorActor (urn:uuid:83697dab-e718-4fd5-8595-b6563015585c)
2015-12-11_20:41:05.09280 2015-12-11 20:41:05 pykka[11545] DEBUG: Starting ComputeNodeMonitorActor (urn:uuid:83697dab-e718-4fd5-8595-b6563015585c)
2015-12-11_20:41:05.09391 2015-12-11 20:41:05 arvnodeman.computenode[11545] DEBUG: Node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk suggesting shutdown.
2015-12-11_20:41:05.09584 2015-12-11 20:41:05 arvnodeman.cloud_nodes[11545] DEBUG: <pykka.proxy._CallableProxy object at 0x7fd3f81b0850> subscribed to events for '/subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk'
2015-12-11_20:41:05.09804 2015-12-11 20:41:05 arvnodeman.daemon[11545] INFO: Cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk has associated with Arvados node c97qk-7ekkf-tj4hwdsw3yjiyjt
2015-12-11_20:41:05.09921 2015-12-11 20:41:05 arvnodeman.computenode[11545] DEBUG: Node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-tj4hwdsw3yjiyjt-c97qk shutdown window open but node busy.
2015-12-11_20:41:05.10064 2015-12-11 20:41:05 arvnodeman.arvados_nodes[11545] DEBUG: <pykka.proxy._CallableProxy object at 0x7fd3f8e11250> subscribed to events for 'c97qk-7ekkf-tj4hwdsw3yjiyjt'
$ arv node get -u c97qk-7ekkf-tj4hwdsw3yjiyjt
  "last_action":"Prepared by Node Manager",
compute*     up   infinite      2 drain* compute[2-3]
compute*     up   infinite    252  down* compute[1,4-14,16-255]
compute*     up   infinite      1   idle compute15
compute*     up   infinite      1   down compute0
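
To see at a glance which nodes SLURM reports in each state, the listing above can be grouped by state. This is only a rough sketch, not part of Node Manager: the helper names `expand_hostlist` and `nodes_by_state` are made up for illustration, the six-column layout is assumed from the output shown above, and only simple `prefix[a,b-c]` host lists without zero padding are handled. A trailing `*` on a state (node not responding) is stripped so that `down*` and `down` are counted together.

```python
import re


def expand_hostlist(expr):
    """Expand a simple host list such as 'compute[1,4-14]' into node names."""
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if match is None:
        return [expr]  # plain hostname, e.g. 'compute15'
    prefix, ranges = match.groups()
    names = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        # A single number like '1' is treated as the range 1-1.
        for n in range(int(lo), int(hi or lo) + 1):
            names.append(f"{prefix}{n}")
    return names


def nodes_by_state(sinfo_text):
    """Group node names by state, ignoring the trailing '*' marker."""
    states = {}
    for line in sinfo_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # not a partition/avail/timelimit/nodes/state/nodelist row
        state = fields[4].rstrip("*")
        states.setdefault(state, []).extend(expand_hostlist(fields[5]))
    return states


SINFO = """\
compute*     up   infinite      2 drain* compute[2-3]
compute*     up   infinite    252  down* compute[1,4-14,16-255]
compute*     up   infinite      1   idle compute15
compute*     up   infinite      1   down compute0
"""

by_state = nodes_by_state(SINFO)
for state, names in sorted(by_state.items()):
    print(state, len(names))
```

Most of the "down" entries in this listing are presumably placeholder node names with no cloud node behind them; the interesting case for this ticket is a "down" node that does have a running cloud node.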

Related issues

Related to Arvados - Bug #8799: [Node manager] nodes in slurm drained state are counted as "up" but not candidates for shut down (Resolved, 04/06/2016)

Related to Arvados - Bug #8953: [Node manager] can not shut down nodes anymore (Resolved, 04/13/2016)


#1 Updated by Peter Amstutz almost 5 years ago

  • Description updated (diff)

#2 Updated by Brett Smith almost 5 years ago

  • Subject changed from [NodeManager] shuts down 'idle' nodes but not 'down' nodes to [Node Manager] Does not shut down nodes in SLURM 'down' state
  • Category set to Node Manager

This was discussed, and was the desired behavior, at the time the code was written. The thinking then was that a node being down in SLURM may just mean there's a network issue, and plenty of jobs can do their compute without network access just fine, so it's better to leave the node up and try to let the work finish than to shut it down. An admin will intervene if necessary.

Since then:

  • Now that we have Node Manager, admins want to intervene less.
  • Nobody's said it in as many words, but I think we've shifted our philosophy about how to handle weird cases from "avoid doing anything that might interrupt compute work" to "get the cluster into a known-good state ASAP."
  • Given what I know about SLURM now, it's not clear to me that compute work can continue successfully even against transient network failures. It seems more likely that, in that case, SLURM will note the node failure and cancel the job allocation.

If all of that makes sense to everyone else, I agree we should change the behavior in this case.

#3 Updated by Tom Clegg almost 5 years ago

I'd say "slurm says node is down but everything will be fine if we're lucky" was somewhat true before we figured out that we needed to flatten the slurm node-communication tree.

#4 Updated by Brett Smith over 4 years ago

  • Target version set to Arvados Future Sprints

#5 Updated by Brett Smith over 4 years ago

  • Tracker changed from Bug to Story
  • Subject changed from [Node Manager] Does not shut down nodes in SLURM 'down' state to [Node Manager] Shut down nodes in SLURM 'down' state

#6 Updated by Peter Amstutz over 3 years ago

  • Status changed from New to Resolved

This was fixed in #8953 with the addition of an explicit state transition table.
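
For readers unfamiliar with the idea, an explicit state transition table might look roughly like the sketch below. This is not the table from #8953: the name `TRANSITIONS`, the function `should_shut_down`, the choice of keys, and every entry are assumptions made purely for illustration. The point is only that each combination of SLURM state and shutdown-window status gets a deliberate, visible decision, and anything unlisted falls back to leaving the node alone.

```python
# Hypothetical table, NOT the one added in #8953.
# Key: (SLURM state, is the shutdown window open?) -> shut the node down?
TRANSITIONS = {
    ("idle", True): True,
    ("idle", False): False,
    ("down", True): True,    # the behavior requested in this story
    ("down", False): False,  # assumption: still wait for the window
    ("alloc", True): False,  # busy nodes keep running
    ("alloc", False): False,
}


def should_shut_down(slurm_state, window_open):
    """Look up the decision; unknown combinations default to no shutdown."""
    # A trailing '*' only marks a non-responding node, so 'down*' is
    # treated the same as 'down' in this sketch.
    key = (slurm_state.rstrip("*"), window_open)
    return TRANSITIONS.get(key, False)


print(should_shut_down("down*", True))
print(should_shut_down("alloc", True))
```

Compared with scattered conditionals, a table like this makes a missing case such as "down" easy to spot in review.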

#7 Updated by Tom Morris about 2 years ago

  • Target version deleted (Arvados Future Sprints)
