Bug #13804


[Node manager] shutting down nodes when there is a wishlist for same size nodes

Added by Peter Amstutz over 5 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Story points:
-

Description

There seems to be unnecessary churn in Node Manager, with nodes being shut down and new nodes booted to replace them even when there is a wishlist for a particular node size.

Nico is reporting that there are nodes in the draining state that are still in the middle of running their jobs, presumably due to lag between the node being reported as "idle", a job being scheduled on the node, and node manager initiating shutdown (which first puts the node into "draining").

Two ideas to reduce churn:

1) Add an "idle grace period" so that nodes have to be idle for a few minutes before they enter the shutdown state. A variant of this would be a rule like "the node must be reported as idle 2-3 times in a row" before it becomes eligible for shutdown.

2) After a node enters the "drain" state (meaning it isn't running anything, but won't take any new work), check back with the Daemon actor to see if nodes are wanted for that node size, and if so, cancel shutdown.
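
A rough sketch of how the two ideas could fit together (hypothetical only: names like IDLE_GRACE_PERIOD, node.idle_since, and daemon.wanted_nodes_of_size are illustrative, not the actual ComputeNodeMonitorActor API):

import time

IDLE_GRACE_PERIOD = 180  # seconds a node must stay idle before shutdown is considered

def shutdown_eligible(node, daemon, size):
    # Idea 1: require the node to have been idle for a grace period
    # (equivalently, idle across several consecutive status reports).
    if node.state != 'idle':
        node.idle_since = None
        return False
    if node.idle_since is None:
        node.idle_since = time.time()
    if time.time() - node.idle_since < IDLE_GRACE_PERIOD:
        return False
    # Idea 2: before committing to a drain/shutdown, ask the daemon whether
    # the wishlist still wants nodes of this size; if so, keep the node.
    if daemon.wanted_nodes_of_size(size) > 0:
        return False
    return True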


Files

13804-a-n-m.log.gz (1.78 MB) Nico César, 07/12/2018 08:39 PM
Actions #1

Updated by Peter Amstutz over 5 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz over 5 years ago

  • Description updated (diff)
  • Status changed from In Progress to New
Actions #3

Updated by Nico César over 5 years ago

this is a node detected as down:

Jul 12 19:46:06 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:46:06 ComputeNodeMonitorActor.9fef3c32ce18.compute-ncummtanmzgyemi-e51c5[7018] DEBUG: Not eligible for shut down because node state is ('busy', 'open', 'boot exceeded', 'idle exceeded')
Jul 12 19:48:06 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:48:06 ComputeNodeMonitorActor.9fef3c32ce18.compute-ncummtanmzgyemi-e51c5[7018] DEBUG: Not eligible for shut down because node state is ('busy', 'open', 'boot exceeded', 'idle exceeded')
Jul 12 19:48:06 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:48:06 ComputeNodeMonitorActor.9fef3c32ce18.compute-ncummtanmzgyemi-e51c5[7018] DEBUG: Not eligible for shut down because node state is ('busy', 'open', 'boot exceeded', 'idle exceeded')
Jul 12 19:50:06 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:50:06 ComputeNodeMonitorActor.9fef3c32ce18.compute-ncummtanmzgyemi-e51c5[7018] DEBUG: Not eligible for shut down because node state is ('busy', 'open', 'boot exceeded', 'idle exceeded')
Jul 12 19:50:06 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:50:06 ComputeNodeMonitorActor.9fef3c32ce18.compute-ncummtanmzgyemi-e51c5[7018] DEBUG: Not eligible for shut down because node state is ('busy', 'open', 'boot exceeded', 'idle exceeded')
Jul 12 19:52:06 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:52:06 ComputeNodeMonitorActor.9fef3c32ce18.compute-ncummtanmzgyemi-e51c5[7018] DEBUG: Not eligible for shut down because node state is ('busy', 'open', 'boot exceeded', 'idle exceeded')
Jul 12 19:52:06 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:52:06 ComputeNodeMonitorActor.9fef3c32ce18.compute-ncummtanmzgyemi-e51c5[7018] DEBUG: Not eligible for shut down because node state is ('busy', 'open', 'boot exceeded', 'idle exceeded')
Jul 12 19:54:06 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:54:06 ComputeNodeMonitorActor.9fef3c32ce18.compute-ncummtanmzgyemi-e51c5[7018] DEBUG: Suggesting shutdown because node state is ('idle', 'open', 'boot exceeded', 'idle exceeded')
Jul 12 19:54:06 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:54:06 ComputeNodeMonitorActor.9fef3c32ce18.compute-ncummtanmzgyemi-e51c5[7018] DEBUG: Suggesting shutdown because node state is ('idle', 'open', 'boot exceeded', 'idle exceeded')
Jul 12 19:54:08 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:54:08 ComputeNodeShutdownActor.3fd78484ae09.compute-ncummtanmzgyemi-e51c5[7018] INFO: Draining SLURM node compute48
Jul 12 19:54:08 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:54:08 ComputeNodeShutdownActor.3fd78484ae09.compute-ncummtanmzgyemi-e51c5[7018] INFO: Waiting for SLURM node compute48 to drain
Jul 12 19:54:08 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:54:08 ComputeNodeShutdownActor.3fd78484ae09.compute-ncummtanmzgyemi-e51c5[7018] INFO: Checking that node is still eligible for shutdown
Jul 12 19:54:08 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:54:08 ComputeNodeShutdownActor.3fd78484ae09.compute-ncummtanmzgyemi-e51c5[7018] INFO: Starting shutdown
Jul 12 19:54:09 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:54:09 requests.packages.urllib3.connectionpool[7018] DEBUG: "DELETE /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Compute/virtualMachines/compute-ncummtanmzgyemi-e51c5?api-version=2015-06-15 HTTP/1.1" 202 0
Jul 12 19:54:19 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:54:19 requests.packages.urllib3.connectionpool[7018] DEBUG: "GET /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Compute/virtualMachines/compute-ncummtanmzgyemi-e51c5?api-version=2016-04-30-preview HTTP/1.1" 200 None
Jul 12 19:54:29 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:54:29 requests.packages.urllib3.connectionpool[7018] DEBUG: "GET /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Compute/virtualMachines/compute-ncummtanmzgyemi-e51c5?api-version=2016-04-30-preview HTTP/1.1" 200 None
Jul 12 19:54:39 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:54:39 requests.packages.urllib3.connectionpool[7018] DEBUG: "GET /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Compute/virtualMachines/compute-ncummtanmzgyemi-e51c5?api-version=2016-04-30-preview HTTP/1.1" 200 None
Jul 12 19:54:51 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:54:51 requests.packages.urllib3.connectionpool[7018] DEBUG: "GET /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Compute/virtualMachines/compute-ncummtanmzgyemi-e51c5?api-version=2016-04-30-preview HTTP/1.1" 200 None
Jul 12 19:55:03 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:55:03 requests.packages.urllib3.connectionpool[7018] DEBUG: "GET /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Compute/virtualMachines/compute-ncummtanmzgyemi-e51c5?api-version=2016-04-30-preview HTTP/1.1" 200 None
Jul 12 19:55:15 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:55:15 requests.packages.urllib3.connectionpool[7018] DEBUG: "GET /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Compute/virtualMachines/compute-ncummtanmzgyemi-e51c5?api-version=2016-04-30-preview HTTP/1.1" 200 None
Jul 12 19:55:27 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:55:27 requests.packages.urllib3.connectionpool[7018] DEBUG: "GET /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Compute/virtualMachines/compute-ncummtanmzgyemi-e51c5?api-version=2016-04-30-preview HTTP/1.1" 200 None
Jul 12 19:55:39 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:55:39 requests.packages.urllib3.connectionpool[7018] DEBUG: "GET /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Compute/virtualMachines/compute-ncummtanmzgyemi-e51c5?api-version=2016-04-30-preview HTTP/1.1" 200 None
Jul 12 19:55:51 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:55:51 requests.packages.urllib3.connectionpool[7018] DEBUG: "GET /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Compute/virtualMachines/compute-ncummtanmzgyemi-e51c5?api-version=2016-04-30-preview HTTP/1.1" 200 None
Jul 12 19:56:03 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:03 requests.packages.urllib3.connectionpool[7018] DEBUG: "GET /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Compute/virtualMachines/compute-ncummtanmzgyemi-e51c5?api-version=2016-04-30-preview HTTP/1.1" 200 None
Jul 12 19:56:04 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:04 requests.packages.urllib3.connectionpool[7018] DEBUG: "DELETE /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Network/networkInterfaces/compute-ncummtanmzgyemi-e51c5-nic?api-version=2015-06-15 HTTP/1.1" 400 428
Jul 12 19:56:05 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:05 ComputeNodeMonitorActor.9fef3c32ce18.compute-ncummtanmzgyemi-e51c5[7018] DEBUG: Suggesting shutdown because node state is ('down', 'open', 'boot exceeded', 'idle exceeded')
Jul 12 19:56:06 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:06 ComputeNodeMonitorActor.9fef3c32ce18.compute-ncummtanmzgyemi-e51c5[7018] DEBUG: Suggesting shutdown because node state is ('down', 'open', 'boot exceeded', 'idle exceeded')
Jul 12 19:56:14 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:14 requests.packages.urllib3.connectionpool[7018] DEBUG: "DELETE /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Network/networkInterfaces/compute-ncummtanmzgyemi-e51c5-nic?api-version=2015-06-15 HTTP/1.1" 400 428
Jul 12 19:56:24 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:24 requests.packages.urllib3.connectionpool[7018] DEBUG: "DELETE /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Network/networkInterfaces/compute-ncummtanmzgyemi-e51c5-nic?api-version=2015-06-15 HTTP/1.1" 400 428
Jul 12 19:56:34 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:34 requests.packages.urllib3.connectionpool[7018] DEBUG: "DELETE /subscriptions/3f5c64d4-b5bb-488b-89d1-f44cf7f786c8/resourceGroups/e51c5/providers/Microsoft.Network/networkInterfaces/compute-ncummtanmzgyemi-e51c5-nic?api-version=2015-06-15 HTTP/1.1" 202 0
Jul 12 19:56:34 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:34 requests.packages.urllib3.connectionpool[7018] DEBUG: "HEAD /vhds/compute-ncummtanmzgyemi-e51c5-os_0.vhd HTTP/1.1" 200 0
Jul 12 19:56:34 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:34 requests.packages.urllib3.connectionpool[7018] DEBUG: "DELETE /vhds/compute-ncummtanmzgyemi-e51c5-os_0.vhd HTTP/1.1" 412 267
Jul 12 19:56:54 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:54 requests.packages.urllib3.connectionpool[7018] DEBUG: "HEAD /vhds/compute-ncummtanmzgyemi-e51c5-os_0.vhd HTTP/1.1" 200 0
Jul 12 19:56:54 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:54 requests.packages.urllib3.connectionpool[7018] DEBUG: "DELETE /vhds/compute-ncummtanmzgyemi-e51c5-os_0.vhd HTTP/1.1" 202 None
Jul 12 19:56:54 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:54 ComputeNodeShutdownActor.3fd78484ae09.compute-ncummtanmzgyemi-e51c5[7018] INFO: Shutdown success
Jul 12 19:56:55 manage.e51c5.arvadosapi.com env[7016]: 2018-07-12 19:56:55 ComputeNodeShutdownActor.3fd78484ae09.compute-ncummtanmzgyemi-e51c5[7018] INFO: finished
Actions #4

Updated by Peter Amstutz over 5 years ago

Nico César wrote:

this is a node detected as down:

[...]

I think what happened here is that the node was detected as "idle" and shutdown started, but the monitor actor continued to track the node state. Later the node was put in "drain"; the state mapping reports "drain" as "down", so it suggested shutdown again because it doesn't know a shutdown is already in progress. So this is harmless.

Actions #5

Updated by Peter Amstutz over 5 years ago

  • Description updated (diff)
Actions #7

Updated by Peter Amstutz over 5 years ago

Another thought: this problem might be a side effect of the "exact node size" feature. When a new node comes up, it does a slurm "reset", which causes slurm to forget all of the dynamically assigned "node features". As a result, idle nodes can't get work assigned until node manager fixes the "node features". However, fixing the node features races with the "I am idle, please shut me down" logic.

Actions #8

Updated by Tom Clegg over 5 years ago

fwiw the "restore node features when they disappear" kludge runs every 2s, logs some evidence when it notices the features have disappeared, and logs some more if it fails to restore them. As of v1.1.4: source:lib/dispatchcloud/node_size.go@ff353df24233b40612eb228a8024ea30902397e9#L90

Actions #9

Updated by Peter Amstutz over 5 years ago

Tom Clegg wrote:

fwiw the "restore node features when they disappear" kludge runs every 2s, logs some evidence when it notices the features have disappeared, and logs some more if it fails to restore them. As of v1.1.4: source:lib/dispatchcloud/node_size.go@ff353df24233b40612eb228a8024ea30902397e9#L90

The "feature kludge" is only applied to the dummy node to ensure that slurm will accept jobs. I'm talking about the part where it corrects slurm node features on individual nodes, because I think we found that resetting the slurm controller wipes out every dynamic feature tag, right?

https://dev.arvados.org/projects/arvados/repository/revisions/master/entry/services/nodemanager/arvnodeman/computenode/dispatch/slurm.py#L108

This only runs each time the arvados node list is refreshed.
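
For context, the per-node correction boils down to reasserting the feature tags with scontrol, roughly like this (a hypothetical sketch, not the code at the line linked above):

import subprocess

def restore_node_features(nodename, features):
    # Reassert the dynamically assigned feature tags (e.g. "instancetype=Standard_D3_v2")
    # on a single SLURM node after the controller has forgotten them.
    subprocess.check_call([
        'scontrol', 'update',
        'NodeName=' + nodename,
        'Features=' + ','.join(features),
    ])

# e.g. restore_node_features('compute48', ['instancetype=Standard_D3_v2'])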

Actions #10

Updated by Peter Amstutz over 5 years ago

I'm also wondering if truncating the wishlist is a problem. That might explain why it believes that idle nodes are "excess" and can be shut down, because it isn't looking past the top of the wishlist.

The excess node count is: up_count - (counts["busy"] + self._size_wishlist(size))

So if there are 100 nodes up, 90 are busy, and the wishlist is 20 but truncated to 10, there are zero excess nodes. However, if one of those busy nodes becomes idle, there are 100 nodes up, 89 busy, and a wishlist of 10, so it considers there to be 1 excess node.
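
Checking the arithmetic with those numbers (a minimal sketch of the formula above, not the actual daemon code):

def excess_nodes(up_count, busy_count, wishlist_count):
    # up_count - (counts["busy"] + self._size_wishlist(size)), as above
    return up_count - (busy_count + wishlist_count)

print(excess_nodes(100, 90, 10))  # 0: wishlist of 20 truncated to 10, no excess
print(excess_nodes(100, 89, 10))  # 1: one busy node went idle and now looks "excess"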

However, it is more complicated than simply not truncating the wishlist. If we take the full wishlist and a particular node size is at the end of the list (low priority), we might keep an idle node alive even though nothing will be scheduled on it for a while, preventing it from being shut down and replaced with an appropriately sized node.

Actions #11

Updated by Peter Amstutz over 5 years ago

  • Subject changed from [Node manager] Cancel shutdowns when there is a wishlist for same size nodes to [Node manager] shutting down nodes when there is a wishlist for same size nodes

Another issue: I think we have a head-of-line blocking problem. The highest-priority job wants a Standard_D2_v2 but none is available, and nothing behind it can be scheduled until that one is. This leads to a bunch of idle nodes.

Actions #12

Updated by Peter Amstutz over 5 years ago

The priority suddenly falls off from 4294438976 to 1074; I don't understand why.

$ squeue --state=PENDING "--format=%c|%m|%d|%r|%j|%f|%Q" 
...
4|10256M|51024M|ReqNodeNotAvail, UnavailableNod|e51c5-dz642-fwyl5jum3562wkk|instancetype=Standard_D3_v2|4294438977
4|8448M|2G|ReqNodeNotAvail, UnavailableNod|e51c5-dz642-7mu6o90xkcaca4v|instancetype=Standard_D3_v2|4294438976
4|50256M|51024M|Priority|e51c5-dz642-o1tp8f26o78esns|instancetype=Standard_D13_v2|1074
4|10256M|51024M|Priority|e51c5-dz642-g2o0orhouqxz2kt|instancetype=Standard_D3_v2|1002

Re-running it, now a bunch of things have dropped down even more, hitting the floor at 2:

4|10256M|51024M|Priority|e51c5-dz642-0h24sg57t3legs6|instancetype=Standard_D3_v2|4294438537
4|10256M|51024M|Priority|e51c5-dz642-sd4ql4zdjhf26oq|instancetype=Standard_D3_v2|4294438535
2|3456M|2G|Priority|e51c5-dz642-i0pja2ii2ur1k4e|instancetype=Standard_D2_v2|75
2|3456M|2G|Priority|e51c5-dz642-g4dutstai867gh3|instancetype=Standard_D2_v2|72
...
2|4256M|2G|Priority|e51c5-dz642-xizgk2qqhwo78ra|instancetype=Standard_D2_v2|24
2|4256M|2G|Priority|e51c5-dz642-ukxzs5xigyviwpg|instancetype=Standard_D2_v2|21
2|4256M|2G|ReqNodeNotAvail, UnavailableNod|e51c5-dz642-db8g6h5ca8fgxm5|instancetype=Standard_D2_v2|2
2|4256M|2G|ReqNodeNotAvail, UnavailableNod|e51c5-dz642-ebd8ao2p6kcm99x|instancetype=Standard_D2_v2|2
Actions #13

Updated by Lucas Di Pentima over 5 years ago

Reviewing cb64653912a604809c3429f7dbda781741acd18a - branch 13804-no-shutdown-wanted-nodes

Ran nodemanager's integration tests locally - worked OK. Just a couple of small comments:

  • File services/nodemanager/arvnodeman/computenode/dispatch/__init__.py, line 461: The comment needs an update.
  • The new config parameter may need to be documented in the cfg file examples.

Other than that, LGTM.

Actions #14

Updated by Peter Amstutz over 4 years ago

  • Target version deleted (To Be Groomed)
  • Status changed from New to Resolved