Feature #8515

[Node Manager] Detect and shut down nodes that fail provisioning in Azure

Added by Nico César almost 3 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: Node Manager
Target version:
Start date: 02/22/2016
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

Provisioning of the compute node was in the "Failed" state:

$ azure vm list qr1hi
info:    Executing command vm list
+ Getting virtual machines                                                     
data:    Name                           ProvisioningState  PowerState  Location  Size          
data:    -----------------------------  -----------------  ----------  --------  --------------
data:    compute-5l19dkbzi1shjqk-qr1hi  Failed                         eastus2   Standard_D1_v2

but the Node Manager logs show it as "booting":

2016-02-22_18:51:04.28146 2016-02-22 18:51:04 NodeManagerDaemonActor.dc7dddf0a71f[1196] INFO: Standard_D1_v2: wishlist 1, up 1 (booting 1, idle 0, busy 0), shutting down 0

Then a strange state appeared:

2016-02-22_18:52:04.30416 2016-02-22 18:52:04 NodeManagerDaemonActor.dc7dddf0a71f[1196] INFO: Standard_D1_v2: wishlist 2, up 2 (booting 3, idle -1, busy 0), shutting down 1
2016-02-22_18:52:04.30454 2016-02-22 18:52:04 JobQueueMonitorActor.139896115515440[1196] INFO: got response with 2 items in 0.102003097534 seconds, next poll at 2016-02-22 18:52:14
2016-02-22_18:52:04.64431 2016-02-22 18:52:04 ArvadosNodeListMonitorActor.139896125562032[1196] DEBUG: sending request
2016-02-22_18:52:05.35226 2016-02-22 18:52:05 ArvadosNodeListMonitorActor.139896125562032[1196] INFO: got response with 340 items in 0.711224079132 seconds, next poll at 2016-02-22 18:52:14
2016-02-22_18:52:11.29016 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Shutdown success
2016-02-22_18:52:11.29764 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: finished
2016-02-22_18:52:11.29865 2016-02-22 18:52:11 pykka[1196] DEBUG: Unregistered ComputeNodeShutdownActor (urn:uuid:cc3d6094-ec1f-45d6-abe6-9b4bde9f1a62)
2016-02-22_18:52:11.29887 2016-02-22 18:52:11 pykka[1196] DEBUG: Stopped ComputeNodeShutdownActor (urn:uuid:cc3d6094-ec1f-45d6-abe6-9b4bde9f1a62)

Note the "idle -1".
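The negative idle count is consistent with idle being derived rather than counted directly. This is a hedged sketch, assuming (based only on the log line format "up 2 (booting 3, idle -1, busy 0)", not on the actual Node Manager source) that idle = up - booting - busy, so a booting count that outruns the up count drives idle below zero:

```python
# Assumed bookkeeping, inferred from the log line format; the real
# Node Manager accounting may differ.
def summarize(up, booting, busy):
    # If idle is derived from the other counters, it can go negative
    # whenever booting exceeds the number of nodes counted as up.
    idle = up - booting - busy
    return "up %d (booting %d, idle %d, busy %d)" % (up, booting, idle, busy)

print(summarize(1, 1, 0))  # up 1 (booting 1, idle 0, busy 0)
print(summarize(2, 3, 0))  # up 2 (booting 3, idle -1, busy 0)
```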

And finally, shutdown was suggested:

2016-02-22_18:52:44.00145 2016-02-22 18:52:44 ComputeNodeMonitorActor.7177662417db.compute-5l19dkbzi1shjqk-qr1hi[1196] DEBUG: Suggesting shutdown.

But note the reversed order of the ComputeNodeShutdownActor and ComputeNodeMonitorActor events:

$ grep 5l19dkbzi1shjqk current
2016-02-22_18:51:49.18819 2016-02-22 18:51:49 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Starting shutdown
2016-02-22_18:52:11.29016 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Shutdown success
2016-02-22_18:52:11.29764 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: finished
2016-02-22_18:52:44.00145 2016-02-22 18:52:44 ComputeNodeMonitorActor.7177662417db.compute-5l19dkbzi1shjqk-qr1hi[1196] DEBUG: Suggesting shutdown.

See the attached logs for further details.

current.gz (18.7 KB), Nico César, 02/22/2016 06:57 PM

History

#1 Updated by Brett Smith almost 3 years ago

  • Tracker changed from Bug to Feature
  • Subject changed from [NODEMANAGER] Azure's failures aren't catched to [Node Manager] Detect and shut down nodes that fail provisioning in Azure
  • Category set to Node Manager

Nico Cesar wrote:

provisioning of the compute node was in "Failed" state

Right now Node Manager starts booting a node, and sets a message to check on it after boot_fail_after seconds. If the node hasn't pinged Arvados by then, it gets shut down. That's basically what happened here. We can improve Node Manager's responsiveness by detecting provisioning failures and suggesting shutdown more proactively in that case.
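The proactive detection could look roughly like the following. This is a minimal sketch, not the actual Node Manager implementation: the `provisioningState` key and the `FakeNode` helper are assumptions for illustration (libcloud drivers expose cloud-specific details through a node's `extra` dict, but the exact key for Azure's provisioning state would need to be confirmed against the driver in use):

```python
# Hedged sketch: decide whether a cloud node's provisioning has failed,
# so a monitor actor could suggest shutdown immediately instead of
# waiting out boot_fail_after.

FAILED_STATES = {'Failed', 'Canceled'}  # assumed terminal failure states

def provisioning_failed(cloud_node):
    """Return True if the cloud-side record reports failed provisioning.

    The 'provisioningState' key in node.extra is an assumption for this
    sketch; the real libcloud Azure driver may use a different key.
    """
    return cloud_node.extra.get('provisioningState') in FAILED_STATES

class FakeNode:
    """Stand-in for a libcloud Node, carrying only the extra dict."""
    def __init__(self, extra):
        self.extra = extra

failed = FakeNode({'provisioningState': 'Failed'})
healthy = FakeNode({'provisioningState': 'Succeeded'})
print(provisioning_failed(failed))   # True
print(provisioning_failed(healthy))  # False
```

With a check like this in the monitor actor's polling path, the node in this ticket would have been flagged for shutdown as soon as the cloud listing reported "Failed", rather than after the boot_fail_after timeout.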

but note the wrong order of ComputeNodeShutdownActor and ComputeNodeMonitorActor:

This is not a big deal. The MonitorActor is stopped only after a node no longer appears in the cloud's listings. Thanks to eventual consistency, it's normal for there to be a small gap between when a ShutdownActor finishes and when the node is removed from the cloud listings, which is what stops the MonitorActor.

#2 Updated by Brett Smith almost 3 years ago

  • Target version set to Arvados Future Sprints
