Project

General

Profile

Actions

Feature #8515

closed

[Node Manager] Detect and shut down nodes that fail provisioning in Azure

Added by Nico César about 8 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
Node Manager
Target version:
-
Story points:
-

Description

provisioning of the compute node was in "Failed" state

$ azure vm list qr1hi
info:    Executing command vm list
+ Getting virtual machines                                                     
data:    Name                           ProvisioningState  PowerState  Location  Size          
data:    -----------------------------  -----------------  ----------  --------  --------------
data:    compute-5l19dkbzi1shjqk-qr1hi  Failed                         eastus2   Standard_D1_v2

but nodemanager logs are showing it as "booting":

2016-02-22_18:51:04.28146 2016-02-22 18:51:04 NodeManagerDaemonActor.dc7dddf0a71f[1196] INFO: Standard_D1_v2: wishlist 1, up 1 (booting 1, idle 0, busy 0), shutting down 0

then a weird state happened:

2016-02-22_18:52:04.30416 2016-02-22 18:52:04 NodeManagerDaemonActor.dc7dddf0a71f[1196] INFO: Standard_D1_v2: wishlist 2, up 2 (booting 3, idle -1, busy 0), shutting down 1
2016-02-22_18:52:04.30454 2016-02-22 18:52:04 JobQueueMonitorActor.139896115515440[1196] INFO: got response with 2 items in 0.102003097534 seconds, next poll at 2016-02-22 18:52:14
2016-02-22_18:52:04.64431 2016-02-22 18:52:04 ArvadosNodeListMonitorActor.139896125562032[1196] DEBUG: sending request
2016-02-22_18:52:05.35226 2016-02-22 18:52:05 ArvadosNodeListMonitorActor.139896125562032[1196] INFO: got response with 340 items in 0.711224079132 seconds, next poll at 2016-02-22 18:52:14
2016-02-22_18:52:11.29016 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Shutdown success
2016-02-22_18:52:11.29764 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: finished
2016-02-22_18:52:11.29865 2016-02-22 18:52:11 pykka[1196] DEBUG: Unregistered ComputeNodeShutdownActor (urn:uuid:cc3d6094-ec1f-45d6-abe6-9b4bde9f1a62)
2016-02-22_18:52:11.29887 2016-02-22 18:52:11 pykka[1196] DEBUG: Stopped ComputeNodeShutdownActor (urn:uuid:cc3d6094-ec1f-45d6-abe6-9b4bde9f1a62)

note "idle -1"

and finally shutdown.

2016-02-22_18:52:44.00145 2016-02-22 18:52:44 ComputeNodeMonitorActor.7177662417db.compute-5l19dkbzi1shjqk-qr1hi[1196] DEBUG: Suggesting shutdown.

but note the wrong order of ComputeNodeShutdownActor and ComputeNodeMonitorActor:

$ grep 5l19dkbzi1shjqk current
2016-02-22_18:51:49.18819 2016-02-22 18:51:49 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Starting shutdown
2016-02-22_18:52:11.29016 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Shutdown success
2016-02-22_18:52:11.29764 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: finished
2016-02-22_18:52:44.00145 2016-02-22 18:52:44 ComputeNodeMonitorActor.7177662417db.compute-5l19dkbzi1shjqk-qr1hi[1196] DEBUG: Suggesting shutdown.

see attachment logs for further details.


Files

current.gz (18.7 KB) current.gz Nico César, 02/22/2016 06:57 PM
Actions

Also available in: Atom PDF