Actions
Feature #8515
closed[Node Manager] Detect and shut down nodes that fail provisioning in Azure
Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
Node Manager
Target version:
-
Story points:
-
Description
provisioning of the compute node was in "Failed" state
$ azure vm list qr1hi info: Executing command vm list + Getting virtual machines data: Name ProvisioningState PowerState Location Size data: ----------------------------- ----------------- ---------- -------- -------------- data: compute-5l19dkbzi1shjqk-qr1hi Failed eastus2 Standard_D1_v2
but nodemanager logs are showing it as "booting":
2016-02-22_18:51:04.28146 2016-02-22 18:51:04 NodeManagerDaemonActor.dc7dddf0a71f[1196] INFO: Standard_D1_v2: wishlist 1, up 1 (booting 1, idle 0, busy 0), shutting down 0
then a weird state happened:
2016-02-22_18:52:04.30416 2016-02-22 18:52:04 NodeManagerDaemonActor.dc7dddf0a71f[1196] INFO: Standard_D1_v2: wishlist 2, up 2 (booting 3, idle -1, busy 0), shutting down 1 2016-02-22_18:52:04.30454 2016-02-22 18:52:04 JobQueueMonitorActor.139896115515440[1196] INFO: got response with 2 items in 0.102003097534 seconds, next poll at 2016-02-22 18:52:14 2016-02-22_18:52:04.64431 2016-02-22 18:52:04 ArvadosNodeListMonitorActor.139896125562032[1196] DEBUG: sending request 2016-02-22_18:52:05.35226 2016-02-22 18:52:05 ArvadosNodeListMonitorActor.139896125562032[1196] INFO: got response with 340 items in 0.711224079132 seconds, next poll at 2016-02-22 18:52:14 2016-02-22_18:52:11.29016 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Shutdown success 2016-02-22_18:52:11.29764 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: finished 2016-02-22_18:52:11.29865 2016-02-22 18:52:11 pykka[1196] DEBUG: Unregistered ComputeNodeShutdownActor (urn:uuid:cc3d6094-ec1f-45d6-abe6-9b4bde9f1a62) 2016-02-22_18:52:11.29887 2016-02-22 18:52:11 pykka[1196] DEBUG: Stopped ComputeNodeShutdownActor (urn:uuid:cc3d6094-ec1f-45d6-abe6-9b4bde9f1a62)
note "idle -1"
and finally shutdown.
2016-02-22_18:52:44.00145 2016-02-22 18:52:44 ComputeNodeMonitorActor.7177662417db.compute-5l19dkbzi1shjqk-qr1hi[1196] DEBUG: Suggesting shutdown.
but note the wrong order of ComputeNodeShutdownActor and ComputeNodeMonitorActor:
$ grep 5l19dkbzi1shjqk current 2016-02-22_18:51:49.18819 2016-02-22 18:51:49 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Starting shutdown 2016-02-22_18:52:11.29016 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Shutdown success 2016-02-22_18:52:11.29764 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: finished 2016-02-22_18:52:44.00145 2016-02-22 18:52:44 ComputeNodeMonitorActor.7177662417db.compute-5l19dkbzi1shjqk-qr1hi[1196] DEBUG: Suggesting shutdown.
see attachment logs for further details.
Files
Actions