Feature #8515
closed[Node Manager] Detect and shut down nodes that fail provisioning in Azure
Description
provisioning of the compute node was in "Failed" state
$ azure vm list qr1hi info: Executing command vm list + Getting virtual machines data: Name ProvisioningState PowerState Location Size data: ----------------------------- ----------------- ---------- -------- -------------- data: compute-5l19dkbzi1shjqk-qr1hi Failed eastus2 Standard_D1_v2
but nodemanager logs are showing it as "booting":
2016-02-22_18:51:04.28146 2016-02-22 18:51:04 NodeManagerDaemonActor.dc7dddf0a71f[1196] INFO: Standard_D1_v2: wishlist 1, up 1 (booting 1, idle 0, busy 0), shutting down 0
then a weird state happened:
2016-02-22_18:52:04.30416 2016-02-22 18:52:04 NodeManagerDaemonActor.dc7dddf0a71f[1196] INFO: Standard_D1_v2: wishlist 2, up 2 (booting 3, idle -1, busy 0), shutting down 1 2016-02-22_18:52:04.30454 2016-02-22 18:52:04 JobQueueMonitorActor.139896115515440[1196] INFO: got response with 2 items in 0.102003097534 seconds, next poll at 2016-02-22 18:52:14 2016-02-22_18:52:04.64431 2016-02-22 18:52:04 ArvadosNodeListMonitorActor.139896125562032[1196] DEBUG: sending request 2016-02-22_18:52:05.35226 2016-02-22 18:52:05 ArvadosNodeListMonitorActor.139896125562032[1196] INFO: got response with 340 items in 0.711224079132 seconds, next poll at 2016-02-22 18:52:14 2016-02-22_18:52:11.29016 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Shutdown success 2016-02-22_18:52:11.29764 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: finished 2016-02-22_18:52:11.29865 2016-02-22 18:52:11 pykka[1196] DEBUG: Unregistered ComputeNodeShutdownActor (urn:uuid:cc3d6094-ec1f-45d6-abe6-9b4bde9f1a62) 2016-02-22_18:52:11.29887 2016-02-22 18:52:11 pykka[1196] DEBUG: Stopped ComputeNodeShutdownActor (urn:uuid:cc3d6094-ec1f-45d6-abe6-9b4bde9f1a62)
note "idle -1"
and finally shutdown.
2016-02-22_18:52:44.00145 2016-02-22 18:52:44 ComputeNodeMonitorActor.7177662417db.compute-5l19dkbzi1shjqk-qr1hi[1196] DEBUG: Suggesting shutdown.
but note the wrong order of ComputeNodeShutdownActor and ComputeNodeMonitorActor:
$ grep 5l19dkbzi1shjqk current 2016-02-22_18:51:49.18819 2016-02-22 18:51:49 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Starting shutdown 2016-02-22_18:52:11.29016 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: Shutdown success 2016-02-22_18:52:11.29764 2016-02-22 18:52:11 ComputeNodeShutdownActor.9b4bde9f1a62.compute-5l19dkbzi1shjqk-qr1hi[1196] INFO: finished 2016-02-22_18:52:44.00145 2016-02-22 18:52:44 ComputeNodeMonitorActor.7177662417db.compute-5l19dkbzi1shjqk-qr1hi[1196] DEBUG: Suggesting shutdown.
see attachment logs for further details.
Files
Updated by Brett Smith about 8 years ago
- Tracker changed from Bug to Feature
- Subject changed from [NODEMANAGER] Azure's failures aren't catched to [Node Manager] Detect and shut down nodes that fail provisioning in Azure
- Category set to Node Manager
Nico Cesar wrote:
provisioning of the compute node was in "Failed" state
Right now Node Manager starts booting a node, and sets a message to check on it after boot_fail_after seconds. If the node hasn't pinged Arvados by then, it gets shut down. That's basically what happened here. We can improve Node Manager's responsiveness by detecting provisioning failures and suggesting shutdown more proactively in that case.
but note the wrong order of ComputeNodeShutdownActor and ComputeNodeMonitorActor:
This is not a big deal. The MonitorActor is stopped after a node no longer appears in a cloud's listings. Thanks to eventual consistency, it's normal for there to be a small gap between when a ShutdownActor finished, and when the node is removed from cloud listings, stopping the MonitorActor.
Updated by Brett Smith about 8 years ago
- Target version set to Arvados Future Sprints
Updated by Ward Vandewege over 3 years ago
- Target version deleted (
Arvados Future Sprints)