Bug #8953

closed

[Node manager] can not shut down nodes anymore

Added by Ward Vandewege about 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -

Description

Node manager version 0.1.20160410021132-1 (with the drain fix from #8799) appears to have a shutdown loop...

2016-04-12_21:55:42.54948 2016-04-12 21:55:42 ComputeNodeShutdownActor.01df952eda4b.compute-q74ljtuggvufc9l-su92l[46608] INFO: Draining SLURM node compute19
2016-04-12_21:55:42.63657 2016-04-12 21:55:42 ComputeNodeShutdownActor.01df952eda4b.compute-q74ljtuggvufc9l-su92l[46608] INFO: Waiting for SLURM node compute19 to drain
2016-04-12_21:55:42.71020 2016-04-12 21:55:42 ComputeNodeShutdownActor.01df952eda4b.compute-q74ljtuggvufc9l-su92l[46608] INFO: Shutdown cancelled: shutdown window closed.
2016-04-12_21:55:42.71048 2016-04-12 21:55:42 ComputeNodeShutdownActor.01df952eda4b.compute-q74ljtuggvufc9l-su92l[46608] INFO: finished
2016-04-12_21:55:43.14420 2016-04-12 21:55:43 pykka[46608] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:58d90fdd-7dee-4219-9ebb-429da8a893ef)
2016-04-12_21:55:43.14471 2016-04-12 21:55:43 pykka[46608] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:58d90fdd-7dee-4219-9ebb-429da8a893ef)
2016-04-12_21:55:43.15100 2016-04-12 21:55:43 ComputeNodeShutdownActor.429da8a893ef.compute-csm26hx92t8pf8o-su92l[46608] INFO: Draining SLURM node compute20
2016-04-12_21:55:43.27655 2016-04-12 21:55:43 ComputeNodeShutdownActor.429da8a893ef.compute-csm26hx92t8pf8o-su92l[46608] INFO: Waiting for SLURM node compute20 to drain
2016-04-12_21:55:43.36943 2016-04-12 21:55:43 ComputeNodeShutdownActor.429da8a893ef.compute-csm26hx92t8pf8o-su92l[46608] INFO: Shutdown cancelled: shutdown window closed.
2016-04-12_21:55:43.36970 2016-04-12 21:55:43 ComputeNodeShutdownActor.429da8a893ef.compute-csm26hx92t8pf8o-su92l[46608] INFO: finished
2016-04-12_21:55:43.87084 2016-04-12 21:55:43 pykka[46608] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:f1192c0a-dbad-4bd6-9a70-949cb086f5bf)
2016-04-12_21:55:43.87123 2016-04-12 21:55:43 pykka[46608] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:f1192c0a-dbad-4bd6-9a70-949cb086f5bf)
2016-04-12_21:55:43.87793 2016-04-12 21:55:43 ComputeNodeShutdownActor.949cb086f5bf.compute-1q9g2jmg5askowg-su92l[46608] INFO: Draining SLURM node compute36
2016-04-12_21:55:44.08007 2016-04-12 21:55:44 ComputeNodeShutdownActor.949cb086f5bf.compute-1q9g2jmg5askowg-su92l[46608] INFO: Waiting for SLURM node compute36 to drain
2016-04-12_21:55:44.15134 2016-04-12 21:55:44 ComputeNodeShutdownActor.949cb086f5bf.compute-1q9g2jmg5askowg-su92l[46608] INFO: Shutdown cancelled: shutdown window closed.
2016-04-12_21:55:44.15164 2016-04-12 21:55:44 ComputeNodeShutdownActor.949cb086f5bf.compute-1q9g2jmg5askowg-su92l[46608] INFO: finished
2016-04-12_21:55:44.57382 2016-04-12 21:55:44 pykka[46608] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:df6d159b-1837-4308-a409-e76748ab7dc6)
2016-04-12_21:55:44.57430 2016-04-12 21:55:44 pykka[46608] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:df6d159b-1837-4308-a409-e76748ab7dc6)
2016-04-12_21:55:44.58096 2016-04-12 21:55:44 ComputeNodeShutdownActor.e76748ab7dc6.compute-rs1ayc75z1atwki-su92l[46608] INFO: Draining SLURM node compute41
2016-04-12_21:55:44.69442 2016-04-12 21:55:44 ComputeNodeShutdownActor.e76748ab7dc6.compute-rs1ayc75z1atwki-su92l[46608] INFO: Waiting for SLURM node compute41 to drain
2016-04-12_21:55:44.75848 2016-04-12 21:55:44 ComputeNodeShutdownActor.e76748ab7dc6.compute-rs1ayc75z1atwki-su92l[46608] INFO: Shutdown cancelled: shutdown window closed.
2016-04-12_21:55:44.75878 2016-04-12 21:55:44 ComputeNodeShutdownActor.e76748ab7dc6.compute-rs1ayc75z1atwki-su92l[46608] INFO: finished
2016-04-12_21:55:45.13360 2016-04-12 21:55:45 ComputeNodeMonitorActor.1dab685fdd8a.compute-u3ps8tsf7ygy0bw-su92l[46608] DEBUG: Cannot shut down because node is not idle.
2016-04-12_21:55:45.32131 2016-04-12 21:55:45 pykka[46608] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:955f8a87-54e4-4c11-8dd5-d6359dd39c83)
2016-04-12_21:55:45.32186 2016-04-12 21:55:45 pykka[46608] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:955f8a87-54e4-4c11-8dd5-d6359dd39c83)
2016-04-12_21:55:45.32826 2016-04-12 21:55:45 ComputeNodeShutdownActor.d6359dd39c83.compute-iaiguf1czbby51n-su92l[46608] INFO: Draining SLURM node compute30
2016-04-12_21:55:45.41042 2016-04-12 21:55:45 ComputeNodeShutdownActor.d6359dd39c83.compute-iaiguf1czbby51n-su92l[46608] INFO: Waiting for SLURM node compute30 to drain
2016-04-12_21:55:45.47984 2016-04-12 21:55:45 ComputeNodeShutdownActor.d6359dd39c83.compute-iaiguf1czbby51n-su92l[46608] INFO: Shutdown cancelled: shutdown window closed.
2016-04-12_21:55:45.48136 2016-04-12 21:55:45 ComputeNodeShutdownActor.d6359dd39c83.compute-iaiguf1czbby51n-su92l[46608] INFO: finished
2016-04-12_21:55:45.94573 2016-04-12 21:55:45 pykka[46608] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:95f04d93-2173-4214-a811-216cc861cf67)

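This sequence repeats indefinitely: each ComputeNodeShutdownActor starts draining its node, the shutdown is cancelled within a fraction of a second with "shutdown window closed", and a new actor is then started for another node.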
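One way to confirm the loop from a saved copy of the log is to pair each "Draining SLURM node" line with a later "Shutdown cancelled" line from the same actor. The following is only a rough sketch: the function name and the regular expression are illustrative assumptions based on the log lines quoted above, not part of node manager.

```python
import re

# Assumed line layout, taken from the excerpt above, e.g.
# "... ComputeNodeShutdownActor.01df952eda4b.compute-...-su92l[46608] INFO: Draining SLURM node compute19"
LINE_RE = re.compile(
    r"ComputeNodeShutdownActor\.(?P<actor>[0-9a-f]+)\.\S+\[\d+\] INFO: (?P<msg>.*)$"
)


def summarize_shutdown_attempts(lines):
    """Return (draining, cancelled).

    draining maps actor id -> SLURM node the actor tried to drain;
    cancelled is the set of actor ids that logged "Shutdown cancelled".
    """
    draining = {}
    cancelled = set()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # pykka and monitor-actor lines are ignored
        actor, msg = match.group("actor"), match.group("msg")
        if msg.startswith("Draining SLURM node "):
            draining[actor] = msg.rsplit(" ", 1)[-1]
        elif msg.startswith("Shutdown cancelled"):
            cancelled.add(actor)
    return draining, cancelled
```

If every actor in `draining` also appears in `cancelled`, no shutdown attempt in the sampled window got past the drain step, which matches the behaviour reported here.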

A snapshot of sinfo:

 sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite     67 drain* compute[2,11,16,28,31,34,38-39,48,50,54-55,57-58,60,65,67-68,70,78,97-99,101,104-106,109,115,119,122,124,126,129-131,133-136,138,140-144,147,149-151,153-157,162,172-179,181-183]
compute*     up   infinite     74  down* compute[47,158,184-255]
compute*     up   infinite     12  drain compute[87,89-90,94-96,110,113,120,125,127,170]
compute*     up   infinite     57  alloc compute[0-1,3-5,7-10,12-14,17-18,21-27,29,33,43-44,46,61,63-64,66,72-74,76-77,81-82,84-86,88,91-93,100,102-103,107-108,111-112,114,116-118,121,123]
compute*     up   infinite     37   idle compute[6,15,19-20,30,32,35-37,40-42,45,49,51-53,56,59,62,69,71,75,79-80,83,132,137,139,146,148,152,161,163,167,171,180]
compute*     up   infinite      9   down compute[128,145,159-160,164-166,168-169]
crypto       up   infinite     67 drain* compute[2,11,16,28,31,34,38-39,48,50,54-55,57-58,60,65,67-68,70,78,97-99,101,104-106,109,115,119,122,124,126,129-131,133-136,138,140-144,147,149-151,153-157,162,172-179,181-183]
crypto       up   infinite     74  down* compute[47,158,184-255]
crypto       up   infinite     12  drain compute[87,89-90,94-96,110,113,120,125,127,170]
crypto       up   infinite     57  alloc compute[0-1,3-5,7-10,12-14,17-18,21-27,29,33,43-44,46,61,63-64,66,72-74,76-77,81-82,84-86,88,91-93,100,102-103,107-108,111-112,114,116-118,121,123]
crypto       up   infinite     37   idle compute[6,15,19-20,30,32,35-37,40-42,45,49,51-53,56,59,62,69,71,75,79-80,83,132,137,139,146,148,152,161,163,167,171,180]
crypto       up   infinite      9   down compute[128,145,159-160,164-166,168-169]
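To compare snapshots like the one above over time, the per-state node counts can be pulled out of the default `sinfo` output. This is a minimal sketch that assumes the six-column layout shown here (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST); the function name is hypothetical.

```python
def nodes_by_state(sinfo_text):
    """Map partition name -> {state: node count} from default sinfo output.

    Assumes the six whitespace-separated columns shown in the snapshot.
    A trailing "*" on a state (node not responding) is kept as part of
    the state; the "*" marking the default partition is stripped.
    """
    counts = {}
    for line in sinfo_text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0] == "PARTITION":
            continue  # skip the header, the command line, and blank lines
        partition, _avail, _timelimit, nodes, state, _nodelist = fields
        counts.setdefault(partition.rstrip("*"), {})[state] = int(nodes)
    return counts
```

Running this on successive snapshots would show whether the idle count ever shrinks, that is, whether any of the idle nodes is actually being shut down.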

Instead of rolling su92l back to the previous release, I'm upgrading node manager to version 0.1.20160412145128 and will report back on this ticket.


Subtasks 1 (0 open, 1 closed)

Task #8957: Review 8953-node-manager-prevent-shutdown-eligible-flapping-wip (Resolved, 04/13/2016)

Related issues

Related to Arvados - Idea #8000: [Node Manager] Shut down nodes in SLURM 'down' state (Resolved)