Bug #8953
closed
[Node manager] can not shut down nodes anymore
Story points: -
Description
Node manager version 0.1.20160410021132-1 (with the drain fix from #8799) appears to have a shutdown loop...
2016-04-12_21:55:42.54948 2016-04-12 21:55:42 ComputeNodeShutdownActor.01df952eda4b.compute-q74ljtuggvufc9l-su92l[46608] INFO: Draining SLURM node compute19
2016-04-12_21:55:42.63657 2016-04-12 21:55:42 ComputeNodeShutdownActor.01df952eda4b.compute-q74ljtuggvufc9l-su92l[46608] INFO: Waiting for SLURM node compute19 to drain
2016-04-12_21:55:42.71020 2016-04-12 21:55:42 ComputeNodeShutdownActor.01df952eda4b.compute-q74ljtuggvufc9l-su92l[46608] INFO: Shutdown cancelled: shutdown window closed.
2016-04-12_21:55:42.71048 2016-04-12 21:55:42 ComputeNodeShutdownActor.01df952eda4b.compute-q74ljtuggvufc9l-su92l[46608] INFO: finished
2016-04-12_21:55:43.14420 2016-04-12 21:55:43 pykka[46608] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:58d90fdd-7dee-4219-9ebb-429da8a893ef)
2016-04-12_21:55:43.14471 2016-04-12 21:55:43 pykka[46608] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:58d90fdd-7dee-4219-9ebb-429da8a893ef)
2016-04-12_21:55:43.15100 2016-04-12 21:55:43 ComputeNodeShutdownActor.429da8a893ef.compute-csm26hx92t8pf8o-su92l[46608] INFO: Draining SLURM node compute20
2016-04-12_21:55:43.27655 2016-04-12 21:55:43 ComputeNodeShutdownActor.429da8a893ef.compute-csm26hx92t8pf8o-su92l[46608] INFO: Waiting for SLURM node compute20 to drain
2016-04-12_21:55:43.36943 2016-04-12 21:55:43 ComputeNodeShutdownActor.429da8a893ef.compute-csm26hx92t8pf8o-su92l[46608] INFO: Shutdown cancelled: shutdown window closed.
2016-04-12_21:55:43.36970 2016-04-12 21:55:43 ComputeNodeShutdownActor.429da8a893ef.compute-csm26hx92t8pf8o-su92l[46608] INFO: finished
2016-04-12_21:55:43.87084 2016-04-12 21:55:43 pykka[46608] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:f1192c0a-dbad-4bd6-9a70-949cb086f5bf)
2016-04-12_21:55:43.87123 2016-04-12 21:55:43 pykka[46608] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:f1192c0a-dbad-4bd6-9a70-949cb086f5bf)
2016-04-12_21:55:43.87793 2016-04-12 21:55:43 ComputeNodeShutdownActor.949cb086f5bf.compute-1q9g2jmg5askowg-su92l[46608] INFO: Draining SLURM node compute36
2016-04-12_21:55:44.08007 2016-04-12 21:55:44 ComputeNodeShutdownActor.949cb086f5bf.compute-1q9g2jmg5askowg-su92l[46608] INFO: Waiting for SLURM node compute36 to drain
2016-04-12_21:55:44.15134 2016-04-12 21:55:44 ComputeNodeShutdownActor.949cb086f5bf.compute-1q9g2jmg5askowg-su92l[46608] INFO: Shutdown cancelled: shutdown window closed.
2016-04-12_21:55:44.15164 2016-04-12 21:55:44 ComputeNodeShutdownActor.949cb086f5bf.compute-1q9g2jmg5askowg-su92l[46608] INFO: finished
2016-04-12_21:55:44.57382 2016-04-12 21:55:44 pykka[46608] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:df6d159b-1837-4308-a409-e76748ab7dc6)
2016-04-12_21:55:44.57430 2016-04-12 21:55:44 pykka[46608] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:df6d159b-1837-4308-a409-e76748ab7dc6)
2016-04-12_21:55:44.58096 2016-04-12 21:55:44 ComputeNodeShutdownActor.e76748ab7dc6.compute-rs1ayc75z1atwki-su92l[46608] INFO: Draining SLURM node compute41
2016-04-12_21:55:44.69442 2016-04-12 21:55:44 ComputeNodeShutdownActor.e76748ab7dc6.compute-rs1ayc75z1atwki-su92l[46608] INFO: Waiting for SLURM node compute41 to drain
2016-04-12_21:55:44.75848 2016-04-12 21:55:44 ComputeNodeShutdownActor.e76748ab7dc6.compute-rs1ayc75z1atwki-su92l[46608] INFO: Shutdown cancelled: shutdown window closed.
2016-04-12_21:55:44.75878 2016-04-12 21:55:44 ComputeNodeShutdownActor.e76748ab7dc6.compute-rs1ayc75z1atwki-su92l[46608] INFO: finished
2016-04-12_21:55:45.13360 2016-04-12 21:55:45 ComputeNodeMonitorActor.1dab685fdd8a.compute-u3ps8tsf7ygy0bw-su92l[46608] DEBUG: Cannot shut down because node is not idle.
2016-04-12_21:55:45.32131 2016-04-12 21:55:45 pykka[46608] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:955f8a87-54e4-4c11-8dd5-d6359dd39c83)
2016-04-12_21:55:45.32186 2016-04-12 21:55:45 pykka[46608] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:955f8a87-54e4-4c11-8dd5-d6359dd39c83)
2016-04-12_21:55:45.32826 2016-04-12 21:55:45 ComputeNodeShutdownActor.d6359dd39c83.compute-iaiguf1czbby51n-su92l[46608] INFO: Draining SLURM node compute30
2016-04-12_21:55:45.41042 2016-04-12 21:55:45 ComputeNodeShutdownActor.d6359dd39c83.compute-iaiguf1czbby51n-su92l[46608] INFO: Waiting for SLURM node compute30 to drain
2016-04-12_21:55:45.47984 2016-04-12 21:55:45 ComputeNodeShutdownActor.d6359dd39c83.compute-iaiguf1czbby51n-su92l[46608] INFO: Shutdown cancelled: shutdown window closed.
2016-04-12_21:55:45.48136 2016-04-12 21:55:45 ComputeNodeShutdownActor.d6359dd39c83.compute-iaiguf1czbby51n-su92l[46608] INFO: finished
2016-04-12_21:55:45.94573 2016-04-12 21:55:45 pykka[46608] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:95f04d93-2173-4214-a811-216cc861cf67)
This pattern (drain, wait, "Shutdown cancelled: shutdown window closed.", finished) repeats over and over, with no node actually shutting down.
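To gauge how widespread the loop is, the log can be tallied per node. The following is a minimal sketch, not part of node manager: it assumes the log has one entry per line in the format shown above, and the function name `count_cancelled_shutdowns` and both regular expressions are made up for this illustration.

```python
import re
from collections import Counter

# Both patterns are assumptions derived from the log excerpt in this ticket.
# Group 1 is the short actor id, used to tie a cancellation back to its node.
DRAIN_RE = re.compile(
    r"ComputeNodeShutdownActor\.(\w+)\.\S+ INFO: Draining SLURM node (\S+)"
)
CANCEL_RE = re.compile(
    r"ComputeNodeShutdownActor\.(\w+)\.\S+ INFO: Shutdown cancelled: (.+)$"
)


def count_cancelled_shutdowns(lines):
    """Count cancelled shutdowns per (SLURM node, reason) pair."""
    node_by_actor = {}   # actor id -> SLURM node name from the "Draining" entry
    cancelled = Counter()
    for line in lines:
        line = line.rstrip()
        drain = DRAIN_RE.search(line)
        if drain:
            node_by_actor[drain.group(1)] = drain.group(2)
            continue
        cancel = CANCEL_RE.search(line)
        if cancel:
            # Fall back to "unknown" if the matching "Draining" entry was not seen.
            node = node_by_actor.get(cancel.group(1), "unknown")
            cancelled[(node, cancel.group(2))] += 1
    return cancelled
```

Feeding it an open log file (for example `count_cancelled_shutdowns(open(path))`, where `path` is wherever the node manager log lives) and calling `.most_common()` on the result would show which nodes are cycling and how often. A node that appears many times with the reason "shutdown window closed." is caught in the loop described here.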
A snapshot of sinfo:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE  NODELIST
compute*  up    infinite  67    drain* compute[2,11,16,28,31,34,38-39,48,50,54-55,57-58,60,65,67-68,70,78,97-99,101,104-106,109,115,119,122,124,126,129-131,133-136,138,140-144,147,149-151,153-157,162,172-179,181-183]
compute*  up    infinite  74    down*  compute[47,158,184-255]
compute*  up    infinite  12    drain  compute[87,89-90,94-96,110,113,120,125,127,170]
compute*  up    infinite  57    alloc  compute[0-1,3-5,7-10,12-14,17-18,21-27,29,33,43-44,46,61,63-64,66,72-74,76-77,81-82,84-86,88,91-93,100,102-103,107-108,111-112,114,116-118,121,123]
compute*  up    infinite  37    idle   compute[6,15,19-20,30,32,35-37,40-42,45,49,51-53,56,59,62,69,71,75,79-80,83,132,137,139,146,148,152,161,163,167,171,180]
compute*  up    infinite  9     down   compute[128,145,159-160,164-166,168-169]
crypto    up    infinite  67    drain* compute[2,11,16,28,31,34,38-39,48,50,54-55,57-58,60,65,67-68,70,78,97-99,101,104-106,109,115,119,122,124,126,129-131,133-136,138,140-144,147,149-151,153-157,162,172-179,181-183]
crypto    up    infinite  74    down*  compute[47,158,184-255]
crypto    up    infinite  12    drain  compute[87,89-90,94-96,110,113,120,125,127,170]
crypto    up    infinite  57    alloc  compute[0-1,3-5,7-10,12-14,17-18,21-27,29,33,43-44,46,61,63-64,66,72-74,76-77,81-82,84-86,88,91-93,100,102-103,107-108,111-112,114,116-118,121,123]
crypto    up    infinite  37    idle   compute[6,15,19-20,30,32,35-37,40-42,45,49,51-53,56,59,62,69,71,75,79-80,83,132,137,139,146,148,152,161,163,167,171,180]
crypto    up    infinite  9     down   compute[128,145,159-160,164-166,168-169]
I'm rolling su92l back to the previous release upgrading node manager to version 0.1.20160412145128 and will report back on this ticket.