Bug #7026
[Node Manager] Mishandles stop signals (closed)
Description
There are two issues here. They might have the same root cause, so for now, I'm filing one issue. If it turns out they don't, it's more important to address the primary bug, and we can split the secondary one off to be dealt with separately.
Primary bug
When Node Manager gets a stop signal like SIGINT, the daemon shuts down, along with all the monitors it's created. However, actors started in launcher.main (TimedCallbackActor, the list pollers, the node update actor) never stop, causing the Node Manager process to stay alive.
launcher.main() ends with this code:
    signal.pause()
    daemon_stopped = node_daemon.actor_ref.actor_stopped.is_set
    while not daemon_stopped():
        time.sleep(1)
    pykka.ActorRegistry.stop_all()
Logs show that the daemon actor stops in Pykka, and that many monitor actors shut down afterward. The daemon actor has no logic to shut down monitor actors, so it looks like execution reaches the stop_all() line: how else would the monitor actors be getting stopped? Yet other actors apparently aren't being stopped by this method call.
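One way to confirm which actors survive is to list the threads that are still keeping the process alive after the stop_all() call. The sketch below is a standard-library-only illustration, not Node Manager code. It assumes the leftover actors run on ordinary non-daemon threads. The helper name live_non_daemon_threads and the thread name "leftover-actor" are hypothetical.

```python
import threading


def live_non_daemon_threads():
    """Return the names of live non-daemon threads other than the main thread.

    Any thread listed here prevents the interpreter from exiting.
    """
    main = threading.main_thread()
    return sorted(
        t.name
        for t in threading.enumerate()
        if t is not main and t.is_alive() and not t.daemon
    )


# Demonstration: a stand-in for an actor thread that was never stopped.
stop_requested = threading.Event()
worker = threading.Thread(target=stop_requested.wait, name="leftover-actor")
worker.start()

before = live_non_daemon_threads()  # includes the stand-in thread

# Stopping the thread removes it from the list.
stop_requested.set()
worker.join()
after = live_non_daemon_threads()
```

Logging the result of such a helper right after the stop_all() line would show whether the timer, poller, and update actors are the threads holding the process open.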
Secondary bug
Node Manager is supposed to escalate its shutdown process each time it receives another stop signal, until it eventually forces an exit. See launcher.shutdown_signal(). However, signals after the first seem to have no effect.
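For reference, the intended escalation could look roughly like the sketch below. It is not the actual body of launcher.shutdown_signal(). The class EscalatingShutdown, the helper install, and the three placeholder actions are hypothetical names used only to illustrate the expected behavior: each repeated signal triggers the next, harsher action, and the last action repeats once the list is exhausted.

```python
import signal


class EscalatingShutdown:
    """Signal handler that runs a harsher action for each repeated signal."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.count = 0

    def __call__(self, signum, frame):
        # Clamp to the final action so extra signals keep forcing an exit.
        index = min(self.count, len(self.actions) - 1)
        self.count += 1
        self.actions[index]()


def install(handler, signums=(signal.SIGINT, signal.SIGTERM)):
    """Register the handler; this must be called from the main thread."""
    for signum in signums:
        signal.signal(signum, handler)


# Demonstration with placeholder actions that only record what would happen.
calls = []
handler = EscalatingShutdown([
    lambda: calls.append("graceful"),    # e.g. ask the daemon actor to stop
    lambda: calls.append("stop_all"),    # e.g. stop every registered actor
    lambda: calls.append("force_exit"),  # e.g. os._exit() as a last resort
])

# Invoke the handler directly, as if four signals had been delivered.
for _ in range(4):
    handler(signal.SIGINT, None)
```

In this sketch the handler stays installed after it runs, so every later signal reaches it again. Checking whether the real handler is still registered after the first signal, and whether it is ever invoked a second time, would be a reasonable first diagnostic step.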