Bug #7026

[Node Manager] Mishandles stop signals

Added by Brett Smith over 8 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: Node Manager
Target version: -
Story points: -

Description

There are two issues here. They might have the same root cause, so for now, I'm filing one issue. If it turns out they don't, it's more important to address the primary bug, and we can split the secondary one off to be dealt with separately.

Primary bug

When Node Manager gets a stop signal like SIGINT, the daemon shuts down, along with all the monitors it's created. However, actors started in launcher.main (TimedCallbackActor, the list pollers, the node update actor) never stop, causing the Node Manager process to stay alive.

launcher.main() ends with this code:

    signal.pause()
    daemon_stopped = node_daemon.actor_ref.actor_stopped.is_set
    while not daemon_stopped():
        time.sleep(1)
    pykka.ActorRegistry.stop_all()

Logs show that the daemon actor stops in Pykka, and then many monitor actors shut down afterward. The daemon actor has no logic to shut down monitor actors, so this suggests we are reaching the stop_all() line: how else would the monitor actors be getting stopped? But other actors apparently aren't being stopped by this method call.
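One way to narrow this down might be to list which threads are still alive after the stop call returns. The sketch below uses only the standard library and assumes that each actor runs on its own non-daemon thread (an assumption, not something this report confirms). Worker, stop_all, and lingering_threads are hypothetical names for illustration, not Node Manager or Pykka APIs.

```python
import threading


class Worker:
    """Hypothetical stand-in for an actor: a non-daemon thread that runs until stopped."""

    def __init__(self, name):
        self._stop_event = threading.Event()
        # The thread simply blocks until stop() sets the event.
        self.thread = threading.Thread(target=self._stop_event.wait, name=name)
        self.thread.start()

    def stop(self, timeout=5):
        self._stop_event.set()
        self.thread.join(timeout)


def stop_all(registry):
    """Stop only the workers that appear in the given registry."""
    for worker in list(registry):
        worker.stop()


def lingering_threads():
    """Return names of live non-daemon threads other than the main and current threads.

    Any thread listed here would keep the interpreter from exiting.
    """
    skip = {threading.main_thread(), threading.current_thread()}
    return sorted(
        t.name
        for t in threading.enumerate()
        if t not in skip and t.is_alive() and not t.daemon
    )


if __name__ == "__main__":
    registered = [Worker("monitor-1"), Worker("monitor-2")]
    unregistered = Worker("poller")  # never added to the registry
    stop_all(registered)
    print(lingering_threads())  # the unregistered worker is still running
    unregistered.stop()  # clean up so this demo itself can exit
```

Logging something like lingering_threads() right after the stop_all() call in launcher.main() would show whether the pollers and the timer actor are still running at that point, or whether something else is holding the process open.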

Secondary bug

Node Manager is supposed to implement escalating shutdown processes when it gets a stop signal repeatedly, until it eventually forces an exit. See launcher.shutdown_signal(). However, subsequent signals seem to have no effect.
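For reference, the intended escalation could look roughly like the following. This is a minimal sketch of the general pattern, not the actual launcher.shutdown_signal() code; EscalatingShutdown, install, and the step callables are hypothetical names.

```python
import signal


class EscalatingShutdown:
    """Hypothetical handler that runs a stronger shutdown step on each repeated signal."""

    def __init__(self, steps):
        self._steps = list(steps)
        if not self._steps:
            raise ValueError("at least one shutdown step is required")
        self.count = 0  # number of signals received so far

    def __call__(self, signum, frame):
        # Once the steps are exhausted, keep repeating the last (strongest) one.
        index = min(self.count, len(self._steps) - 1)
        self.count += 1
        self._steps[index]()


def install(handler, signums=(signal.SIGINT, signal.SIGTERM)):
    """Register the handler for each signal; must be called from the main thread."""
    for signum in signums:
        signal.signal(signum, handler)
```

A caller would pass steps ordered from gentlest to harshest, for example a graceful daemon stop, then a stop of all actors, then a forced exit. Python runs signal handlers only in the main thread, so if later signals appear to have no effect, one thing worth checking is whether the main thread is blocked somewhere that keeps the handler from running.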


Related issues

Related to Arvados - Idea #8543: [NodeManager] Don't use Futures when not expecting a reply (Resolved, Peter Amstutz, 03/04/2016)