Bug #7026

[Node Manager] Mishandles stop signals

Added by Brett Smith almost 4 years ago. Updated over 3 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Node Manager
Target version:
Start date:
08/19/2015
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

There are two issues here. They might have the same root cause, so for now, I'm filing one issue. If it turns out they don't, it's more important to address the primary bug, and we can split the secondary one off to be dealt with separately.

Primary bug

When Node Manager gets a stop signal like SIGINT, the daemon shuts down, along with all the monitors it's created. However, actors started in launcher.main (TimedCallbackActor, the list pollers, the node update actor) never stop, causing the Node Manager process to stay alive.

launcher.main() ends with this code:

    signal.pause()
    daemon_stopped = node_daemon.actor_ref.actor_stopped.is_set
    while not daemon_stopped():
        time.sleep(1)
    pykka.ActorRegistry.stop_all()

Logs show that the daemon actor stops in Pykka, then many monitor actors shut down after. The daemon actor has no logic to shut down monitor actors. All this makes it look like we're getting to the stop_all() line. Otherwise, how else would the monitor actors be getting stopped? But other actors apparently aren't being stopped by this method call.

Secondary bug

Node Manager is supposed to implement escalating shutdown processes when it gets a stop signal repeatedly, until it eventually forces an exit. See launcher.shutdown_signal(). However, subsequent signals seem to have no effect.


Related issues

Related to Arvados - Story #8543: [NodeManager] Don't use Futures when not expecting a replyResolved03/04/2016

History

#1 Updated by Brett Smith almost 4 years ago

One possibility: the docs for stop_all say actors are stopped in LIFO order. If that's true, there are two reasons to believe ComputeNodeUpdateActor is the culprit: it sleeps, and it's the last actor started in launcher.main(). If it has a large queue of sync_node requests that all start by sleeping, it could be some time before it finally gets and processes its stop message.

The logs we have are consistent with this theory. After the daemon shuts down, the logs show exceptions like #6225 being raised by the update actor, every 3 minutes (the default maximum wait time between requests in the actor).

If this is the cause, we might need to re-think the implementation of ComputeNodeUpdateActor. We could potentially fix this issue, plus things like #6225 in one fell swoop…

#2 Updated by Nico César over 3 years ago

I don't know if this is the bug. But I guess it is.

here is some actions and symtoms

it doesn't want to die:

manage.qr2hi:/etc/sv# sv stop arvados-node-manager/
timeout: run: arvados-node-manager/: (pid 17243) 240431s, want down, got TERM

in the logs

2016-01-14_11:48:05.60399 2016-01-14 11:48:05 arvnodeman.cloud_nodes[17243] WARNING: CloudNodeListMonitorActor (at 42728272) got error: The read operation timed out - waiting 20 seconds

and node-manager continues spitting logs as usual, but (partially) running- I did a kill -9 and everything is back to normal.

Also available in: Atom PDF