Bug #8932

closed

[Node manager] Always crash on_failure()

Added by Peter Amstutz about 8 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

Currently Node manager kills itself on certain types of actor failure:

    def on_failure(self, exception_type, exception_value, tb):
        lg = getattr(self, "_logger", logging)
        if (exception_type in (threading.ThreadError, MemoryError) or
            exception_type is OSError and exception_value.errno == errno.ENOMEM):
            lg.critical("Unhandled exception is a fatal error, killing Node Manager")
            os.killpg(os.getpgid(0), 9)

However, experience suggests that an unexpected, unhandled actor failure (which stops the actor) usually causes Node Manager to misbehave at best, or wedges it completely at worst. Especially now that #8799 is merged (so Node Manager can recover when a shutdown actor is interrupted), I propose that Node Manager kill itself on all unhandled exceptions.
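
For illustration only, here is a minimal sketch of what the proposed behavior might look like. It is not the actual Node Manager change: the `CrashOnFailureMixin` class and the `_kill_process_group` helper are hypothetical names introduced here so the kill step can be replaced in tests. The `on_failure` signature and the `os.killpg(os.getpgid(0), 9)` call are taken from the snippet above.

```python
import logging
import os
import signal


class CrashOnFailureMixin:
    """Hypothetical mixin: treat every unhandled actor exception as fatal."""

    def _kill_process_group(self):
        # Same effect as os.killpg(os.getpgid(0), 9) in the current code.
        # Split into its own method (an assumption of this sketch) so tests
        # can override it instead of killing the test runner.
        os.killpg(os.getpgid(0), signal.SIGKILL)

    def on_failure(self, exception_type, exception_value, tb):
        lg = getattr(self, "_logger", logging)
        # No exception-type filter any more: any unhandled failure is fatal.
        # Log the full traceback first, since nothing runs after SIGKILL.
        lg.critical(
            "Unhandled %s in %s is a fatal error, killing Node Manager",
            exception_type.__name__,
            type(self).__name__,
            exc_info=(exception_type, exception_value, tb),
        )
        self._kill_process_group()
```

Logging the traceback before the kill is one way to keep the original failure findable in the logs after the process has been restarted.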

Actions #1

Updated by Peter Amstutz about 8 years ago

  • Description updated

Actions #2

Updated by Brett Smith about 8 years ago

This is one of those places where there's a funny interaction between the code and ops.

Purely from a code perspective, I agree this is a good idea. Node Manager currently doesn't have good recovery strategies for its actors dying unexpectedly. Until it does, it makes sense for an unhandled exception to kill the whole process.

But with our current deployment strategy, killing the process just means that runit is going to restart it immediately, and it's less obvious to me we want that. Definitely there are scenarios where it will help cluster stability, and that's good. But it's also easy to imagine scenarios where a pure bug causes regular Node Manager restarts that make a bad situation worse.

Plus, while it definitely sucks when Node Manager is wedged this way, that suck keeps us honest about bug fixes. This change may make it harder to track down the original bug, because more time can pass between when it happens and when we notice it, and there will likely be more Node Manager activity in that intervening period.

I'd be interested in ops' take on this idea.

Actions #3

Updated by Peter Amstutz over 4 years ago

  • Status changed from New to Closed