Bug #8932
Updated by Peter Amstutz almost 9 years ago
Currently Node Manager kills itself only on certain types of actor failure:
<pre>
def on_failure(self, exception_type, exception_value, tb):
    lg = getattr(self, "_logger", logging)
    if (exception_type in (threading.ThreadError, MemoryError) or
            exception_type is OSError and exception_value.errno == errno.ENOMEM):
        lg.critical("Unhandled exception is a fatal error, killing Node Manager")
        os.killpg(os.getpgid(0), 9)
</pre>
However, experience suggests that an unexpected or unhandled actor failure (which stops the actor) usually causes Node Manager to misbehave at best, or wedges it completely at worst. Especially now that #8799 is merged (so Node Manager can recover when a shutdown actor is interrupted), I propose that Node Manager kill itself on all unhandled exceptions.
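
A rough sketch of what the proposed behavior might look like, reusing the @on_failure@ signature from the snippet above. The class name @FatalOnFailureMixin@, the helper @kill_process_group@, and the overridable @_kill@ hook are hypothetical names introduced here for illustration only; they are not taken from the Node Manager source. The idea is simply to drop the exception-type check so that every unhandled exception is logged and then treated as fatal.

```python
import logging
import os
import signal


def kill_process_group():
    # Hypothetical helper: SIGKILL the caller's whole process group,
    # equivalent to os.killpg(os.getpgid(0), 9) in the current handler.
    os.killpg(os.getpgid(0), signal.SIGKILL)


class FatalOnFailureMixin(object):
    """Hypothetical mixin: treat any unhandled actor exception as fatal."""

    # Overridable hook (an assumption of this sketch) so the handler can be
    # exercised without actually killing the process.
    _kill = staticmethod(kill_process_group)

    def on_failure(self, exception_type, exception_value, tb):
        lg = getattr(self, "_logger", logging)
        # No exception-type filter: log the traceback, then kill unconditionally.
        lg.critical(
            "Unhandled exception is a fatal error, killing Node Manager",
            exc_info=(exception_type, exception_value, tb))
        self._kill()
```

Logging the traceback before the kill matters more here than in the current handler, since the set of exceptions reaching this path would no longer be limited to a few well-understood resource failures.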