Story #8437

[Node Manager] Actors define on_failure to terminate the process on exceptions that are difficult to recover

Added by Brett Smith almost 6 years ago. Updated over 5 years ago.

Assigned To:
Node Manager
Target version:
Start date:
Due date:
% Done:


Estimated time:
(Total: 0.00 h)
Story points:


on_failure gets called when an unhandled exception is raised in an actor method. All our actors should use this to detect when there's an exception that means the whole process is unlikely to recover, and kill it (probably by sending SIGKILL to the process group):

  • threading.ThreadError (can't create thread due to out of RAM)
  • OSError reporting that we can't allocate RAM (hopefully it has errno ENOMEM? Basically we want this to replace the strategy implemented in #6321.)
  • MemoryError

Define a new class that just defines this on_failure method; then have all of the other Node Manager actors use it as their superclass.


Task #8466: Review 8437-nodemanager-on-failureResolvedPeter Amstutz

Task #8524: Add on_failure handlerResolvedPeter Amstutz

Associated revisions

Revision 6ed351ec
Added by Peter Amstutz over 5 years ago

Merge branch '8437-nodemanager-on-failure' closes #8437


#1 Updated by Brett Smith almost 6 years ago

  • Story points set to 1.0

#2 Updated by Brett Smith almost 6 years ago

  • Description updated (diff)

#3 Updated by Peter Amstutz almost 6 years ago

  • Assigned To set to Peter Amstutz

#4 Updated by Peter Amstutz almost 6 years ago

on_failure is only called when there is no future associated with the message. As it turns out, all calls that use ActorProxy have an associated Future object, and all messaging between actors in node manager uses ActorProxy. This means unhandled exceptions are stored in a Future object to be returned to the caller. However, if the caller never calls get() on the Future object (because it never stored it), this means the exception is silently ignored.

These lingering future objects may also be creating circular references that is causing the memory leak.

#5 Updated by Peter Amstutz almost 6 years ago

  • Status changed from New to In Progress

#6 Updated by Nico C├ęsar almost 6 years ago

reviewing 0a1c109684c62f0bc42e7dca30319fc8222dbef7

there are 3 cases where this exception is raise but the test only tests MemoryError

It will be cool to also have threading.ThreadError and OSError with exception_value.errno == errno.ENOMEM down the road this could catch other things

the rest LGTM

#7 Updated by Peter Amstutz over 5 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:6ed351ec65d657c27b48b4e4ac0c89d880a2fd1a.

Also available in: Atom PDF