Project

General

Profile

Actions

Idea #8437

closed

[Node Manager] Actors define on_failure to terminate the process on exceptions that are difficult to recover

Added by Brett Smith about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Node Manager
Target version:
Start date:
02/16/2016
Due date:
Story points:
1.0

Description

on_failure gets called when an unhandled exception is raised in an actor method. All our actors should use this to detect when there's an exception that means the whole process is unlikely to recover, and kill it (probably by sending SIGKILL to the process group):

  • threading.ThreadError (can't create thread due to out of RAM)
  • OSError reporting that we can't allocate RAM (hopefully it has errno ENOMEM? Basically we want this to replace the strategy implemented in #6321.)
  • MemoryError

Define a new class that just defines this on_failure method; then have all of the other Node Manager actors use it as their superclass.


Subtasks 2 (0 open2 closed)

Task #8466: Review 8437-nodemanager-on-failureResolvedPeter Amstutz02/16/2016Actions
Task #8524: Add on_failure handlerResolvedPeter Amstutz02/16/2016Actions
Actions #1

Updated by Brett Smith about 8 years ago

  • Story points set to 1.0
Actions #2

Updated by Brett Smith about 8 years ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz about 8 years ago

  • Assigned To set to Peter Amstutz
Actions #4

Updated by Peter Amstutz about 8 years ago

on_failure is only called when there is no future associated with the message. As it turns out, all calls that use ActorProxy have an associated Future object, and all messaging between actors in node manager uses ActorProxy. This means unhandled exceptions are stored in a Future object to be returned to the caller. However, if the caller never calls get() on the Future object (because it never stored it), this means the exception is silently ignored.

These lingering future objects may also be creating circular references that is causing the memory leak.

Actions #5

Updated by Peter Amstutz about 8 years ago

  • Status changed from New to In Progress
Actions #6

Updated by Nico César about 8 years ago

reviewing 0a1c109684c62f0bc42e7dca30319fc8222dbef7

there are 3 cases where this exception is raise but the test only tests MemoryError

It will be cool to also have threading.ThreadError and OSError with exception_value.errno == errno.ENOMEM down the road this could catch other things

the rest LGTM

Actions #7

Updated by Peter Amstutz about 8 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:6ed351ec65d657c27b48b4e4ac0c89d880a2fd1a.

Actions

Also available in: Atom PDF