Project

General

Profile

Actions

Idea #8543

closed

[NodeManager] Don't use Futures when not expecting a reply

Added by Peter Amstutz about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Node Manager
Target version:
Start date:
03/04/2016
Due date:
Story points:
1.0

Description

Quoting #8437

on_failure is only called when there is no future associated with the message. As it turns out, all calls that use ActorProxy have an associated Future object, and all messaging between actors in node manager uses ActorProxy. This means unhandled exceptions are stored in a Future object to be returned to the caller. However, if the caller never calls get() on the Future object (because it never stored it), this means the exception is silently ignored.

These lingering future objects may also be creating circular references that is causing the memory leak.

Anywhere a message is sent between Actors where a response is not required, such as subscriptions, use tell() on the ActorRef object instead of using ActorProxy (which uses ask()). (May want to write a helper similar to ActorProxy, such as TellActorProxy).

Anticipated benefits:

  • Unhandled exceptions will be logged, and the on_failure method of the Actor be called as intended
  • Possibly uncover previously unknown bugs (previously hidden due to exceptions being lost)
  • May mitigate longstanding memory leak bug by avoiding creating Future objects that are not needed (and which may be actively harmful)

Subtasks 4 (0 open4 closed)

Task #8604: Use TellActorProxy when future is not going to be usedResolvedPeter Amstutz03/04/2016Actions
Task #8602: Review 8543-nodemanager-fewer-futuresResolvedPeter Amstutz03/07/2016Actions
Task #8605: Fix testsResolvedPeter Amstutz03/04/2016Actions
Task #8601: Implement TellActorProxy ResolvedPeter Amstutz03/04/2016Actions

Related issues

Related to Arvados - Bug #8541: [NodeManager] Use sys.exc_clear() to release exception tracebacksClosedActions
Related to Arvados - Bug #7026: [Node Manager] Mishandles stop signalsClosed08/19/2015Actions
Actions #1

Updated by Peter Amstutz about 8 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz about 8 years ago

  • Description updated (diff)
  • Category set to Node Manager
Actions #3

Updated by Brett Smith about 8 years ago

  • Target version set to Arvados Future Sprints
Actions #4

Updated by Brett Smith about 8 years ago

  • Target version changed from Arvados Future Sprints to 2016-03-16 sprint
Actions #5

Updated by Brett Smith about 8 years ago

  • Story points set to 1.0
Actions #6

Updated by Peter Amstutz about 8 years ago

  • Assigned To set to Peter Amstutz
Actions #7

Updated by Tom Clegg about 8 years ago

Looking at 8543-nodemanager-fewer-futures @ bf49d05...

Could you squash this branch and synthesize all the commit messages? I'd like to have less noise in the git history, like commit logs that just say "tests pass", and these changes that got made and then reverted...
  • -        self.subscribers = None
    +        self.subscribers.clear()
    
  •                  del self.shutdowns[key]
    +            if key in self.sizes_booting_shutdown:
                     del self.sizes_booting_shutdown[key]
    

This error message doesn't scan. I assume the "has no attribute" part is a copy/paste error?

            raise AttributeError('%s not a callable on %s has no attribute' % (name, self))

BaseNodeMangerActor should probably be BaseNodeManagerActor

I think it would be worth providing a little more documentation beyond "Like ActorProxy, except only permits asynchronous stuff". Even coming fresh from this story description I find the workings of the "Tell" stuff a bit mysterious. Perhaps it would help to stick a docstring on tell_proxy() saying when one should use that instead of proxy()?

This is a complaint about existing code, not the new code, but I can't help thinking it should be easy/worthwhile to log exception_type and/or exception_value here...
  •             lg.critical("Unhandled exception is a fatal error, killing Node Manager")
    
Actions #8

Updated by Peter Amstutz about 8 years ago

Tom Clegg wrote:

Looking at 8543-nodemanager-fewer-futures @ bf49d05...

Could you squash this branch and synthesize all the commit messages? I'd like to have less noise in the git history, like commit logs that just say "tests pass", and these changes that got made and then reverted...
  • [...]
  • [...]

Rebased and squashed.

This error message doesn't scan. I assume the "has no attribute" part is a copy/paste error?

[...]

Yes I think so.

BaseNodeMangerActor should probably be BaseNodeManagerActor

Thanks for catching that.

I think it would be worth providing a little more documentation beyond "Like ActorProxy, except only permits asynchronous stuff". Even coming fresh from this story description I find the workings of the "Tell" stuff a bit mysterious. Perhaps it would help to stick a docstring on tell_proxy() saying when one should use that instead of proxy()?

I recall now being interrupted in the middle of starting to write that comment and obviously I forgot to come back to it. Expanded docstrings for those clasess.

This is a complaint about existing code, not the new code, but I can't help thinking it should be easy/worthwhile to log exception_type and/or exception_value here...
  • [...]

The exception already gets logged by pykka. It doesn't show up in the tests because the exception gets logged as "ERROR" but the tests by default only log "CRITICAL". Try ANMTEST_LOGLEVEL=ERROR python setup.py test --test-suite=tests.test_failure to see for yourself. In production we currently run at log level INFO or DEBUG.

Actions #9

Updated by Tom Clegg about 8 years ago

LGTM @ 054f950, thanks

Actions #10

Updated by Peter Amstutz about 8 years ago

  • Status changed from New to Resolved

Applied in changeset arvados|commit:2dbbaaefc6a4a46a7f17b9e7799fc455cd722113.

Actions

Also available in: Atom PDF