Story #8543

[NodeManager] Don't use Futures when not expecting a reply

Added by Peter Amstutz over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Node Manager
Target version:
Start date:
03/04/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

Quoting #8437

on_failure is only called when there is no future associated with the message. As it turns out, all calls that use ActorProxy have an associated Future object, and all messaging between actors in node manager uses ActorProxy. This means unhandled exceptions are stored in a Future object to be returned to the caller. However, if the caller never calls get() on the Future object (because it never stored it), this means the exception is silently ignored.

These lingering future objects may also be creating circular references that is causing the memory leak.

Anywhere a message is sent between Actors where a response is not required, such as subscriptions, use tell() on the ActorRef object instead of using ActorProxy (which uses ask()). (May want to write a helper similar to ActorProxy, such as TellActorProxy).

Anticipated benefits:

  • Unhandled exceptions will be logged, and the on_failure method of the Actor be called as intended
  • Possibly uncover previously unknown bugs (previously hidden due to exceptions being lost)
  • May mitigate longstanding memory leak bug by avoiding creating Future objects that are not needed (and which may be actively harmful)

Subtasks

Task #8604: Use TellActorProxy when future is not going to be usedResolvedPeter Amstutz

Task #8602: Review 8543-nodemanager-fewer-futuresResolvedPeter Amstutz

Task #8605: Fix testsResolvedPeter Amstutz

Task #8601: Implement TellActorProxy ResolvedPeter Amstutz


Related issues

Related to Arvados - Bug #8541: [NodeManager] Use sys.exc_clear() to release exception tracebacksNew

Related to Arvados - Bug #7026: [Node Manager] Mishandles stop signalsNew08/19/2015

Associated revisions

Revision 2dbbaaef
Added by Peter Amstutz over 3 years ago

Merge branch '8543-nodemanager-fewer-futures' closes #8543

Revision d1de3281 (diff)
Added by Peter Amstutz over 3 years ago

Fix node manager to send ActorProxy instead of TellActorProxy to
ComputeNodeStateChangeBase subscribers. refs #8543

History

#1 Updated by Peter Amstutz over 3 years ago

  • Description updated (diff)

#2 Updated by Peter Amstutz over 3 years ago

  • Description updated (diff)
  • Category set to Node Manager

#3 Updated by Brett Smith over 3 years ago

  • Target version set to Arvados Future Sprints

#4 Updated by Brett Smith over 3 years ago

  • Target version changed from Arvados Future Sprints to 2016-03-16 sprint

#5 Updated by Brett Smith over 3 years ago

  • Story points set to 1.0

#6 Updated by Peter Amstutz over 3 years ago

  • Assigned To set to Peter Amstutz

#7 Updated by Tom Clegg over 3 years ago

Looking at 8543-nodemanager-fewer-futures @ bf49d05...

Could you squash this branch and synthesize all the commit messages? I'd like to have less noise in the git history, like commit logs that just say "tests pass", and these changes that got made and then reverted...
  • -        self.subscribers = None
    +        self.subscribers.clear()
    
  •                  del self.shutdowns[key]
    +            if key in self.sizes_booting_shutdown:
                     del self.sizes_booting_shutdown[key]
    

This error message doesn't scan. I assume the "has no attribute" part is a copy/paste error?

            raise AttributeError('%s not a callable on %s has no attribute' % (name, self))

BaseNodeMangerActor should probably be BaseNodeManagerActor

I think it would be worth providing a little more documentation beyond "Like ActorProxy, except only permits asynchronous stuff". Even coming fresh from this story description I find the workings of the "Tell" stuff a bit mysterious. Perhaps it would help to stick a docstring on tell_proxy() saying when one should use that instead of proxy()?

This is a complaint about existing code, not the new code, but I can't help thinking it should be easy/worthwhile to log exception_type and/or exception_value here...
  •             lg.critical("Unhandled exception is a fatal error, killing Node Manager")
    

#8 Updated by Peter Amstutz over 3 years ago

Tom Clegg wrote:

Looking at 8543-nodemanager-fewer-futures @ bf49d05...

Could you squash this branch and synthesize all the commit messages? I'd like to have less noise in the git history, like commit logs that just say "tests pass", and these changes that got made and then reverted...
  • [...]
  • [...]

Rebased and squashed.

This error message doesn't scan. I assume the "has no attribute" part is a copy/paste error?

[...]

Yes I think so.

BaseNodeMangerActor should probably be BaseNodeManagerActor

Thanks for catching that.

I think it would be worth providing a little more documentation beyond "Like ActorProxy, except only permits asynchronous stuff". Even coming fresh from this story description I find the workings of the "Tell" stuff a bit mysterious. Perhaps it would help to stick a docstring on tell_proxy() saying when one should use that instead of proxy()?

I recall now being interrupted in the middle of starting to write that comment and obviously I forgot to come back to it. Expanded docstrings for those clasess.

This is a complaint about existing code, not the new code, but I can't help thinking it should be easy/worthwhile to log exception_type and/or exception_value here...
  • [...]

The exception already gets logged by pykka. It doesn't show up in the tests because the exception gets logged as "ERROR" but the tests by default only log "CRITICAL". Try ANMTEST_LOGLEVEL=ERROR python setup.py test --test-suite=tests.test_failure to see for yourself. In production we currently run at log level INFO or DEBUG.

#9 Updated by Tom Clegg over 3 years ago

LGTM @ 054f950, thanks

#10 Updated by Peter Amstutz over 3 years ago

  • Status changed from New to Resolved

Applied in changeset arvados|commit:2dbbaaefc6a4a46a7f17b9e7799fc455cd722113.

Also available in: Atom PDF