Bug #7435

[Node manager] ShutdownActor dies when its paired MonitorActor goes away

Added by Peter Amstutz over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assigned To: Brett Smith
Category: Node Manager
Target version:
Start date: 10/02/2015
Due date:
% Done: 100%
Estimated time: (Total: 1.00 h)
Story points: 0.5

Description

2015-10-01_18:29:01.92223 2015-10-01 18:29:01 pykka[44299] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:e675d76d-15e2-4c6d-8617-253041b9c42f)
2015-10-01_18:29:01.92230 2015-10-01 18:29:01 pykka[44299] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:e675d76d-15e2-4c6d-8617-253041b9c42f)

2015-10-01_18:33:24.61056 2015-10-01 18:33:24 pykka[44299] DEBUG: Exception returned from ComputeNodeShutdownActor (urn:uuid:e675d76d-15e2-4c6d-8617-253041b9c42f) to caller:
2015-10-01_18:33:24.61060 Traceback (most recent call last):
2015-10-01_18:33:24.61061   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop
2015-10-01_18:33:24.61061     response = self._handle_receive(message)
2015-10-01_18:33:24.61062   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive
2015-10-01_18:33:24.61063     return callee(*message['args'], **message['kwargs'])
2015-10-01_18:33:24.61063   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 190, in stop_wrapper
2015-10-01_18:33:24.61064     (not self._monitor.shutdown_eligible().get())):
2015-10-01_18:33:24.61067   File "/usr/lib/python2.7/dist-packages/pykka/future.py", line 299, in get
2015-10-01_18:33:24.61067     exec('raise exc_info[0], exc_info[1], exc_info[2]')
2015-10-01_18:33:24.61068   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 470, in ask
2015-10-01_18:33:24.61069     self.tell(message)
2015-10-01_18:33:24.61069   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 437, in tell
2015-10-01_18:33:24.61070     raise _ActorDeadError('%s not found' % self)
2015-10-01_18:33:24.61071 ActorDeadError: ComputeNodeMonitorActor (urn:uuid:3c2daeb6-1ab2-4024-b121-5380d136b234) not found

Subtasks

Task #7445: Review 7435-node-manager-shutdown-cleanup-wip (Resolved, Peter Amstutz)


Related issues

Has duplicate Arvados - Bug #4622: [Node Manager] Stop ShutdownActors when the node disappears (Duplicate, 11/19/2014)

Associated revisions

Revision 71db9923
Added by Brett Smith over 4 years ago

Merge branch '7435-node-manager-shutdown-cleanup-wip'

Closes #7435, #7445.

History

#1 Updated by Peter Amstutz over 4 years ago

  • Description updated

#2 Updated by Brett Smith over 4 years ago

Actors "die" after they finish handling a normal stop signal, or if they raise an unhandled exception.

This backtrace seems to indicate that a shutdown actor couldn't talk to the monitor for its node. Based on the timestamps, it looks like we started the shutdown actor, and then the node vanished from the cloud (and hence we shut down its monitor) before the shutdown actor finished its work.

Probably the fix here is for the daemon actor to stop any associated shutdown actor when a node gets unlisted, just before it stops the monitor actor.
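
To make the proposed ordering concrete, here is a rough, synchronous toy model. It is not the Node Manager code and does not use Pykka; every class and method name in it other than shutdown_eligible is hypothetical, and ActorDead is only a stand-in for the ActorDeadError in the backtrace. It sketches one point: when a node gets unlisted, the daemon stops the shutdown actor before it stops the monitor.

```python
class ActorDead(Exception):
    """Hypothetical stand-in for an actor library's 'actor is dead' error."""


class ToyActor:
    """Minimal synchronous stand-in for an actor: alive until stopped."""

    def __init__(self):
        self.alive = True

    def stop(self):
        self.alive = False


class ToyMonitor(ToyActor):
    def shutdown_eligible(self):
        # Asking a stopped monitor fails, as in the backtrace above.
        if not self.alive:
            raise ActorDead("monitor not found")
        return True


class ToyShutdown(ToyActor):
    def __init__(self, monitor):
        super().__init__()
        self.monitor = monitor

    def poll(self):
        # A stopped shutdown actor does no further work.
        if not self.alive:
            return None
        return self.monitor.shutdown_eligible()


class ToyDaemon:
    def __init__(self):
        self.monitors = {}
        self.shutdowns = {}

    def node_unlisted(self, node_id):
        # Stop the shutdown actor first, so it can never query a dead monitor.
        shutdown = self.shutdowns.pop(node_id, None)
        if shutdown is not None:
            shutdown.stop()
        monitor = self.monitors.pop(node_id, None)
        if monitor is not None:
            monitor.stop()


daemon = ToyDaemon()
monitor = ToyMonitor()
shutdown = ToyShutdown(monitor)
daemon.monitors["node1"] = monitor
daemon.shutdowns["node1"] = shutdown

daemon.node_unlisted("node1")
result = shutdown.poll()  # no ActorDead: the shutdown actor was stopped first
```

Stopping the monitor first and then polling the still-running shutdown actor would raise ActorDead in this model, which mirrors the failure reported in the description.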

#3 Updated by Brett Smith over 4 years ago

  • Target version set to Arvados Future Sprints

#4 Updated by Brett Smith over 4 years ago

  • Subject changed from [Node manager] ActorDeadError to [Node manager] ShutdownActor dies when its paired MonitorActor goes away
  • Category set to Node Manager

#5 Updated by Brett Smith over 4 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith

#6 Updated by Brett Smith over 4 years ago

  • Target version changed from Arvados Future Sprints to 2015-10-14 sprint
  • Story points set to 0.5

Brett Smith wrote:

> Probably the fix here is for the daemon actor to stop any associated shutdown actor when a node gets unlisted, just before it stops the monitor actor.

7435-node-manager-shutdown-cleanup-wip implements this fix and is up for review.

#7 Updated by Peter Amstutz over 4 years ago

I did a little research on Pykka actors; it looks like stop() is blocking, so the shutdown actor should be completely stopped before the monitor actor stops, so it avoids the bug. Looks good to me.

#8 Updated by Brett Smith over 4 years ago

Peter Amstutz wrote:

> I did a little research on Pykka actors; it looks like stop() is blocking, so the shutdown actor should be completely stopped before the monitor actor stops, so it avoids the bug. Looks good to me.

The stop method of ActorRef blocks, but the stop method of ActorProxy acts like any other proxy method: it returns a Future, and you have to block on that if that's what you want.

I thought about this when I was writing the code, and I thought Pykka's ordering guarantees for delivering messages would prevent this from coming up. But (a) I can't find the cite for that, so I shouldn't be trusted, and (b) I'm pretty sure I'm misremembering it anyway: it guarantees ordered delivery when two actors are talking to a third, rather than when one actor is talking to two others.

053de78 blocks, adds a test, and rebases on top of current master. Please take a look. Thanks.
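
For readers unfamiliar with the distinction, the difference between requesting a stop and waiting for it can be sketched with a toy actor built only on the Python standard library. This is not Pykka and not the code in 053de78; ToyActor and its methods are hypothetical, and the standard library's Future uses result() where Pykka futures use get(). The sketch only shows that a stop call returning a future does not mean the actor has stopped yet.

```python
import queue
import threading
from concurrent.futures import Future


class ToyActor:
    """Hypothetical single-threaded actor: handles inbox messages in order."""

    _STOP = object()  # sentinel message meaning "stop the actor"

    def __init__(self):
        self._inbox = queue.Queue()
        self._stopped = threading.Event()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            message, future = self._inbox.get()
            if message is self._STOP:
                self._stopped.set()
                future.set_result(None)
                return
            try:
                future.set_result(message())
            except Exception as error:
                future.set_exception(error)

    def is_alive(self):
        return not self._stopped.is_set()

    def ask(self, func):
        """Queue func to run on the actor's thread and return a Future."""
        future = Future()
        self._inbox.put((func, future))
        return future

    def stop(self):
        """Non-blocking stop: queue the request and return a Future."""
        return self.ask(self._STOP)


gate = threading.Event()
actor = ToyActor()
actor.ask(gate.wait)  # keep the actor busy until the gate opens

pending_stop = actor.stop()  # returns immediately
alive_after_request = actor.is_alive()  # stop is still queued behind gate.wait

gate.set()
pending_stop.result(timeout=5)  # block until the stop has been handled
alive_after_wait = actor.is_alive()
```

In this model the actor is still alive right after stop() returns, and is only guaranteed to be stopped once the caller has waited on the returned future.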

#9 Updated by Peter Amstutz over 4 years ago

Brett Smith wrote:

> Peter Amstutz wrote:
>
> > I did a little research on Pykka actors; it looks like stop() is blocking, so the shutdown actor should be completely stopped before the monitor actor stops, so it avoids the bug. Looks good to me.
>
> The stop method of ActorRef blocks, but the stop method of ActorProxy acts like any other proxy method: it returns a Future, and you have to block on that if that's what you want.

Okay, good, I'm glad you gave this one another look. I found this a bit confusing: Actor.stop() defaults to block=False and ActorRef.stop() defaults to block=True, so I assumed ActorProxy.stop() delegated to ActorRef, but it sounds like that's not the case.

> I thought about this when I was writing the code, and I thought Pykka's ordering guarantees for delivering messages would prevent this from coming up. But (a) I can't find the cite for that, so I shouldn't be trusted, and (b) I'm pretty sure I'm misremembering it anyway: it guarantees ordered delivery when two actors are talking to a third, rather than when one actor is talking to two others.
>
> 053de78 blocks, adds a test, and rebases on top of current master. Please take a look. Thanks.

This one looks good to me.

#10 Updated by Brett Smith over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:71db992331f357fdb3a4fdbca42a9952b7e9ae2c.
