Bug #7435
closed
[Node manager] ShutdownActor dies when its paired MonitorActor goes away
Added by Peter Amstutz about 9 years ago.
Updated about 9 years ago.
Description
2015-10-01_18:29:01.92223 2015-10-01 18:29:01 pykka[44299] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:e675d76d-15e2-4c6d-8617-253041b9c42f)
2015-10-01_18:29:01.92230 2015-10-01 18:29:01 pykka[44299] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:e675d76d-15e2-4c6d-8617-253041b9c42f)
2015-10-01_18:33:24.61056 2015-10-01 18:33:24 pykka[44299] DEBUG: Exception returned from ComputeNodeShutdownActor (urn:uuid:e675d76d-15e2-4c6d-8617-253041b9c42f) to caller:
2015-10-01_18:33:24.61060 Traceback (most recent call last):
2015-10-01_18:33:24.61061 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop
2015-10-01_18:33:24.61061 response = self._handle_receive(message)
2015-10-01_18:33:24.61062 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive
2015-10-01_18:33:24.61063 return callee(*message['args'], **message['kwargs'])
2015-10-01_18:33:24.61063 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 190, in stop_wrapper
2015-10-01_18:33:24.61064 (not self._monitor.shutdown_eligible().get())):
2015-10-01_18:33:24.61067 File "/usr/lib/python2.7/dist-packages/pykka/future.py", line 299, in get
2015-10-01_18:33:24.61067 exec('raise exc_info[0], exc_info[1], exc_info[2]')
2015-10-01_18:33:24.61068 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 470, in ask
2015-10-01_18:33:24.61069 self.tell(message)
2015-10-01_18:33:24.61069 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 437, in tell
2015-10-01_18:33:24.61070 raise _ActorDeadError('%s not found' % self)
2015-10-01_18:33:24.61071 ActorDeadError: ComputeNodeMonitorActor (urn:uuid:3c2daeb6-1ab2-4024-b121-5380d136b234) not found
- Description updated
Actors "die" after they finish handling a normal stop signal, or if they raise an unhandled exception.
This backtrace seems to indicate that a shutdown actor couldn't talk to the monitor for its node. Based on the timestamps, it looks like we started the shutdown actor, and then the node vanished from the cloud (and hence we shut down its monitor) before the shutdown actor finished its work.
Probably the fix here is for the daemon actor to stop any associated shutdown actor when a node gets unlisted, just before it stops the monitor actor.
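The ordering proposed here can be sketched with plain Python objects. This is a minimal model, not Node Manager or Pykka code: MonitorModel, ShutdownModel, and unlist_node are hypothetical names, and ActorDeadError is a local stand-in for the Pykka exception seen in the traceback. It only shows why stopping the shutdown actor first keeps it from querying a dead monitor.

```python
class ActorDeadError(Exception):
    """Local stand-in for the Pykka exception in the traceback."""


class MonitorModel:
    """Hypothetical model of a per-node monitor actor."""

    def __init__(self):
        self.alive = True

    def shutdown_eligible(self):
        # A stopped monitor can no longer answer questions.
        if not self.alive:
            raise ActorDeadError("MonitorModel not found")
        return True

    def stop(self):
        self.alive = False


class ShutdownModel:
    """Hypothetical model of a shutdown actor paired with one monitor."""

    def __init__(self, monitor):
        self.monitor = monitor
        self.alive = True

    def step(self):
        # Once stopped, do nothing and never touch the monitor again.
        if not self.alive:
            return "stopped"
        # Mirrors the check in the traceback: ask the monitor before working.
        if not self.monitor.shutdown_eligible():
            return "cancelled"
        return "working"

    def stop(self):
        self.alive = False


def unlist_node(monitor, shutdown=None):
    """Tear down a vanished node: shutdown actor first, then its monitor."""
    if shutdown is not None:
        shutdown.stop()
    monitor.stop()


if __name__ == "__main__":
    monitor = MonitorModel()
    shutdown = ShutdownModel(monitor)
    unlist_node(monitor, shutdown)
    print(shutdown.step())
```

Calling monitor.stop() on its own while the shutdown model is still live reproduces the failure: the next step() raises ActorDeadError.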
- Target version set to Arvados Future Sprints
- Subject changed from [Node manager] ActorDeadError to [Node manager] ShutdownActor dies when its paired MonitorActor goes away
- Category set to Node Manager
- Status changed from New to In Progress
- Assigned To set to Brett Smith
- Target version changed from Arvados Future Sprints to 2015-10-14 sprint
- Story points set to 0.5
Brett Smith wrote:
Probably the fix here is for the daemon actor to stop any associated shutdown actor when a node gets unlisted, just before it stops the monitor actor.
7435-node-manager-shutdown-cleanup-wip implements this fix and is up for review.
I did a little research on Pykka actors; it looks like stop() is blocking, so the shutdown actor should be completely stopped before the monitor actor stops, so it avoids the bug. Looks good to me.
Peter Amstutz wrote:
I did a little research on Pykka actors; it looks like stop() is blocking, so the shutdown actor should be completely stopped before the monitor actor stops, so it avoids the bug. Looks good to me.
The stop method of ActorRef blocks, but the stop method of ActorProxy acts like any other proxy method: it returns a Future, and you have to block on that if that's what you want.
I thought about this when I was writing the code, and I thought Pykka's ordering guarantees for delivering messages would prevent this from coming up. But (a) I can't find the cite for that, so I shouldn't be trusted, and (b) I'm pretty sure I'm misremembering it anyway, and it guarantees ordered delivery when two actors are talking to a third, rather than one actor talking to two others.
053de78 blocks, adds a test, and rebases on top of current master. Please take a look. Thanks.
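The difference between sending a stop request and waiting for it to finish can be illustrated with the standard library. This is a toy model, not Pykka's implementation: MailboxModel and its methods are hypothetical, and concurrent.futures uses result() where Pykka futures use get().

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MailboxModel:
    """Toy actor: a single worker thread handles requests in order."""

    def __init__(self):
        self._mailbox = ThreadPoolExecutor(max_workers=1)
        self.stopped = threading.Event()

    def stop_async(self, gate=None):
        """Proxy-style stop: enqueue the request and return a Future at once."""
        def handle_stop():
            if gate is not None:
                gate.wait()  # hold the handler so the demo is deterministic
            self.stopped.set()
        return self._mailbox.submit(handle_stop)

    def stop_blocking(self):
        """Ref-style stop: return only after the stop handler has run."""
        self.stop_async().result()

    def close(self):
        """Release the worker thread used by this model."""
        self._mailbox.shutdown(wait=True)


if __name__ == "__main__":
    gate = threading.Event()
    actor = MailboxModel()
    future = actor.stop_async(gate)
    # The call has returned, but the actor has not stopped yet.
    print("stopped right after the call:", actor.stopped.is_set())
    gate.set()
    future.result()  # block until the stop has actually been handled
    print("stopped after waiting:", actor.stopped.is_set())
    actor.close()
```

In this model, anything the caller does between stop_async() and result() can still race with the actor, which is the gap that blocking on the future closes.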
Brett Smith wrote:
Peter Amstutz wrote:
I did a little research on Pykka actors; it looks like stop() is blocking, so the shutdown actor should be completely stopped before the monitor actor stops, so it avoids the bug. Looks good to me.
The stop method of ActorRef blocks, but the stop method of ActorProxy acts like any other proxy method: it returns a Future, and you have to block on that if that's what you want.
Okay, good, I'm glad you gave this one another look. I found this a bit confusing, because Actor.stop() defaults to block=False while ActorRef.stop() defaults to block=True, so I assumed ActorProxy.stop() delegated to ActorRef, but it sounds like that's not the case.
I thought about this when I was writing the code, and I thought Pykka's ordering guarantees for delivering messages would prevent this from coming up. But (a) I can't find the cite for that, so I shouldn't be trusted, and (b) I'm pretty sure I'm misremembering it anyway, and it guarantees ordered delivery when two actors are talking to a third, rather than one actor talking to two others.
053de78 blocks, adds a test, and rebases on top of current master. Please take a look. Thanks.
This one looks good to me.
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:71db992331f357fdb3a4fdbca42a9952b7e9ae2c.