Bug #7435 (Closed)
[Node manager] ShutdownActor dies when its paired MonitorActor goes away
Description
2015-10-01_18:29:01.92223 2015-10-01 18:29:01 pykka[44299] DEBUG: Registered ComputeNodeShutdownActor (urn:uuid:e675d76d-15e2-4c6d-8617-253041b9c42f)
2015-10-01_18:29:01.92230 2015-10-01 18:29:01 pykka[44299] DEBUG: Starting ComputeNodeShutdownActor (urn:uuid:e675d76d-15e2-4c6d-8617-253041b9c42f)
2015-10-01_18:33:24.61056 2015-10-01 18:33:24 pykka[44299] DEBUG: Exception returned from ComputeNodeShutdownActor (urn:uuid:e675d76d-15e2-4c6d-8617-253041b9c42f) to caller:
2015-10-01_18:33:24.61060 Traceback (most recent call last):
2015-10-01_18:33:24.61061   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop
2015-10-01_18:33:24.61061     response = self._handle_receive(message)
2015-10-01_18:33:24.61062   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive
2015-10-01_18:33:24.61063     return callee(*message['args'], **message['kwargs'])
2015-10-01_18:33:24.61063   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 190, in stop_wrapper
2015-10-01_18:33:24.61064     (not self._monitor.shutdown_eligible().get())):
2015-10-01_18:33:24.61067   File "/usr/lib/python2.7/dist-packages/pykka/future.py", line 299, in get
2015-10-01_18:33:24.61067     exec('raise exc_info[0], exc_info[1], exc_info[2]')
2015-10-01_18:33:24.61068   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 470, in ask
2015-10-01_18:33:24.61069     self.tell(message)
2015-10-01_18:33:24.61069   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 437, in tell
2015-10-01_18:33:24.61070     raise _ActorDeadError('%s not found' % self)
2015-10-01_18:33:24.61071 ActorDeadError: ComputeNodeMonitorActor (urn:uuid:3c2daeb6-1ab2-4024-b121-5380d136b234) not found
Updated by Brett Smith about 9 years ago
Actors "die" after they finish handling a normal stop signal, or if they raise an unhandled exception.
This backtrace seems to indicate that a shutdown actor couldn't talk to the monitor for its node. Based on the timestamps, it looks like we started the shutdown actor, and then the node vanished from the cloud—and hence we shut down its monitor—before the shutdown actor finished its work.
Probably the fix here is for the daemon actor to stop any associated shutdown actor when a node gets unlisted, just before it stops the monitor actor.
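The proposed ordering can be sketched with a tiny stdlib actor (a stand-in for pykka, not pykka itself; the actor class and the node_unlisted function are hypothetical names for illustration):

```python
import queue
import threading

class MiniActor(threading.Thread):
    """Minimal mailbox-driven actor; illustrative stand-in for a pykka actor."""
    def __init__(self, actor_name):
        super().__init__(daemon=True)
        self.actor_name = actor_name
        self.inbox = queue.Queue()
        self.alive = True
        self.start()

    def run(self):
        # Drain the mailbox until a stop message arrives.
        while True:
            if self.inbox.get() == "stop":
                self.alive = False
                return

    def stop(self):
        # Blocks until the actor thread has exited, like ActorRef.stop().
        self.inbox.put("stop")
        self.join()

def node_unlisted(shutdown_actor, monitor_actor):
    # Proposed fix: tear down the shutdown actor first, and only then the
    # monitor it talks to, so the shutdown actor can never find its paired
    # monitor already dead.
    shutdown_actor.stop()
    monitor_actor.stop()

shutdown = MiniActor("ComputeNodeShutdownActor")
monitor = MiniActor("ComputeNodeMonitorActor")
node_unlisted(shutdown, monitor)
assert not shutdown.alive and not monitor.alive
```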
Updated by Brett Smith about 9 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith about 9 years ago
- Subject changed from [Node manager] ActorDeadError to [Node manager] ShutdownActor dies when its paired MonitorActor goes away
- Category set to Node Manager
Updated by Brett Smith about 9 years ago
- Status changed from New to In Progress
- Assigned To set to Brett Smith
Updated by Brett Smith about 9 years ago
- Target version changed from Arvados Future Sprints to 2015-10-14 sprint
- Story points set to 0.5
Brett Smith wrote:
Probably the fix here is for the daemon actor to stop any associated shutdown actor when a node gets unlisted, just before it stops the monitor actor.
7435-node-manager-shutdown-cleanup-wip implements this fix and is up for review.
Updated by Peter Amstutz about 9 years ago
I did a little research on Pykka actors; it looks like stop() is blocking, so the shutdown actor should be completely stopped before the monitor actor stops, which avoids the bug. Looks good to me.
Updated by Brett Smith about 9 years ago
Peter Amstutz wrote:
I did a little research on Pykka actors; it looks like stop() is blocking, so the shutdown actor should be completely stopped before the monitor actor stops, which avoids the bug. Looks good to me.
The stop method of ActorRef blocks, but the stop method of ActorProxy acts like any other proxy method: it returns a Future, and you have to block on that if that's what you want.
I thought about this when I was writing the code, and I thought Pykka's ordering guarantees for delivering messages would prevent this from coming up. But (a) I can't find the cite for that, so I shouldn't be trusted, and (b) I'm pretty sure I'm misremembering it anyway, and it guarantees ordered delivery when two actors are talking to a third, rather than one actor talking to two others.
053de78 blocks, adds a test, and rebases on top of current master. Please take a look. Thanks.
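The distinction being made here, blocking on stop directly versus getting back a future that must be waited on, can be illustrated with a stdlib sketch (pykka is not required to run this; ToyActor and its method names are hypothetical, and concurrent.futures' .result() plays the role of pykka's .get()):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ToyActor:
    """Thread-backed object with the two stop styles discussed above."""
    def __init__(self):
        self._executor = ThreadPoolExecutor(max_workers=1)
        self.stopped = threading.Event()

    def _do_stop(self):
        self.stopped.set()
        return True

    def stop_blocking(self):
        # Like ActorRef.stop(block=True): does not return until done.
        return self._executor.submit(self._do_stop).result()

    def stop_async(self):
        # Like a proxy method: returns a future immediately; the caller
        # must block on it explicitly to know the stop has completed.
        return self._executor.submit(self._do_stop)

actor = ToyActor()
future = actor.stop_async()      # returns at once; stop may not have run yet
assert future.result() is True   # blocking on the future guarantees it ran
assert actor.stopped.is_set()
```

Forgetting the explicit wait on the future is exactly the race the fix closes: the caller moves on while the stop is still in flight.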
Updated by Peter Amstutz about 9 years ago
Brett Smith wrote:
Peter Amstutz wrote:
I did a little research on Pykka actors; it looks like stop() is blocking, so the shutdown actor should be completely stopped before the monitor actor stops, which avoids the bug. Looks good to me.
The stop method of ActorRef blocks, but the stop method of ActorProxy acts like any other proxy method: it returns a Future, and you have to block on that if that's what you want.
Okay, good, I'm glad you gave this one another look. I found this a bit confusing, because Actor.stop() defaults to block=False and ActorRef.stop() defaults to block=True, so I assumed ActorProxy.stop() delegated to ActorRef, but it sounds like that's not the case.
I thought about this when I was writing the code, and I thought Pykka's ordering guarantees for delivering messages would prevent this from coming up. But (a) I can't find the cite for that, so I shouldn't be trusted, and (b) I'm pretty sure I'm misremembering it anyway, and it guarantees ordered delivery when two actors are talking to a third, rather than one actor talking to two others.
053de78 blocks, adds a test, and rebases on top of current master. Please take a look. Thanks.
This one looks good to me.
Updated by Brett Smith about 9 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:71db992331f357fdb3a4fdbca42a9952b7e9ae2c.