Bug #8687

[Nodemanager] ComputeNodeShutdownActor dies.

Added by Nico César over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: -
Target version:
Start date: 03/14/2016
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

On version 0.1.20160310205427 we got the following exception:

nodemanager.wx7k5:/etc/sv# zgrep exception -i arvados-node-manager/log/main/@4000000056e7018b0d9856cc.s -A20
2016-03-14_17:57:29.63760 2016-03-14 17:57:29 pykka[32674] ERROR: Unhandled exception in NodeManagerDaemonActor (urn:uuid:8775e92b-aa32-47a9-86b6-fdf8fd9637a6):
2016-03-14_17:57:29.63764 Traceback (most recent call last):
2016-03-14_17:57:29.63765   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop
2016-03-14_17:57:29.63766     response = self._handle_receive(message)
2016-03-14_17:57:29.63767   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive
2016-03-14_17:57:29.63768     return callee(*message['args'], **message['kwargs'])
2016-03-14_17:57:29.63769   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 346, in wrapper
2016-03-14_17:57:29.63770     return orig_func(self, *args, **kwargs)
2016-03-14_17:57:29.63770   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 422, in node_can_shutdown
2016-03-14_17:57:29.63771     self._begin_node_shutdown(node_actor, cancellable=True)
2016-03-14_17:57:29.63772   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 417, in _begin_node_shutdown
2016-03-14_17:57:29.63773     shutdown.tell_proxy().subscribe(self._later.node_finished_shutdown)
2016-03-14_17:57:29.63774   File "/usr/lib/python2.7/dist-packages/pykka/proxy.py", line 161, in __getattr__
2016-03-14_17:57:29.63775     raise AttributeError('%s has no attribute "%s"' % (self, name))
2016-03-14_17:57:29.63777 AttributeError: <ActorProxy for ComputeNodeShutdownActor (urn:uuid:2aad9ea9-1b07-43d2-9beb-bd26d6d22686), attr_path=()> has no attribute "tell_proxy" 
2016-03-14_17:57:29.63810 2016-03-14 17:57:29 ComputeNodeShutdownActor.bd26d6d22686.compute-9ykvyjo8btbaolg-wx7k5[32674] INFO: Draining SLURM node compute0
2016-03-14_17:57:29.63829 2016-03-14 17:57:29 pykka[32674] DEBUG: Unregistered NodeManagerDaemonActor (urn:uuid:8775e92b-aa32-47a9-86b6-fdf8fd9637a6)
2016-03-14_17:57:29.66277 2016-03-14 17:57:29 ComputeNodeShutdownActor.bd26d6d22686.compute-9ykvyjo8btbaolg-wx7k5[32674] INFO: Waiting for SLURM node compute0 to drain
2016-03-14_17:57:29.69269 2016-03-14 17:57:29 ComputeNodeShutdownActor.bd26d6d22686.compute-9ykvyjo8btbaolg-wx7k5[32674] INFO: Starting shutdown
2016-03-14_17:57:30.29366 2016-03-14 17:57:30 ArvadosNodeListMonitorActor.140560520064576[32674] INFO: got response with 393 items in 0.870521783829 seconds, next poll at 2016-03-14 17:57:39
2016-03-14_17:57:32.34714 2016-03-14 17:57:32 CloudNodeListMonitorActor.140563744302144[32674] INFO: got response with 1 items in 24.9102361202 seconds, next poll at 2016-03-14 17:57:17
--
2016-03-14_18:22:26.97211 2016-03-14 18:22:26 root[32674] ERROR: Uncaught exception during setup
2016-03-14_18:22:26.97213 Traceback (most recent call last):
2016-03-14_18:22:26.97214   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 128, in main
2016-03-14_18:22:26.97214     signal.pause()
2016-03-14_18:22:26.97215   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 90, in shutdown_signal
2016-03-14_18:22:26.97215     node_daemon.shutdown()
2016-03-14_18:22:26.97216   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/baseactor.py", line 25, in __call__
2016-03-14_18:22:26.97216     self.actor_ref.tell(message)
2016-03-14_18:22:26.97217   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 437, in tell
2016-03-14_18:22:26.97217     raise _ActorDeadError('%s not found' % self)
2016-03-14_18:22:26.97218 ActorDeadError: NodeManagerDaemonActor (urn:uuid:8775e92b-aa32-47a9-86b6-fdf8fd9637a6) not found

We upgraded wx7k5 to 0.1.20160311203330, hoping that this is related to #8678 (which has presumably been fixed), but we don't know whether it is.

Feel free to mark this ticket as a duplicate if so.
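
For readers following the traceback above, here is a minimal, stdlib-only sketch of the failure chain the log suggests: a call to an attribute the proxy does not expose raises AttributeError, the unhandled exception stops the daemon actor, and a later message to it fails with ActorDeadError. Every name below (ToyProxy, ToyShutdownActor, ToyDaemon, simulate_failure) is hypothetical and only imitates the behaviour visible in the log; this is not the arvnodeman or pykka code.

```python
class ActorDeadError(Exception):
    """Toy stand-in for the error raised when messaging a stopped actor."""


class ToyProxy(object):
    """Toy proxy: exposes only attributes that the wrapped object defines."""

    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        # Reached only when normal lookup on the proxy itself fails.
        if not hasattr(self._target, name):
            raise AttributeError('%r has no attribute "%s"' % (self, name))
        return getattr(self._target, name)


class ToyShutdownActor(object):
    """Toy shutdown actor: has subscribe(), but no tell_proxy()."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)


class ToyDaemon(object):
    """Toy daemon actor that stops itself on any unhandled exception."""

    def __init__(self):
        self.alive = True
        self.last_error = None

    def node_finished_shutdown(self, *args):
        pass

    def begin_node_shutdown(self, shutdown_proxy):
        # Same shape as the failing line in the traceback:
        # the proxy has no "tell_proxy", so attribute lookup raises.
        shutdown_proxy.tell_proxy().subscribe(self.node_finished_shutdown)

    def shutdown(self):
        self.alive = False

    def tell(self, method_name, *args):
        if not self.alive:
            raise ActorDeadError("ToyDaemon not found")
        try:
            getattr(self, method_name)(*args)
        except Exception as error:
            # Assumption: an unhandled exception stops the actor,
            # as the "Unregistered NodeManagerDaemonActor" log line suggests.
            self.last_error = error
            self.alive = False


def simulate_failure():
    daemon = ToyDaemon()
    daemon.tell("begin_node_shutdown", ToyProxy(ToyShutdownActor()))
    first_error = type(daemon.last_error).__name__
    try:
        daemon.tell("shutdown")
        second_error = None
    except ActorDeadError as error:
        second_error = type(error).__name__
    return {
        "first_error": first_error,
        "daemon_alive": daemon.alive,
        "second_error": second_error,
    }
```

Calling simulate_failure() walks through both steps in order: the first message kills the toy daemon, and the second one, sent later, finds it already gone, much like the shutdown signal handler in the second traceback.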

@4000000056e7018b0d9856cc.s (70.6 KB) Nico César, 03/14/2016 06:30 PM

History

#1 Updated by Nico César over 5 years ago

  • Description updated

#2 Updated by Nico César over 5 years ago

  • Subject changed from [NODEMANAGER] NodeManagerDaemonActor dies. to [NODEMANAGER] ComputeNodeShutdownActor dies.

#3 Updated by Ward Vandewege over 5 years ago

  • Subject changed from [NODEMANAGER] ComputeNodeShutdownActor dies. to [Nodemanager] ComputeNodeShutdownActor dies.

#4 Updated by Ward Vandewege over 5 years ago

  • Status changed from New to Resolved
  • Target version set to 2016-03-16 sprint

We think this was fixed in 94b8484.
