Actions
Bug #8871
closed[Node Manager] Doesn't kill itself when unhandled exceptions are raised during actor setup
Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Story points:
-
Description
I've seen this a few times now on su92l:
2016-04-01_20:52:05.13582 2016-04-01 20:52:05 ArvadosNodeListMonitorActor.140095751503264[51443] DEBUG: sending request 2016-04-01_20:52:05.18295 2016-04-01 20:52:05 JobQueueMonitorActor.140095749474224[51443] DEBUG: Calculated wishlist: Standard_D3_v2, Standard_D3_v2 2016-04-01_20:52:05.18486 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D14_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-01_20:52:05.18493 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-01_20:52:05.18498 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-01_20:52:05.18536 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-01_20:52:05.18543 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-01_20:52:05.19228 2016-04-01 20:52:05 JobQueueMonitorActor.140095749474224[51443] INFO: got response with 2 items in 0.0693888664246 seconds, next poll at 2016-04-01 20:52:15 2016-04-01_20:52:05.19665 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D3_v2: wishlist 2, up 10 (booting 0, idle 1, busy 9), shutting down 0 2016-04-01_20:52:05.19694 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-01_20:52:05.19718 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-01_20:52:05.19741 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-01_20:52:05.20614 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D3_v2: wishlist 2, up 10 (booting 0, idle 1, busy 9), shutting down 0 2016-04-01_20:52:05.21357 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Want 1 more Standard_D3_v2 nodes. Booting a node. 2016-04-01_20:52:05.70621 2016-04-01 20:52:05 ArvadosNodeListMonitorActor.140095751503264[51443] INFO: got response with 262 items in 0.576020002365 seconds, next poll at 2016-04-01 20:52:15 2016-04-01_20:52:05.76385 2016-04-01 20:52:05 pykka[51443] DEBUG: Registered ComputeNodeSetupActor (urn:uuid:d8b01ed6-a92c-4f92-ba80-ecdaee8b5717) 2016-04-01_20:52:05.76485 2016-04-01 20:52:05 pykka[51443] DEBUG: Starting ComputeNodeSetupActor (urn:uuid:d8b01ed6-a92c-4f92-ba80-ecdaee8b5717) 2016-04-01_20:52:05.76747 2016-04-01 20:52:05 pykka[51443] ERROR: Unhandled exception in NodeManagerDaemonActor (urn:uuid:902adcfd-aefa-43f1-82f2-df480ec63ec2): 2016-04-01_20:52:05.76754 Traceback (most recent call last): 2016-04-01_20:52:05.76759 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop 2016-04-01_20:52:05.76763 response = self._handle_receive(message) 2016-04-01_20:52:05.76777 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive 2016-04-01_20:52:05.76784 return callee(*message['args'], **message['kwargs']) 2016-04-01_20:52:05.76789 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 346, in wrapper 2016-04-01_20:52:05.76797 return orig_func(self, *args, **kwargs) 2016-04-01_20:52:05.76808 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 364, in start_node 2016-04-01_20:52:05.76813 cloud_size=cloud_size).proxy() 2016-04-01_20:52:05.76816 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 99, in start 2016-04-01_20:52:05.76820 obj._start_actor_loop() 2016-04-01_20:52:05.76823 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 367, in _start_actor_loop 2016-04-01_20:52:05.76826 thread.start() 2016-04-01_20:52:05.76835 File "/usr/lib/python2.7/threading.py", line 745, in start 2016-04-01_20:52:05.76839 _start_new_thread(self.__bootstrap, ()) 2016-04-01_20:52:05.76842 error: can't start new thread 2016-04-01_20:52:05.76864 2016-04-01 20:52:05 pykka[51443] DEBUG: Unregistered NodeManagerDaemonActor (urn:uuid:902adcfd-aefa-43f1-82f2-df480ec63ec2) 2016-04-01_20:52:05.76881 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] CRITICAL: Unhandled exception is a fatal error, killing Node Manager tail: ‘arvados-node-manager/log/main/current’ has become inaccessible: No such file or directory tail: ‘arvados-node-manager/log/main/current’ has appeared; following end of new file 2016-04-01_20:52:06.99188 Starting arvados-node-manager from /etc/sv/arvados-node-manager 2016-04-01_20:52:08.44991 2016-04-01 20:52:08 pykka[52648] DEBUG: Registered TimedCallBackActor (urn:uuid:efe8550c-dcc1-4bda-bc3f-ebf2bdfdef17) 2016-04-01_20:52:08.45017 2016-04-01 20:52:08 pykka[52648] DEBUG: Starting TimedCallBackActor (urn:uuid:efe8550c-dcc1-4bda-bc3f-ebf2bdfdef17) 2016-04-01_20:52:09.18461 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered CloudNodeListMonitorActor (urn:uuid:289e7dfd-7149-4fe4-ba2d-b0c1079652cd) 2016-04-01_20:52:09.18487 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting CloudNodeListMonitorActor (urn:uuid:289e7dfd-7149-4fe4-ba2d-b0c1079652cd) 2016-04-01_20:52:09.20825 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered ArvadosNodeListMonitorActor (urn:uuid:da3a72bd-6dad-43d9-a133-ec5a8612393a) 2016-04-01_20:52:09.20845 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting ArvadosNodeListMonitorActor (urn:uuid:da3a72bd-6dad-43d9-a133-ec5a8612393a) 2016-04-01_20:52:09.22401 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered JobQueueMonitorActor (urn:uuid:11ec074d-1e15-4435-bcb5-48761a6b886d) 2016-04-01_20:52:09.22421 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting JobQueueMonitorActor (urn:uuid:11ec074d-1e15-4435-bcb5-48761a6b886d) 2016-04-01_20:52:09.22504 2016-04-01 20:52:09 root[52648] ERROR: Uncaught exception during setup 2016-04-01_20:52:09.22510 Traceback (most recent call last): 2016-04-01_20:52:09.22514 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 112, in main 2016-04-01_20:52:09.22517 launch_pollers(config, server_calculator) 2016-04-01_20:52:09.22521 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 79, in launch_pollers 2016-04-01_20:52:09.22524 poll_time, max_poll_time).tell_proxy() 2016-04-01_20:52:09.22527 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 99, in start 2016-04-01_20:52:09.22534 obj._start_actor_loop() 2016-04-01_20:52:09.22538 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 367, in _start_actor_loop 2016-04-01_20:52:09.22541 thread.start() 2016-04-01_20:52:09.22545 File "/usr/lib/python2.7/threading.py", line 745, in start 2016-04-01_20:52:09.22548 _start_new_thread(self.__bootstrap, ()) 2016-04-01_20:52:09.22551 error: can't start new thread
And at that point, node manager just hangs.
Related issues
Updated by Brett Smith about 8 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith about 8 years ago
- Subject changed from [Nodemanager] doesn't always kill itself when it can't create a thread to [Node Manager] Doesn't kill itself when unhandled exceptions are raised during actor setup
We implemented the death logic as an actor method that gets called when an unhandled exception is raised during message handling. This is happening before that: the actor can't even be created, because we can't spawn the thread to run it.
For the actors that get started at boot, those exceptions can be caught and stop the process inside the launcher function. Dealing with it for actors that get spawned under the Daemon Actor is a little trickier and needs more discussion.
Updated by Ward Vandewege over 3 years ago
- Target version deleted (
Arvados Future Sprints)
Actions