Bug #8871

[Node Manager] Doesn't kill itself when unhandled exceptions are raised during actor setup

Added by Ward Vandewege over 4 years ago. Updated over 3 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

I've seen this a few times now on su92l:

2016-04-01_20:52:05.13582 2016-04-01 20:52:05 ArvadosNodeListMonitorActor.140095751503264[51443] DEBUG: sending request
2016-04-01_20:52:05.18295 2016-04-01 20:52:05 JobQueueMonitorActor.140095749474224[51443] DEBUG: Calculated wishlist: Standard_D3_v2, Standard_D3_v2
2016-04-01_20:52:05.18486 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D14_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.18493 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.18498 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.18536 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.18543 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.19228 2016-04-01 20:52:05 JobQueueMonitorActor.140095749474224[51443] INFO: got response with 2 items in 0.0693888664246 seconds, next poll at 2016-04-01 20:52:15
2016-04-01_20:52:05.19665 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D3_v2: wishlist 2, up 10 (booting 0, idle 1, busy 9), shutting down 0
2016-04-01_20:52:05.19694 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.19718 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.19741 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.20614 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D3_v2: wishlist 2, up 10 (booting 0, idle 1, busy 9), shutting down 0
2016-04-01_20:52:05.21357 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Want 1 more Standard_D3_v2 nodes.  Booting a node.
2016-04-01_20:52:05.70621 2016-04-01 20:52:05 ArvadosNodeListMonitorActor.140095751503264[51443] INFO: got response with 262 items in 0.576020002365 seconds, next poll at 2016-04-01 20:52:15
2016-04-01_20:52:05.76385 2016-04-01 20:52:05 pykka[51443] DEBUG: Registered ComputeNodeSetupActor (urn:uuid:d8b01ed6-a92c-4f92-ba80-ecdaee8b5717)
2016-04-01_20:52:05.76485 2016-04-01 20:52:05 pykka[51443] DEBUG: Starting ComputeNodeSetupActor (urn:uuid:d8b01ed6-a92c-4f92-ba80-ecdaee8b5717)
2016-04-01_20:52:05.76747 2016-04-01 20:52:05 pykka[51443] ERROR: Unhandled exception in NodeManagerDaemonActor (urn:uuid:902adcfd-aefa-43f1-82f2-df480ec63ec2):
2016-04-01_20:52:05.76754 Traceback (most recent call last):
2016-04-01_20:52:05.76759   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop
2016-04-01_20:52:05.76763     response = self._handle_receive(message)
2016-04-01_20:52:05.76777   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive
2016-04-01_20:52:05.76784     return callee(*message['args'], **message['kwargs'])
2016-04-01_20:52:05.76789   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 346, in wrapper
2016-04-01_20:52:05.76797     return orig_func(self, *args, **kwargs)
2016-04-01_20:52:05.76808   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 364, in start_node
2016-04-01_20:52:05.76813     cloud_size=cloud_size).proxy()
2016-04-01_20:52:05.76816   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 99, in start
2016-04-01_20:52:05.76820     obj._start_actor_loop()
2016-04-01_20:52:05.76823   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 367, in _start_actor_loop
2016-04-01_20:52:05.76826     thread.start()
2016-04-01_20:52:05.76835   File "/usr/lib/python2.7/threading.py", line 745, in start
2016-04-01_20:52:05.76839     _start_new_thread(self.__bootstrap, ())
2016-04-01_20:52:05.76842 error: can't start new thread
2016-04-01_20:52:05.76864 2016-04-01 20:52:05 pykka[51443] DEBUG: Unregistered NodeManagerDaemonActor (urn:uuid:902adcfd-aefa-43f1-82f2-df480ec63ec2)
2016-04-01_20:52:05.76881 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] CRITICAL: Unhandled exception is a fatal error, killing Node Manager
tail: ‘arvados-node-manager/log/main/current’ has become inaccessible: No such file or directory
tail: ‘arvados-node-manager/log/main/current’ has appeared;  following end of new file
2016-04-01_20:52:06.99188 Starting arvados-node-manager from /etc/sv/arvados-node-manager
2016-04-01_20:52:08.44991 2016-04-01 20:52:08 pykka[52648] DEBUG: Registered TimedCallBackActor (urn:uuid:efe8550c-dcc1-4bda-bc3f-ebf2bdfdef17)
2016-04-01_20:52:08.45017 2016-04-01 20:52:08 pykka[52648] DEBUG: Starting TimedCallBackActor (urn:uuid:efe8550c-dcc1-4bda-bc3f-ebf2bdfdef17)
2016-04-01_20:52:09.18461 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered CloudNodeListMonitorActor (urn:uuid:289e7dfd-7149-4fe4-ba2d-b0c1079652cd)
2016-04-01_20:52:09.18487 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting CloudNodeListMonitorActor (urn:uuid:289e7dfd-7149-4fe4-ba2d-b0c1079652cd)
2016-04-01_20:52:09.20825 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered ArvadosNodeListMonitorActor (urn:uuid:da3a72bd-6dad-43d9-a133-ec5a8612393a)
2016-04-01_20:52:09.20845 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting ArvadosNodeListMonitorActor (urn:uuid:da3a72bd-6dad-43d9-a133-ec5a8612393a)
2016-04-01_20:52:09.22401 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered JobQueueMonitorActor (urn:uuid:11ec074d-1e15-4435-bcb5-48761a6b886d)
2016-04-01_20:52:09.22421 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting JobQueueMonitorActor (urn:uuid:11ec074d-1e15-4435-bcb5-48761a6b886d)
2016-04-01_20:52:09.22504 2016-04-01 20:52:09 root[52648] ERROR: Uncaught exception during setup
2016-04-01_20:52:09.22510 Traceback (most recent call last):
2016-04-01_20:52:09.22514   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 112, in main
2016-04-01_20:52:09.22517     launch_pollers(config, server_calculator)
2016-04-01_20:52:09.22521   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 79, in launch_pollers
2016-04-01_20:52:09.22524     poll_time, max_poll_time).tell_proxy()
2016-04-01_20:52:09.22527   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 99, in start
2016-04-01_20:52:09.22534     obj._start_actor_loop()
2016-04-01_20:52:09.22538   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 367, in _start_actor_loop
2016-04-01_20:52:09.22541     thread.start()
2016-04-01_20:52:09.22545   File "/usr/lib/python2.7/threading.py", line 745, in start
2016-04-01_20:52:09.22548     _start_new_thread(self.__bootstrap, ())
2016-04-01_20:52:09.22551 error: can't start new thread

And at that point, node manager just hangs.


Related issues

Related to Arvados - Bug #8686: [Node Manager] qr1hi nodemanager can't start if ulimit is in place New03/14/2016

History

#1 Updated by Ward Vandewege over 4 years ago

  • Description updated (diff)

#2 Updated by Brett Smith over 4 years ago

  • Target version set to Arvados Future Sprints

#3 Updated by Brett Smith over 4 years ago

  • Subject changed from [Nodemanager] doesn't always kill itself when it can't create a thread to [Node Manager] Doesn't kill itself when unhandled exceptions are raised during actor setup

We implemented the death logic as an actor method that gets called when an unhandled exception is raised during message handling. This is happening before that: the actor can't even be created, because we can't spawn the thread to run it.

For the actors that get started at boot, those exceptions can be caught and stop the process inside the launcher function. Dealing with it for actors that get spawned under the Daemon Actor is a little trickier and needs more discussion.

Also available in: Atom PDF