Bug #8871

Updated by Ward Vandewege over 4 years ago

I've seen this a few times now on su92l:

<pre>
2016-04-01_20:52:05.13582 2016-04-01 20:52:05 ArvadosNodeListMonitorActor.140095751503264[51443] DEBUG: sending request
2016-04-01_20:52:05.18295 2016-04-01 20:52:05 JobQueueMonitorActor.140095749474224[51443] DEBUG: Calculated wishlist: Standard_D3_v2, Standard_D3_v2
2016-04-01_20:52:05.18486 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D14_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.18493 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.18498 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.18536 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.18543 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.19228 2016-04-01 20:52:05 JobQueueMonitorActor.140095749474224[51443] INFO: got response with 2 items in 0.0693888664246 seconds, next poll at 2016-04-01 20:52:15
2016-04-01_20:52:05.19665 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D3_v2: wishlist 2, up 10 (booting 0, idle 1, busy 9), shutting down 0
2016-04-01_20:52:05.19694 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.19718 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.19741 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-01_20:52:05.20614 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D3_v2: wishlist 2, up 10 (booting 0, idle 1, busy 9), shutting down 0
2016-04-01_20:52:05.21357 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Want 1 more Standard_D3_v2 nodes. Booting a node.
2016-04-01_20:52:05.70621 2016-04-01 20:52:05 ArvadosNodeListMonitorActor.140095751503264[51443] INFO: got response with 262 items in 0.576020002365 seconds, next poll at 2016-04-01 20:52:15
2016-04-01_20:52:05.76385 2016-04-01 20:52:05 pykka[51443] DEBUG: Registered ComputeNodeSetupActor (urn:uuid:d8b01ed6-a92c-4f92-ba80-ecdaee8b5717)
2016-04-01_20:52:05.76485 2016-04-01 20:52:05 pykka[51443] DEBUG: Starting ComputeNodeSetupActor (urn:uuid:d8b01ed6-a92c-4f92-ba80-ecdaee8b5717)
2016-04-01_20:52:05.76747 2016-04-01 20:52:05 pykka[51443] ERROR: Unhandled exception in NodeManagerDaemonActor (urn:uuid:902adcfd-aefa-43f1-82f2-df480ec63ec2):
2016-04-01_20:52:05.76754 Traceback (most recent call last):
2016-04-01_20:52:05.76759 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop
2016-04-01_20:52:05.76763 response = self._handle_receive(message)
2016-04-01_20:52:05.76777 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive
2016-04-01_20:52:05.76784 return callee(*message['args'], **message['kwargs'])
2016-04-01_20:52:05.76789 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 346, in wrapper
2016-04-01_20:52:05.76797 return orig_func(self, *args, **kwargs)
2016-04-01_20:52:05.76808 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 364, in start_node
2016-04-01_20:52:05.76813 cloud_size=cloud_size).proxy()
2016-04-01_20:52:05.76816 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 99, in start
2016-04-01_20:52:05.76820 obj._start_actor_loop()
2016-04-01_20:52:05.76823 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 367, in _start_actor_loop
2016-04-01_20:52:05.76826 thread.start()
2016-04-01_20:52:05.76835 File "/usr/lib/python2.7/threading.py", line 745, in start
2016-04-01_20:52:05.76839 _start_new_thread(self.__bootstrap, ())
2016-04-01_20:52:05.76842 error: can't start new thread
2016-04-01_20:52:05.76864 2016-04-01 20:52:05 pykka[51443] DEBUG: Unregistered NodeManagerDaemonActor (urn:uuid:902adcfd-aefa-43f1-82f2-df480ec63ec2)
2016-04-01_20:52:05.76881 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] CRITICAL: Unhandled exception is a fatal error, killing Node Manager
tail: ‘arvados-node-manager/log/main/current’ has become inaccessible: No such file or directory
tail: ‘arvados-node-manager/log/main/current’ has appeared; following end of new file
2016-04-01_20:52:06.99188 Starting arvados-node-manager from /etc/sv/arvados-node-manager
2016-04-01_20:52:08.44991 2016-04-01 20:52:08 pykka[52648] DEBUG: Registered TimedCallBackActor (urn:uuid:efe8550c-dcc1-4bda-bc3f-ebf2bdfdef17)
2016-04-01_20:52:08.45017 2016-04-01 20:52:08 pykka[52648] DEBUG: Starting TimedCallBackActor (urn:uuid:efe8550c-dcc1-4bda-bc3f-ebf2bdfdef17)
2016-04-01_20:52:09.18461 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered CloudNodeListMonitorActor (urn:uuid:289e7dfd-7149-4fe4-ba2d-b0c1079652cd)
2016-04-01_20:52:09.18487 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting CloudNodeListMonitorActor (urn:uuid:289e7dfd-7149-4fe4-ba2d-b0c1079652cd)
2016-04-01_20:52:09.20825 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered ArvadosNodeListMonitorActor (urn:uuid:da3a72bd-6dad-43d9-a133-ec5a8612393a)
2016-04-01_20:52:09.20845 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting ArvadosNodeListMonitorActor (urn:uuid:da3a72bd-6dad-43d9-a133-ec5a8612393a)
2016-04-01_20:52:09.22401 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered JobQueueMonitorActor (urn:uuid:11ec074d-1e15-4435-bcb5-48761a6b886d)
2016-04-01_20:52:09.22421 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting JobQueueMonitorActor (urn:uuid:11ec074d-1e15-4435-bcb5-48761a6b886d)
2016-04-01_20:52:09.22504 2016-04-01 20:52:09 root[52648] ERROR: Uncaught exception during setup
2016-04-01_20:52:09.22510 Traceback (most recent call last):
2016-04-01_20:52:09.22514 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 112, in main
2016-04-01_20:52:09.22517 launch_pollers(config, server_calculator)
2016-04-01_20:52:09.22521 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 79, in launch_pollers
2016-04-01_20:52:09.22524 poll_time, max_poll_time).tell_proxy()
2016-04-01_20:52:09.22527 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 99, in start
2016-04-01_20:52:09.22534 obj._start_actor_loop()
2016-04-01_20:52:09.22538 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 367, in _start_actor_loop
2016-04-01_20:52:09.22541 thread.start()
2016-04-01_20:52:09.22545 File "/usr/lib/python2.7/threading.py", line 745, in start
2016-04-01_20:52:09.22548 _start_new_thread(self.__bootstrap, ())
2016-04-01_20:52:09.22551 error: can't start new thread
</pre>

And at that point, node manager just hangs.

Back