Project

General

Profile

Bug #8871

Updated by Ward Vandewege over 6 years ago

I've seen this a few times now on su92l: 

 <pre> 
 2016-04-01_20:52:05.13582 2016-04-01 20:52:05 ArvadosNodeListMonitorActor.140095751503264[51443] DEBUG: sending request 
 2016-04-01_20:52:05.18295 2016-04-01 20:52:05 JobQueueMonitorActor.140095749474224[51443] DEBUG: Calculated wishlist: Standard_D3_v2, Standard_D3_v2 
 2016-04-01_20:52:05.18486 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D14_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 
 2016-04-01_20:52:05.18493 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 
 2016-04-01_20:52:05.18498 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 
 2016-04-01_20:52:05.18536 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 
 2016-04-01_20:52:05.18543 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 
 2016-04-01_20:52:05.19228 2016-04-01 20:52:05 JobQueueMonitorActor.140095749474224[51443] INFO: got response with 2 items in 0.0693888664246 seconds, next poll at 2016-04-01 20:52:15 
 2016-04-01_20:52:05.19665 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D3_v2: wishlist 2, up 10 (booting 0, idle 1, busy 9), shutting down 0 
 2016-04-01_20:52:05.19694 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 
 2016-04-01_20:52:05.19718 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 
 2016-04-01_20:52:05.19741 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 
 2016-04-01_20:52:05.20614 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Standard_D3_v2: wishlist 2, up 10 (booting 0, idle 1, busy 9), shutting down 0 
 2016-04-01_20:52:05.21357 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] INFO: Want 1 more Standard_D3_v2 nodes.    Booting a node. 
 2016-04-01_20:52:05.70621 2016-04-01 20:52:05 ArvadosNodeListMonitorActor.140095751503264[51443] INFO: got response with 262 items in 0.576020002365 seconds, next poll at 2016-04-01 20:52:15 
 2016-04-01_20:52:05.76385 2016-04-01 20:52:05 pykka[51443] DEBUG: Registered ComputeNodeSetupActor (urn:uuid:d8b01ed6-a92c-4f92-ba80-ecdaee8b5717) 
 2016-04-01_20:52:05.76485 2016-04-01 20:52:05 pykka[51443] DEBUG: Starting ComputeNodeSetupActor (urn:uuid:d8b01ed6-a92c-4f92-ba80-ecdaee8b5717) 
 2016-04-01_20:52:05.76747 2016-04-01 20:52:05 pykka[51443] ERROR: Unhandled exception in NodeManagerDaemonActor (urn:uuid:902adcfd-aefa-43f1-82f2-df480ec63ec2): 
 2016-04-01_20:52:05.76754 Traceback (most recent call last): 
 2016-04-01_20:52:05.76759     File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop 
 2016-04-01_20:52:05.76763       response = self._handle_receive(message) 
 2016-04-01_20:52:05.76777     File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive 
 2016-04-01_20:52:05.76784       return callee(*message['args'], **message['kwargs']) 
 2016-04-01_20:52:05.76789     File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 346, in wrapper 
 2016-04-01_20:52:05.76797       return orig_func(self, *args, **kwargs) 
 2016-04-01_20:52:05.76808     File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 364, in start_node 
 2016-04-01_20:52:05.76813       cloud_size=cloud_size).proxy() 
 2016-04-01_20:52:05.76816     File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 99, in start 
 2016-04-01_20:52:05.76820       obj._start_actor_loop() 
 2016-04-01_20:52:05.76823     File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 367, in _start_actor_loop 
 2016-04-01_20:52:05.76826       thread.start() 
 2016-04-01_20:52:05.76835     File "/usr/lib/python2.7/threading.py", line 745, in start 
 2016-04-01_20:52:05.76839       _start_new_thread(self.__bootstrap, ()) 
 2016-04-01_20:52:05.76842 error: can't start new thread 
 2016-04-01_20:52:05.76864 2016-04-01 20:52:05 pykka[51443] DEBUG: Unregistered NodeManagerDaemonActor (urn:uuid:902adcfd-aefa-43f1-82f2-df480ec63ec2) 
 2016-04-01_20:52:05.76881 2016-04-01 20:52:05 NodeManagerDaemonActor.df480ec63ec2[51443] CRITICAL: Unhandled exception is a fatal error, killing Node Manager 
 tail: ‘arvados-node-manager/log/main/current’ has become inaccessible: No such file or directory 
 tail: ‘arvados-node-manager/log/main/current’ has appeared;    following end of new file 
 2016-04-01_20:52:06.99188 Starting arvados-node-manager from /etc/sv/arvados-node-manager 
 2016-04-01_20:52:08.44991 2016-04-01 20:52:08 pykka[52648] DEBUG: Registered TimedCallBackActor (urn:uuid:efe8550c-dcc1-4bda-bc3f-ebf2bdfdef17) 
 2016-04-01_20:52:08.45017 2016-04-01 20:52:08 pykka[52648] DEBUG: Starting TimedCallBackActor (urn:uuid:efe8550c-dcc1-4bda-bc3f-ebf2bdfdef17) 
 2016-04-01_20:52:09.18461 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered CloudNodeListMonitorActor (urn:uuid:289e7dfd-7149-4fe4-ba2d-b0c1079652cd) 
 2016-04-01_20:52:09.18487 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting CloudNodeListMonitorActor (urn:uuid:289e7dfd-7149-4fe4-ba2d-b0c1079652cd) 
 2016-04-01_20:52:09.20825 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered ArvadosNodeListMonitorActor (urn:uuid:da3a72bd-6dad-43d9-a133-ec5a8612393a) 
 2016-04-01_20:52:09.20845 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting ArvadosNodeListMonitorActor (urn:uuid:da3a72bd-6dad-43d9-a133-ec5a8612393a) 
 2016-04-01_20:52:09.22401 2016-04-01 20:52:09 pykka[52648] DEBUG: Registered JobQueueMonitorActor (urn:uuid:11ec074d-1e15-4435-bcb5-48761a6b886d) 
 2016-04-01_20:52:09.22421 2016-04-01 20:52:09 pykka[52648] DEBUG: Starting JobQueueMonitorActor (urn:uuid:11ec074d-1e15-4435-bcb5-48761a6b886d) 
 2016-04-01_20:52:09.22504 2016-04-01 20:52:09 root[52648] ERROR: Uncaught exception during setup 
 2016-04-01_20:52:09.22510 Traceback (most recent call last): 
 2016-04-01_20:52:09.22514     File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 112, in main 
 2016-04-01_20:52:09.22517       launch_pollers(config, server_calculator) 
 2016-04-01_20:52:09.22521     File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 79, in launch_pollers 
 2016-04-01_20:52:09.22524       poll_time, max_poll_time).tell_proxy() 
 2016-04-01_20:52:09.22527     File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 99, in start 
 2016-04-01_20:52:09.22534       obj._start_actor_loop() 
 2016-04-01_20:52:09.22538     File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 367, in _start_actor_loop 
 2016-04-01_20:52:09.22541       thread.start() 
 2016-04-01_20:52:09.22545     File "/usr/lib/python2.7/threading.py", line 745, in start 
 2016-04-01_20:52:09.22548       _start_new_thread(self.__bootstrap, ()) 
 2016-04-01_20:52:09.22551 error: can't start new thread 
 </pre> 

 And at that point, node manager just hangs.

Back