Actions
Bug #8872
closed[Node Manager] On node creation, node search fails, and unhandled exceptions cascade up to the Daemon Actor
Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Node Manager
Target version:
Story points:
-
Updated by Ward Vandewege over 8 years ago
Node manager seems to tie itself in a knot predictably on su92l at the moment, like this:
2016-04-03_19:13:13.09988 2016-04-03 19:13:13 ArvadosNodeListMonitorActor.140211098648048[969] INFO: got response with 357 items in 0.942911863327 seconds, next poll at 2016-04-03 19:13:22 2016-04-03_19:13:17.21261 2016-04-03 19:13:17 ComputeNodeSetupActor.3aa00c9d7757[969] WARNING: Re-raising unknown error (no retry): list_nodes returned 0 results for 'compute-i2fjib4fio73bcb-su92l' 2016-04-03_19:13:17.21275 Traceback (most recent call last): 2016-04-03_19:13:17.21280 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/__init__.py", line 74, in retry_wrapper 2016-04-03_19:13:17.21285 ret = orig_func(self, *args, **kwargs) 2016-04-03_19:13:17.21288 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 122, in create_cloud_node 2016-04-03_19:13:17.21292 self.arvados_node) 2016-04-03_19:13:17.21295 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 153, in create_node 2016-04-03_19:13:17.21300 node = self.search_for(kwargs['name'], 'list_nodes', self._name_key) 2016-04-03_19:13:17.21304 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 103, in search_for 2016-04-03_19:13:17.21309 list_method, count, term)) 2016-04-03_19:13:17.21314 ValueError: list_nodes returned 0 results for 'compute-i2fjib4fio73bcb-su92l' 2016-04-03_19:13:17.22511 2016-04-03 19:13:17 ComputeNodeSetupActor.3aa00c9d7757[969] ERROR: Actor error list_nodes returned 0 results for 'compute-i2fjib4fio73bcb-su92l' 2016-04-03_19:13:17.22520 2016-04-03 19:13:17 ComputeNodeSetupActor.3aa00c9d7757[969] INFO: finished 2016-04-03_19:13:19.72572 2016-04-03 19:13:19 ComputeNodeSetupActor.fd6af6cd3607[969] WARNING: Re-raising unknown error (no retry): list_nodes returned 0 results for 'compute-gio6sri3qlzh56s-su92l' 2016-04-03_19:13:19.72585 Traceback (most recent call last): 2016-04-03_19:13:19.72589 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/__init__.py", line 74, in retry_wrapper 2016-04-03_19:13:19.72593 ret = orig_func(self, *args, **kwargs) 2016-04-03_19:13:19.72597 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 122, in create_cloud_node 2016-04-03_19:13:19.72601 self.arvados_node) 2016-04-03_19:13:19.72604 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 153, in create_node 2016-04-03_19:13:19.72608 node = self.search_for(kwargs['name'], 'list_nodes', self._name_key) 2016-04-03_19:13:19.72611 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 103, in search_for 2016-04-03_19:13:19.72615 list_method, count, term)) 2016-04-03_19:13:19.72618 ValueError: list_nodes returned 0 results for 'compute-gio6sri3qlzh56s-su92l' 2016-04-03_19:13:19.72636 2016-04-03 19:13:19 ComputeNodeSetupActor.fd6af6cd3607[969] ERROR: Actor error list_nodes returned 0 results for 'compute-gio6sri3qlzh56s-su92l' 2016-04-03_19:13:19.72694 2016-04-03 19:13:19 ComputeNodeSetupActor.fd6af6cd3607[969] INFO: finished 2016-04-03_19:13:19.75696 2016-04-03 19:13:19 pykka[969] DEBUG: Unregistered ComputeNodeSetupActor (urn:uuid:48a64720-1122-442c-a0e4-fd6af6cd3607) 2016-04-03_19:13:19.75717 2016-04-03 19:13:19 pykka[969] DEBUG: Stopped ComputeNodeSetupActor (urn:uuid:48a64720-1122-442c-a0e4-fd6af6cd3607) 2016-04-03_19:13:20.04752 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D14_v2: wishlist 21, up 32 (booting 20, idle 0, busy 12), shutting down 0 2016-04-03_19:13:20.04799 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.04841 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.04878 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.04910 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.04993 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D3_v2: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0 2016-04-03_19:13:20.05028 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.05069 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.05104 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.54790 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D14_v2: wishlist 21, up 32 (booting 20, idle 0, busy 12), shutting down 0 2016-04-03_19:13:20.54832 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.54863 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.54898 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.54928 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.55010 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D3_v2: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0 2016-04-03_19:13:20.55040 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.55070 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.55099 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.98113 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D14_v2: wishlist 21, up 32 (booting 20, idle 0, busy 12), shutting down 0 2016-04-03_19:13:20.98176 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.98210 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.98793 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.98825 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.99258 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D3_v2: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0 2016-04-03_19:13:20.99292 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.99324 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:20.99353 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.31842 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Cloud node compute-la5o2ppkzjbgjqo-su92l is now paired with Arvados node su92l-7ekkf-la5o2ppkzjbgjqo 2016-04-03_19:13:21.31881 2016-04-03 19:13:21 ArvadosNodeListMonitorActor.140211098648048[969] DEBUG: urn:uuid:87c5fa66-6fa9-4f0c-b05e-e1af316c153d subscribed to events for 'su92l-7ekkf-la5o2ppkzjbgjqo' 2016-04-03_19:13:21.42428 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Cloud node compute-k1ut7sdnzdgzcpw-su92l is now paired with Arvados node su92l-7ekkf-k1ut7sdnzdgzcpw 2016-04-03_19:13:21.42461 2016-04-03 19:13:21 ArvadosNodeListMonitorActor.140211098648048[969] DEBUG: urn:uuid:b1b7a77e-fa38-4f71-84e4-c1342309b33a subscribed to events for 'su92l-7ekkf-k1ut7sdnzdgzcpw' 2016-04-03_19:13:21.44957 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D14_v2: wishlist 19, up 32 (booting 18, idle 2, busy 12), shutting down 0 2016-04-03_19:13:21.45117 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.45149 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.45185 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.45214 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.45614 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D3_v2: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0 2016-04-03_19:13:21.45648 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.45720 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.45751 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.46746 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D14_v2: wishlist 19, up 32 (booting 18, idle 2, busy 12), shutting down 0 2016-04-03_19:13:21.46784 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.46815 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.46846 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.46878 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.46960 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D3_v2: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0 2016-04-03_19:13:21.46991 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.47019 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.47048 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-04-03_19:13:21.47503 2016-04-03 19:13:21 pykka[969] DEBUG: Unregistered ComputeNodeSetupActor (urn:uuid:ce8f8179-edc8-4576-ae14-3aa00c9d7757) 2016-04-03_19:13:21.47510 2016-04-03 19:13:21 pykka[969] DEBUG: Stopped ComputeNodeSetupActor (urn:uuid:ce8f8179-edc8-4576-ae14-3aa00c9d7757) 2016-04-03_19:13:21.48711 2016-04-03 19:13:21 pykka[969] ERROR: Unhandled exception in NodeManagerDaemonActor (urn:uuid:16c7553a-0c6e-4bda-b8d3-99e153f809d7): 2016-04-03_19:13:21.48718 Traceback (most recent call last): 2016-04-03_19:13:21.48723 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop 2016-04-03_19:13:21.48727 response = self._handle_receive(message) 2016-04-03_19:13:21.48731 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive 2016-04-03_19:13:21.48734 return callee(*message['args'], **message['kwargs']) 2016-04-03_19:13:21.48738 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 379, in node_up 2016-04-03_19:13:21.48741 cloud_node = setup_proxy.cloud_node.get() 2016-04-03_19:13:21.48745 File "/usr/lib/python2.7/dist-packages/pykka/future.py", line 299, in get 2016-04-03_19:13:21.48748 exec('raise exc_info[0], exc_info[1], exc_info[2]') 2016-04-03_19:13:21.48751 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 470, in ask 2016-04-03_19:13:21.48755 self.tell(message) 2016-04-03_19:13:21.48758 File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 437, in tell 2016-04-03_19:13:21.48763 raise _ActorDeadError('%s not found' % self) 2016-04-03_19:13:21.48773 ActorDeadError: ComputeNodeSetupActor (urn:uuid:48a64720-1122-442c-a0e4-fd6af6cd3607) not found 2016-04-03_19:13:21.48799 2016-04-03 19:13:21 pykka[969] DEBUG: Unregistered NodeManagerDaemonActor (urn:uuid:16c7553a-0c6e-4bda-b8d3-99e153f809d7) 2016-04-03_19:13:22.17183 2016-04-03 19:13:22 JobQueueMonitorActor.140211088250880[969] DEBUG: sending request 2016-04-03_19:13:22.17197 2016-04-03 19:13:22 ArvadosNodeListMonitorActor.140211098648048[969] DEBUG: sending request 2016-04-03_19:13:22.18447 2016-04-03 19:13:22 ComputeNodeSetupActor.4b3b36d9472c[969] WARNING: Re-raising unknown error (no retry): list_nodes returned 0 results for 'compute-jvh1ctbr59fbt2x-su92l' 2016-04-03_19:13:22.18455 Traceback (most recent call last): 2016-04-03_19:13:22.18460 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/__init__.py", line 74, in retry_wrapper 2016-04-03_19:13:22.18464 ret = orig_func(self, *args, **kwargs) 2016-04-03_19:13:22.18468 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 122, in create_cloud_node 2016-04-03_19:13:22.18471 self.arvados_node) 2016-04-03_19:13:22.18475 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 153, in create_node 2016-04-03_19:13:22.18478 node = self.search_for(kwargs['name'], 'list_nodes', self._name_key) 2016-04-03_19:13:22.18481 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 103, in search_for 2016-04-03_19:13:22.18485 list_method, count, term)) 2016-04-03_19:13:22.18488 ValueError: list_nodes returned 0 results for 'compute-jvh1ctbr59fbt2x-su92l' 2016-04-03_19:13:22.18505 2016-04-03 19:13:22 ComputeNodeSetupActor.4b3b36d9472c[969] ERROR: Actor error list_nodes returned 0 results for 'compute-jvh1ctbr59fbt2x-su92l' 2016-04-03_19:13:22.18560 2016-04-03 19:13:22 ComputeNodeSetupActor.4b3b36d9472c[969] INFO: finished
Once the DeamonActor dies, nothing much else works anymore.
Updated by Brett Smith over 8 years ago
- Target version set to Arvados Future Sprints
Updated by Brett Smith over 8 years ago
- Subject changed from [Nodemanager] list_nodes traceback, stops booting nodes to [Node Manager] On node creation, node search fails, and unhandled exceptions cascade up to the Daemon Actor
- Category set to Node Manager
Updated by Brett Smith over 8 years ago
- Status changed from New to In Progress
- Assigned To set to Brett Smith
- Target version changed from Arvados Future Sprints to 2016-04-13 sprint
Updated by Peter Amstutz over 8 years ago
Thanks. When wrote this I didn't look hard enough at what search_for()
actually did.
Nits (optional):
The docstring on search_for
doesn't document its caching behavior.
def cloud_name_from_arv_node(self, arv_node):
Instead of adding this to the test classes, couldn't this be DRYd up by adding them to the actual ComputeNodeDriver
classes so that both arvados_create_kwargs()
and the tests can use same the code?
The rest looks good to me (please go ahead and merge).
Updated by Brett Smith over 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:68d551b65a837664be72bf08aad55ab76d778d07.
Actions