Bug #8872

[Node Manager] On node creation, node search fails, and unhandled exceptions cascade up to the Daemon Actor

Added by Ward Vandewege about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
Node Manager
Target version:
Start date:
04/06/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Subtasks

Task #8900: Review 8872-node-manager-create-search-handling-wipResolvedBrett Smith

Associated revisions

Revision 68d551b6
Added by Brett Smith about 5 years ago

Merge branch '8872-node-manager-create-search-handling-wip'

Closes #8872, #8900.

History

#1 Updated by Ward Vandewege about 5 years ago

Node manager seems to tie itself in a knot predictably on su92l at the moment, like this:

2016-04-03_19:13:13.09988 2016-04-03 19:13:13 ArvadosNodeListMonitorActor.140211098648048[969] INFO: got response with 357 items in 0.942911863327 seconds, next poll at 2016-04-03 19:13:22
2016-04-03_19:13:17.21261 2016-04-03 19:13:17 ComputeNodeSetupActor.3aa00c9d7757[969] WARNING: Re-raising unknown error (no retry): list_nodes returned 0 results for 'compute-i2fjib4fio73bcb-su92l'
2016-04-03_19:13:17.21275 Traceback (most recent call last):
2016-04-03_19:13:17.21280   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/__init__.py", line 74, in retry_wrapper
2016-04-03_19:13:17.21285     ret = orig_func(self, *args, **kwargs)
2016-04-03_19:13:17.21288   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 122, in create_cloud_node
2016-04-03_19:13:17.21292     self.arvados_node)
2016-04-03_19:13:17.21295   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 153, in create_node
2016-04-03_19:13:17.21300     node = self.search_for(kwargs['name'], 'list_nodes', self._name_key)
2016-04-03_19:13:17.21304   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 103, in search_for
2016-04-03_19:13:17.21309     list_method, count, term))
2016-04-03_19:13:17.21314 ValueError: list_nodes returned 0 results for 'compute-i2fjib4fio73bcb-su92l'
2016-04-03_19:13:17.22511 2016-04-03 19:13:17 ComputeNodeSetupActor.3aa00c9d7757[969] ERROR: Actor error list_nodes returned 0 results for 'compute-i2fjib4fio73bcb-su92l'
2016-04-03_19:13:17.22520 2016-04-03 19:13:17 ComputeNodeSetupActor.3aa00c9d7757[969] INFO: finished
2016-04-03_19:13:19.72572 2016-04-03 19:13:19 ComputeNodeSetupActor.fd6af6cd3607[969] WARNING: Re-raising unknown error (no retry): list_nodes returned 0 results for 'compute-gio6sri3qlzh56s-su92l'
2016-04-03_19:13:19.72585 Traceback (most recent call last):
2016-04-03_19:13:19.72589   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/__init__.py", line 74, in retry_wrapper
2016-04-03_19:13:19.72593     ret = orig_func(self, *args, **kwargs)
2016-04-03_19:13:19.72597   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 122, in create_cloud_node
2016-04-03_19:13:19.72601     self.arvados_node)
2016-04-03_19:13:19.72604   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 153, in create_node
2016-04-03_19:13:19.72608     node = self.search_for(kwargs['name'], 'list_nodes', self._name_key)
2016-04-03_19:13:19.72611   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 103, in search_for
2016-04-03_19:13:19.72615     list_method, count, term))
2016-04-03_19:13:19.72618 ValueError: list_nodes returned 0 results for 'compute-gio6sri3qlzh56s-su92l'
2016-04-03_19:13:19.72636 2016-04-03 19:13:19 ComputeNodeSetupActor.fd6af6cd3607[969] ERROR: Actor error list_nodes returned 0 results for 'compute-gio6sri3qlzh56s-su92l'
2016-04-03_19:13:19.72694 2016-04-03 19:13:19 ComputeNodeSetupActor.fd6af6cd3607[969] INFO: finished
2016-04-03_19:13:19.75696 2016-04-03 19:13:19 pykka[969] DEBUG: Unregistered ComputeNodeSetupActor (urn:uuid:48a64720-1122-442c-a0e4-fd6af6cd3607)
2016-04-03_19:13:19.75717 2016-04-03 19:13:19 pykka[969] DEBUG: Stopped ComputeNodeSetupActor (urn:uuid:48a64720-1122-442c-a0e4-fd6af6cd3607)
2016-04-03_19:13:20.04752 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D14_v2: wishlist 21, up 32 (booting 20, idle 0, busy 12), shutting down 0
2016-04-03_19:13:20.04799 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.04841 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.04878 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.04910 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.04993 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D3_v2: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0
2016-04-03_19:13:20.05028 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.05069 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.05104 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.54790 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D14_v2: wishlist 21, up 32 (booting 20, idle 0, busy 12), shutting down 0
2016-04-03_19:13:20.54832 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.54863 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.54898 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.54928 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.55010 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D3_v2: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0
2016-04-03_19:13:20.55040 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.55070 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.55099 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.98113 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D14_v2: wishlist 21, up 32 (booting 20, idle 0, busy 12), shutting down 0
2016-04-03_19:13:20.98176 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.98210 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.98793 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.98825 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.99258 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D3_v2: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0
2016-04-03_19:13:20.99292 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.99324 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:20.99353 2016-04-03 19:13:20 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.31842 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Cloud node compute-la5o2ppkzjbgjqo-su92l is now paired with Arvados node su92l-7ekkf-la5o2ppkzjbgjqo
2016-04-03_19:13:21.31881 2016-04-03 19:13:21 ArvadosNodeListMonitorActor.140211098648048[969] DEBUG: urn:uuid:87c5fa66-6fa9-4f0c-b05e-e1af316c153d subscribed to events for 'su92l-7ekkf-la5o2ppkzjbgjqo'
2016-04-03_19:13:21.42428 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Cloud node compute-k1ut7sdnzdgzcpw-su92l is now paired with Arvados node su92l-7ekkf-k1ut7sdnzdgzcpw
2016-04-03_19:13:21.42461 2016-04-03 19:13:21 ArvadosNodeListMonitorActor.140211098648048[969] DEBUG: urn:uuid:b1b7a77e-fa38-4f71-84e4-c1342309b33a subscribed to events for 'su92l-7ekkf-k1ut7sdnzdgzcpw'
2016-04-03_19:13:21.44957 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D14_v2: wishlist 19, up 32 (booting 18, idle 2, busy 12), shutting down 0
2016-04-03_19:13:21.45117 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.45149 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.45185 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.45214 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.45614 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D3_v2: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0
2016-04-03_19:13:21.45648 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.45720 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.45751 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.46746 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D14_v2: wishlist 19, up 32 (booting 18, idle 2, busy 12), shutting down 0
2016-04-03_19:13:21.46784 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D5_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.46815 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D13_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.46846 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D4_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.46878 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D12_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.46960 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D3_v2: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0
2016-04-03_19:13:21.46991 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D11_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.47019 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D2_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.47048 2016-04-03 19:13:21 NodeManagerDaemonActor.99e153f809d7[969] INFO: Standard_D1_v2: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-04-03_19:13:21.47503 2016-04-03 19:13:21 pykka[969] DEBUG: Unregistered ComputeNodeSetupActor (urn:uuid:ce8f8179-edc8-4576-ae14-3aa00c9d7757)
2016-04-03_19:13:21.47510 2016-04-03 19:13:21 pykka[969] DEBUG: Stopped ComputeNodeSetupActor (urn:uuid:ce8f8179-edc8-4576-ae14-3aa00c9d7757)
2016-04-03_19:13:21.48711 2016-04-03 19:13:21 pykka[969] ERROR: Unhandled exception in NodeManagerDaemonActor (urn:uuid:16c7553a-0c6e-4bda-b8d3-99e153f809d7):
2016-04-03_19:13:21.48718 Traceback (most recent call last):
2016-04-03_19:13:21.48723   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop
2016-04-03_19:13:21.48727     response = self._handle_receive(message)
2016-04-03_19:13:21.48731   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive
2016-04-03_19:13:21.48734     return callee(*message['args'], **message['kwargs'])
2016-04-03_19:13:21.48738   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 379, in node_up
2016-04-03_19:13:21.48741     cloud_node = setup_proxy.cloud_node.get()
2016-04-03_19:13:21.48745   File "/usr/lib/python2.7/dist-packages/pykka/future.py", line 299, in get
2016-04-03_19:13:21.48748     exec('raise exc_info[0], exc_info[1], exc_info[2]')
2016-04-03_19:13:21.48751   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 470, in ask
2016-04-03_19:13:21.48755     self.tell(message)
2016-04-03_19:13:21.48758   File "/usr/lib/python2.7/dist-packages/pykka/actor.py", line 437, in tell
2016-04-03_19:13:21.48763     raise _ActorDeadError('%s not found' % self)
2016-04-03_19:13:21.48773 ActorDeadError: ComputeNodeSetupActor (urn:uuid:48a64720-1122-442c-a0e4-fd6af6cd3607) not found
2016-04-03_19:13:21.48799 2016-04-03 19:13:21 pykka[969] DEBUG: Unregistered NodeManagerDaemonActor (urn:uuid:16c7553a-0c6e-4bda-b8d3-99e153f809d7)
2016-04-03_19:13:22.17183 2016-04-03 19:13:22 JobQueueMonitorActor.140211088250880[969] DEBUG: sending request
2016-04-03_19:13:22.17197 2016-04-03 19:13:22 ArvadosNodeListMonitorActor.140211098648048[969] DEBUG: sending request
2016-04-03_19:13:22.18447 2016-04-03 19:13:22 ComputeNodeSetupActor.4b3b36d9472c[969] WARNING: Re-raising unknown error (no retry): list_nodes returned 0 results for 'compute-jvh1ctbr59fbt2x-su92l'
2016-04-03_19:13:22.18455 Traceback (most recent call last):
2016-04-03_19:13:22.18460   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/__init__.py", line 74, in retry_wrapper
2016-04-03_19:13:22.18464     ret = orig_func(self, *args, **kwargs)
2016-04-03_19:13:22.18468   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 122, in create_cloud_node
2016-04-03_19:13:22.18471     self.arvados_node)
2016-04-03_19:13:22.18475   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 153, in create_node
2016-04-03_19:13:22.18478     node = self.search_for(kwargs['name'], 'list_nodes', self._name_key)
2016-04-03_19:13:22.18481   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 103, in search_for
2016-04-03_19:13:22.18485     list_method, count, term))
2016-04-03_19:13:22.18488 ValueError: list_nodes returned 0 results for 'compute-jvh1ctbr59fbt2x-su92l'
2016-04-03_19:13:22.18505 2016-04-03 19:13:22 ComputeNodeSetupActor.4b3b36d9472c[969] ERROR: Actor error list_nodes returned 0 results for 'compute-jvh1ctbr59fbt2x-su92l'
2016-04-03_19:13:22.18560 2016-04-03 19:13:22 ComputeNodeSetupActor.4b3b36d9472c[969] INFO: finished

Once the DeamonActor dies, nothing much else works anymore.

#2 Updated by Brett Smith about 5 years ago

  • Target version set to Arvados Future Sprints

#3 Updated by Brett Smith about 5 years ago

  • Subject changed from [Nodemanager] list_nodes traceback, stops booting nodes to [Node Manager] On node creation, node search fails, and unhandled exceptions cascade up to the Daemon Actor
  • Category set to Node Manager

#4 Updated by Brett Smith about 5 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith
  • Target version changed from Arvados Future Sprints to 2016-04-13 sprint

#5 Updated by Peter Amstutz about 5 years ago

Thanks. When wrote this I didn't look hard enough at what search_for() actually did.

Nits (optional):

The docstring on search_for doesn't document its caching behavior.

    def cloud_name_from_arv_node(self, arv_node):

Instead of adding this to the test classes, couldn't this be DRYd up by adding them to the actual ComputeNodeDriver classes so that both arvados_create_kwargs() and the tests can use same the code?

The rest looks good to me (please go ahead and merge).

#6 Updated by Brett Smith about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:68d551b65a837664be72bf08aad55ab76d778d07.

Also available in: Atom PDF