Bug #5714
closed[Node Manager] Diagnose and fix bugs on GCE
Description
Right now Node Manager keeps getting into a state where a ComputeNodeSetupActor dies, and then things go south from there. Diagnose and fix the issue so Node Manager is more reliable on GCE.
Updated by Brett Smith over 9 years ago
- Project changed from 40 to Arvados
- Category set to Node Manager
Updated by Brett Smith over 9 years ago
5714-gce-setup-bugfixes-wip is up for review. It fixes a clear error retry problem, and avoids a potential race condition around shutting down ComputeNodeSetupActors (which might be where the ActorDeadErrors we're seeing in the logs are coming from). See commit messages for more detailed rationale.
Updated by Brett Smith over 9 years ago
It seems very possible that the race condition is the real source of our trouble. I just caught this happening again on su92l. Here was the series of events:
- Jobs came in that caused Node Manager to request many nodes.
- Eventually we hit our CPUs quota on GCP. This caused ComputeNodeSetupActors to start taking a longer time than usual to get their work done.
- But with nodes that were created successfully, and jobs finishing, we eventually got to a point where no more nodes were necessary. This got Node Manager to abort setup processes:
2015-04-13_17:10:34.70669 2015-04-13 17:10:34 arvnodeman.jobqueue[594] DEBUG: JobQueueMonitorActor (at 50569808) sending poll 2015-04-13_17:10:34.70914 2015-04-13 17:10:34 arvnodeman.arvados_nodes[594] DEBUG: ArvadosNodeListMonitorActor (at 44869072) sending poll 2015-04-13_17:10:34.70960 2015-04-13 17:10:34 arvnodeman.cloud_nodes[594] DEBUG: CloudNodeListMonitorActor (at 43054288) sending poll 2015-04-13_17:10:34.76389 2015-04-13 17:10:34 arvnodeman.jobqueue[594] DEBUG: Sending server wishlist: (empty) 2015-04-13_17:10:34.76400 2015-04-13 17:10:34 arvnodeman.jobqueue[594] DEBUG: JobQueueMonitorActor (at 50569808) got response with 0 items 2015-04-13_17:10:34.78720 2015-04-13 17:10:34 pykka[594] DEBUG: Unregistered ComputeNodeSetupActor (urn:uuid:4ee1d47a-923c-462c-8b16-ade4b24d 95d4) 2015-04-13_17:10:34.78722 2015-04-13 17:10:34 pykka[594] DEBUG: Stopped ComputeNodeSetupActor (urn:uuid:4ee1d47a-923c-462c-8b16-ade4b24d95d4) 2015-04-13_17:10:34.79757 2015-04-13 17:10:34 pykka[594] DEBUG: Unregistered ComputeNodeSetupActor (urn:uuid:4d7a01a2-3f7d-44b9-83ea-ee3ea50e36b1) 2015-04-13_17:10:34.79804 2015-04-13 17:10:34 pykka[594] DEBUG: Unregistered ComputeNodeSetupActor (urn:uuid:239a4d57-d0ef-4ff3-a803-27afc0696388) 2015-04-13_17:10:34.79831 2015-04-13 17:10:34 pykka[594] DEBUG: Stopped ComputeNodeSetupActor (urn:uuid:4d7a01a2-3f7d-44b9-83ea-ee3ea50e36b1) 2015-04-13_17:10:34.79838 2015-04-13 17:10:34 pykka[594] DEBUG: Stopped ComputeNodeSetupActor (urn:uuid:239a4d57-d0ef-4ff3-a803-27afc0696388)
- But Node Manager kept trying to abort one of those setup actors:
2015-04-13_17:13:34.75341 2015-04-13 17:13:34 pykka[594] DEBUG: Exception returned from NodeManagerDaemonActor (urn:uuid:3fc1a51e-f0eb-45ab-b ac1-91a894ce484e) to caller: 2015-04-13_17:13:34.75342 Traceback (most recent call last): 2015-04-13_17:13:34.75343 File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop 2015-04-13_17:13:34.75343 response = self._handle_receive(message) 2015-04-13_17:13:34.75344 File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive 2015-04-13_17:13:34.75344 return callee(*message['args'], **message['kwargs']) 2015-04-13_17:13:34.75345 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 251, in wrapper 2015-04-13_17:13:34.75345 return orig_func(self, *args, **kwargs) 2015-04-13_17:13:34.75345 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 302, in stop_booting_node 2015-04-13_17:13:34.75345 node.stop_if_no_cloud_node().get() 2015-04-13_17:13:34.75346 File "/usr/local/lib/python2.7/dist-packages/pykka/future.py", line 299, in get 2015-04-13_17:13:34.75346 exec('raise exc_info[0], exc_info[1], exc_info[2]') 2015-04-13_17:13:34.75346 File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 470, in ask 2015-04-13_17:13:34.75346 self.tell(message) 2015-04-13_17:13:34.75347 File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 437, in tell 2015-04-13_17:13:34.75347 raise _ActorDeadError('%s not found' % self) 2015-04-13_17:13:34.75348 ActorDeadError: ComputeNodeSetupActor (urn:uuid:4d7a01a2-3f7d-44b9-83ea-ee3ea50e36b1) not found
There are only two places where the daemon stops a setup actor: node_up
, and stop_booting_node
. node_up
unconditionally stops the actor and removes it from self.booting
. stop_booting_node
previously sent a message and then did a check before removing it from self.booting
. The patch in the branch closes the gap, and so I'm more confident that it fixes the real bug we're seeing now. Especially since Node Manager stopped two setup actors, but only kept sending messages to one of them—that suggests that there's a race condition at fault, rather than an unconditional code bug.
Updated by Brett Smith over 9 years ago
- Status changed from New to Resolved
Applied in changeset arvados|commit:0d8d66df56992a39cc032ba482e1ff88de7f22ab.