Project

General

Profile

Actions

Bug #5714

closed

[Node Manager] Diagnose and fix bugs on GCE

Added by Brett Smith about 9 years ago. Updated about 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Node Manager
Target version:
Story points:
-

Description

Right now Node Manager keeps getting into a state where a ComputeNodeSetupActor dies, and then things go south from there. Diagnose and fix the issue so Node Manager is more reliable on GCE.


Subtasks 1 (0 open1 closed)

Task #5715: Review 5714-gce-setup-bugfixes-wipResolvedWard Vandewege04/13/2015Actions
Actions #1

Updated by Brett Smith about 9 years ago

  • Project changed from 40 to Arvados
  • Category set to Node Manager
Actions #2

Updated by Brett Smith about 9 years ago

5714-gce-setup-bugfixes-wip is up for review. It fixes a clear error retry problem, and avoids a potential race condition around shutting down ComputeNodeSetupActors (which might be where the ActorDeadErrors we're seeing in the logs are coming from). See commit messages for more detailed rationale.

Actions #3

Updated by Brett Smith about 9 years ago

It seems very possible that the race condition is the real source of our trouble. I just caught this happening again on su92l. Here was the series of events:

  • Jobs came in that caused Node Manager to request many nodes.
  • Eventually we hit our CPUs quota on GCP. This caused ComputeNodeSetupActors to start taking a longer time than usual to get their work done.
  • But with nodes that were created successfully, and jobs finishing, we eventually got to a point where no more nodes were necessary. This got Node Manager to abort setup processes:
    2015-04-13_17:10:34.70669 2015-04-13 17:10:34 arvnodeman.jobqueue[594] DEBUG: JobQueueMonitorActor (at 50569808) sending poll
    2015-04-13_17:10:34.70914 2015-04-13 17:10:34 arvnodeman.arvados_nodes[594] DEBUG: ArvadosNodeListMonitorActor (at 44869072) sending poll
    2015-04-13_17:10:34.70960 2015-04-13 17:10:34 arvnodeman.cloud_nodes[594] DEBUG: CloudNodeListMonitorActor (at 43054288) sending poll
    2015-04-13_17:10:34.76389 2015-04-13 17:10:34 arvnodeman.jobqueue[594] DEBUG: Sending server wishlist: (empty)
    2015-04-13_17:10:34.76400 2015-04-13 17:10:34 arvnodeman.jobqueue[594] DEBUG: JobQueueMonitorActor (at 50569808) got response with 0 items
    2015-04-13_17:10:34.78720 2015-04-13 17:10:34 pykka[594] DEBUG: Unregistered ComputeNodeSetupActor (urn:uuid:4ee1d47a-923c-462c-8b16-ade4b24d
    95d4)
    2015-04-13_17:10:34.78722 2015-04-13 17:10:34 pykka[594] DEBUG: Stopped ComputeNodeSetupActor (urn:uuid:4ee1d47a-923c-462c-8b16-ade4b24d95d4)
    2015-04-13_17:10:34.79757 2015-04-13 17:10:34 pykka[594] DEBUG: Unregistered ComputeNodeSetupActor (urn:uuid:4d7a01a2-3f7d-44b9-83ea-ee3ea50e36b1)
    2015-04-13_17:10:34.79804 2015-04-13 17:10:34 pykka[594] DEBUG: Unregistered ComputeNodeSetupActor (urn:uuid:239a4d57-d0ef-4ff3-a803-27afc0696388)
    2015-04-13_17:10:34.79831 2015-04-13 17:10:34 pykka[594] DEBUG: Stopped ComputeNodeSetupActor (urn:uuid:4d7a01a2-3f7d-44b9-83ea-ee3ea50e36b1)
    2015-04-13_17:10:34.79838 2015-04-13 17:10:34 pykka[594] DEBUG: Stopped ComputeNodeSetupActor (urn:uuid:239a4d57-d0ef-4ff3-a803-27afc0696388)
    
  • But Node Manager kept trying to abort one of those setup actors:
    2015-04-13_17:13:34.75341 2015-04-13 17:13:34 pykka[594] DEBUG: Exception returned from NodeManagerDaemonActor (urn:uuid:3fc1a51e-f0eb-45ab-b
    ac1-91a894ce484e) to caller:
    2015-04-13_17:13:34.75342 Traceback (most recent call last):
    2015-04-13_17:13:34.75343   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 200, in _actor_loop
    2015-04-13_17:13:34.75343     response = self._handle_receive(message)
    2015-04-13_17:13:34.75344   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 294, in _handle_receive
    2015-04-13_17:13:34.75344     return callee(*message['args'], **message['kwargs'])
    2015-04-13_17:13:34.75345   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 251, in wrapper
    2015-04-13_17:13:34.75345     return orig_func(self, *args, **kwargs)
    2015-04-13_17:13:34.75345   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 302, in stop_booting_node
    2015-04-13_17:13:34.75345     node.stop_if_no_cloud_node().get()
    2015-04-13_17:13:34.75346   File "/usr/local/lib/python2.7/dist-packages/pykka/future.py", line 299, in get
    2015-04-13_17:13:34.75346     exec('raise exc_info[0], exc_info[1], exc_info[2]')
    2015-04-13_17:13:34.75346   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 470, in ask
    2015-04-13_17:13:34.75346     self.tell(message)
    2015-04-13_17:13:34.75347   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 437, in tell
    2015-04-13_17:13:34.75347     raise _ActorDeadError('%s not found' % self)
    2015-04-13_17:13:34.75348 ActorDeadError: ComputeNodeSetupActor (urn:uuid:4d7a01a2-3f7d-44b9-83ea-ee3ea50e36b1) not found
    

There are only two places where the daemon stops a setup actor: node_up, and stop_booting_node. node_up unconditionally stops the actor and removes it from self.booting. stop_booting_node previously sent a message and then did a check before removing it from self.booting. The patch in the branch closes the gap, and so I'm more confident that it fixes the real bug we're seeing now. Especially since Node Manager stopped two setup actors, but only kept sending messages to one of them—that suggests that there's a race condition at fault, rather than an unconditional code bug.

Actions #4

Updated by Ward Vandewege about 9 years ago

5714-gce-setup-bugfixes-wip LGTM

Actions #5

Updated by Brett Smith about 9 years ago

  • Status changed from New to Resolved

Applied in changeset arvados|commit:0d8d66df56992a39cc032ba482e1ff88de7f22ab.

Actions

Also available in: Atom PDF