Bug #8913

[Nodemanager] On GCE: 'unicode' object has no attribute 'id', where we should have a NodeSize

Added by Nico César about 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
-
Target version:
Start date:
04/08/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

This happened in qr2hi. (I don't know whether these exceptions are the cause of the manager being wedged.) I restarted the service and the nodes were created.

# grep Traceback arvados-node-manager/log/main/current  -A28
2016-04-08_18:00:17.44134 Traceback (most recent call last):
2016-04-08_18:00:17.44134   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 128, in main
2016-04-08_18:00:17.44135     signal.pause()
2016-04-08_18:00:17.44136   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 90, in shutdown_signal
2016-04-08_18:00:17.44136     node_daemon.shutdown()
2016-04-08_18:00:17.44136   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/baseactor.py", line 25, in __call__
2016-04-08_18:00:17.44137     self.actor_ref.tell(message)
2016-04-08_18:00:17.44137   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 398, in tell
2016-04-08_18:00:17.44137     raise ActorDeadError('%s not found' % self)
2016-04-08_18:00:17.44137 ActorDeadError: NodeManagerDaemonActor (urn:uuid:e9844486-0662-4b73-bc46-8e64f57ac168) not found
2016-04-08_18:00:17.44211 2016-04-08 18:00:17 pykka[29660] DEBUG: Unregistered ComputeNodeMonitorActor (urn:uuid:1c85ed8e-3b54-43fb-80eb-9cd3a5a9738f)
@400000005707f6132670d7dc.s (88.2 KB) Nico César, 04/08/2016 06:22 PM

Subtasks

Task #8923: Review 8912-node-manager-patch-nodes-wip (Resolved, Peter Amstutz)

Associated revisions

Revision 788b8d72
Added by Brett Smith about 5 years ago

Merge branch '8912-node-manager-patch-nodes-wip'

Closes #8913, #8923. (The branch name has a typo.)

History

#1 Updated by Brett Smith about 5 years ago

The last traceback you pasted, the one you based the subject on, is #6225.

The ActorDeadError above that is more interesting: that's almost always going to be a problem. More logs from before that would be good to see.

#2 Updated by Nico César about 5 years ago

  • File @400000005707f6132670d7dc.s added
  • Project changed from Arvados to Arvados Private
  • Subject changed from [Nodemanager] GCE returns "Supplied fingerprint does not match current metadata fingerprint" to [Nodemanager] GCE returns "ActorDead"

#3 Updated by Nico César about 5 years ago

  • Description updated (diff)

Yes... I guess the fingerprint is irrelevant. Probably we should not transform that traceback into a WARNING or something.

I added a log that has the ActorDead. Moved to Arvados Private just because it has a log.

#4 Updated by Brett Smith about 5 years ago

  • Project changed from Arvados Private to Arvados
  • Subject changed from [Nodemanager] GCE returns "ActorDead" to [Nodemanager] 'unicode' object has no attribute 'id'

The original error was aaallllll the way back here:

2016-04-06_16:52:29.77830 2016-04-06 16:52:29 NodeManagerDaemonActor.8e64f57ac168[29660] ERROR: while calculating nodes wanted for size <arvnodeman.jobqueue.CloudSizeWrapper object at 0x261ce90>
2016-04-06_16:52:29.77831 Traceback (most recent call last):
2016-04-06_16:52:29.77831   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 326, in update_server_wishlist
2016-04-06_16:52:29.77831     nodes_wanted = self._nodes_wanted(size)
2016-04-06_16:52:29.77831   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 285, in _nodes_wanted
2016-04-06_16:52:29.77832     total_price = self._total_price()
2016-04-06_16:52:29.77833   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 250, in _total_price
2016-04-06_16:52:29.77834     for i in (self.booted, self.cloud_nodes.nodes)
2016-04-06_16:52:29.77834   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 251, in <genexpr>
2016-04-06_16:52:29.77834     for c in i.itervalues())
2016-04-06_16:52:29.77835 AttributeError: 'unicode' object has no attribute 'id'

From this point on, the daemon actor was dead. The traceback in the description only happened after someone tried to stop the process, and the stopping process failed because the daemon was already dead--the exception came from the signal handler.
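To illustrate why the shutdown attempt itself raised, here is a minimal sketch in plain Python. It does not use pykka or the real arvnodeman code: `ActorDeadError`, `FakeActorRef`, and `shutdown_signal` below are hypothetical stand-ins that only mimic the behaviour visible in the traceback (telling a dead actor raises). The handler shown is one possible way to tolerate an already-dead daemon, not what the launcher actually does.

```python
class ActorDeadError(Exception):
    """Hypothetical local stand-in for the error pykka raises."""


class FakeActorRef:
    """Hypothetical stand-in for an actor reference."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def tell(self, message):
        # Mimics the traceback: messaging a dead actor raises.
        if not self.alive:
            raise ActorDeadError('%s not found' % self.name)
        return message


def shutdown_signal(daemon_ref, log):
    """Ask the daemon to shut down; report instead of crashing if it is dead."""
    try:
        daemon_ref.tell({'command': 'shutdown'})
        return True
    except ActorDeadError as error:
        log.append('daemon already dead: %s' % error)
        return False
```

With a handler along these lines, a dead daemon would show up as a single log line during shutdown rather than as a second traceback that hides the original failure.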

#5 Updated by Brett Smith about 5 years ago

  • Subject changed from [Nodemanager] 'unicode' object has no attribute 'id' to [Nodemanager] On GCE: 'unicode' object has no attribute 'id', where we should have a NodeSize

#6 Updated by Brett Smith about 5 years ago

This is a bug introduced by #8872. The node returned by search_for doesn't have its size attribute fixed.
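The failure mode can be sketched as follows. This is an illustration under assumptions, not the actual patch: `NodeSize`, `CloudNode`, `fix_size`, and `total_price` are simplified hypothetical stand-ins, and the size name and price are made up. The sketch shows a node whose `size` is still a bare ID string breaking a price calculation that expects a size object, and how normalizing the attribute avoids that. (On Python 3 the error message names `str` rather than `unicode`.)

```python
from collections import namedtuple

# Hypothetical, simplified size record; the price is illustrative only.
NodeSize = namedtuple('NodeSize', ['id', 'price'])


class CloudNode:
    """Hypothetical stand-in for a cloud node record."""

    def __init__(self, name, size):
        self.name = name
        self.size = size  # may be a NodeSize or, in the buggy case, a string


def fix_size(node, sizes_by_id):
    """Replace a bare size ID string with the matching NodeSize object."""
    if isinstance(node.size, str):
        node.size = sizes_by_id[node.size]
    return node


def total_price(nodes, sizes_by_id):
    """Sum prices; assumes every node.size has an .id attribute."""
    return sum(sizes_by_id[node.size.id].price for node in nodes)
```

Calling `total_price` on a node that skipped `fix_size` raises `AttributeError`, which matches the shape of the error in the log above.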

#7 Updated by Brett Smith about 5 years ago

  • Status changed from New to In Progress
  • Assigned To set to Brett Smith
  • Target version set to 2016-04-13 sprint

#8 Updated by Peter Amstutz about 5 years ago

Brett Smith wrote:

The original error was aaallllll the way back here:

[...]

From this point on, the daemon actor was dead. The traceback in the description only happened after someone tried to stop the process, and the stopping process failed because the daemon was already dead--the exception came from the signal handler.

Related to this, perhaps on_failure() should kill self on all unhandled exceptions and not just certain ones? Currently the policy is to handle recoverable exceptions before it gets to on_failure(), so once an exception gets to on_failure() it means an actor is going to die unexpectedly, which generally results in node manager getting wedged. (Filed a separate report #8932)
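The proposed policy could look roughly like the sketch below. It is a standard-library stand-in, not pykka and not Node Manager code: `MiniActor` and its `on_fatal` callback are hypothetical names. The point is only that any exception reaching `on_failure()` is treated as fatal and reported immediately, instead of leaving a silently dead actor behind.

```python
import queue
import threading


class MiniActor:
    """Hypothetical tiny actor: one thread draining a mailbox."""

    def __init__(self, handler, on_fatal):
        self._handler = handler
        self._on_fatal = on_fatal  # called once with the unhandled exception
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        self._mailbox.put(message)

    def join(self, timeout=None):
        self._thread.join(timeout)

    def on_failure(self, error):
        # Policy from the comment above: anything that gets here is fatal.
        self._on_fatal(error)

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is None:  # sentinel: clean stop
                return
            try:
                self._handler(message)
            except Exception as error:
                self.on_failure(error)
                return
```

In a real service, `on_fatal` could terminate the whole process so that a supervisor restarts it; here it is left injectable so the behaviour can be exercised safely.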

#9 Updated by Peter Amstutz about 5 years ago

The fix in 8912-node-manager-patch-nodes-wip 8db9ad8 LGTM.

#10 Updated by Brett Smith about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:788b8d7247da8c4592b1f9d482fff4e1509f57f3.
