Bug #8913
closed
[Nodemanager] On GCE: 'unicode' object has no attribute 'id', where we should have a NodeSize
Description
This happened in qr2hi. (I don't know whether these exceptions are the cause of the manager being wedged.) I restarted the service and the nodes were created.
# grep Traceback arvados-node-manager/log/main/current -A28
2016-04-08_18:00:17.44134 Traceback (most recent call last):
2016-04-08_18:00:17.44134   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 128, in main
2016-04-08_18:00:17.44135     signal.pause()
2016-04-08_18:00:17.44136   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 90, in shutdown_signal
2016-04-08_18:00:17.44136     node_daemon.shutdown()
2016-04-08_18:00:17.44136   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/baseactor.py", line 25, in __call__
2016-04-08_18:00:17.44137     self.actor_ref.tell(message)
2016-04-08_18:00:17.44137   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 398, in tell
2016-04-08_18:00:17.44137     raise ActorDeadError('%s not found' % self)
2016-04-08_18:00:17.44137 ActorDeadError: NodeManagerDaemonActor (urn:uuid:e9844486-0662-4b73-bc46-8e64f57ac168) not found
2016-04-08_18:00:17.44211 2016-04-08 18:00:17 pykka[29660] DEBUG: Unregistered ComputeNodeMonitorActor (urn:uuid:1c85ed8e-3b54-43fb-80eb-9cd3a5a9738f)
Updated by Brett Smith over 8 years ago
The last traceback you pasted, the one you based the subject on, is #6225.
The ActorDeadError above that is more interesting; that's almost always going to be a problem. More logs from before that point would be good to see.
Updated by Nico César over 8 years ago
- File @400000005707f6132670d7dc.s added
- Project changed from Arvados to 35
- Subject changed from [Nodemanager] GCE returns "Supplied fingerprint does not match current metadata fingerprint" to [Nodemanager] GCE returns "ActorDead"
Updated by Nico César over 8 years ago
- Description updated (diff)
Yes... I guess the fingerprint is irrelevant. Probably we should not transform that traceback into a WARNING or something.
I added a log that has the ActorDead error. Moved to Arvados private just because it has a log.
Updated by Brett Smith over 8 years ago
- Project changed from 35 to Arvados
- Subject changed from [Nodemanager] GCE returns "ActorDead" to [Nodemanager] 'unicode' object has no attribute 'id'
The original error was aaallllll the way back here:
2016-04-06_16:52:29.77830 2016-04-06 16:52:29 NodeManagerDaemonActor.8e64f57ac168[29660] ERROR: while calculating nodes wanted for size <arvnodeman.jobqueue.CloudSizeWrapper object at 0x261ce90>
2016-04-06_16:52:29.77831 Traceback (most recent call last):
2016-04-06_16:52:29.77831   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 326, in update_server_wishlist
2016-04-06_16:52:29.77831     nodes_wanted = self._nodes_wanted(size)
2016-04-06_16:52:29.77831   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 285, in _nodes_wanted
2016-04-06_16:52:29.77832     total_price = self._total_price()
2016-04-06_16:52:29.77833   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 250, in _total_price
2016-04-06_16:52:29.77834     for i in (self.booted, self.cloud_nodes.nodes)
2016-04-06_16:52:29.77834   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 251, in <genexpr>
2016-04-06_16:52:29.77834     for c in i.itervalues())
2016-04-06_16:52:29.77835 AttributeError: 'unicode' object has no attribute 'id'
From this point on, the daemon actor was dead. The traceback in the description only happened after someone tried to stop the process, and the stopping process failed because the daemon was already dead--the exception came from the signal handler.
Updated by Brett Smith over 8 years ago
- Subject changed from [Nodemanager] 'unicode' object has no attribute 'id' to [Nodemanager] On GCE: 'unicode' object has no attribute 'id', where we should have a NodeSize
Updated by Brett Smith over 8 years ago
This is a bug introduced by #8872. The node returned by search_for doesn't have its size attribute fixed.
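To illustrate the failure mode, here is a minimal sketch, not Node Manager's actual code. It assumes the driver can hand back a node whose size is a bare ID string rather than a NodeSize object. The names Node, NodeSize, SIZES_BY_ID, fix_size and total_price are hypothetical stand-ins, and the sketch is written for Python 3, where the error names 'str' instead of 'unicode'.

```python
from collections import namedtuple

# Hypothetical stand-ins for the cloud driver's objects; not Node Manager's real classes.
NodeSize = namedtuple("NodeSize", ["id", "price"])

class Node:
    def __init__(self, name, size):
        self.name = name
        self.size = size  # expected: NodeSize; on the buggy path: a bare size ID string

SIZES_BY_ID = {"n1-standard-1": NodeSize(id="n1-standard-1", price=0.05)}

def fix_size(node, sizes_by_id):
    """Replace a bare size ID string with the matching NodeSize object."""
    if isinstance(node.size, str):
        node.size = sizes_by_id[node.size]
    return node

def total_price(nodes, sizes_by_id):
    # Like the failing generator expression, this dereferences size.id on every node.
    return sum(sizes_by_id[node.size.id].price for node in nodes)

listed = fix_size(Node("compute0", "n1-standard-1"), SIZES_BY_ID)
searched = Node("compute1", "n1-standard-1")  # looked up separately, size left as a string

try:
    total_price([listed, searched], SIZES_BY_ID)
    error = None
except AttributeError as exc:
    error = str(exc)  # Python 3 reports 'str' where Python 2.7 reported 'unicode'

fixed_total = total_price([listed, fix_size(searched, SIZES_BY_ID)], SIZES_BY_ID)
```

The point of the sketch is that every code path producing a node has to apply the same size fix-up; one unfixed path is enough to break any later code that assumes a NodeSize.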
Updated by Brett Smith over 8 years ago
- Status changed from New to In Progress
- Assigned To set to Brett Smith
- Target version set to 2016-04-13 sprint
Updated by Peter Amstutz over 8 years ago
Brett Smith wrote:
The original error was aaallllll the way back here:
[...]
From this point on, the daemon actor was dead. The traceback in the description only happened after someone tried to stop the process, and the stopping process failed because the daemon was already dead--the exception came from the signal handler.
Related to this, perhaps on_failure() should kill self on all unhandled exceptions and not just certain ones? Currently the policy is to handle recoverable exceptions before it gets to on_failure(), so once an exception gets to on_failure() it means an actor is going to die unexpectedly, which generally results in node manager getting wedged. (Filed a separate report #8932)
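A rough sketch of that policy follows, under stated assumptions. The on_failure(exception_type, exception_value, traceback) signature follows Pykka's actor hook, but FailFastMixin, _terminate and RecordingActor are hypothetical names, and the real fix may terminate the process differently. The termination step is a separate method so it can be overridden without actually killing anything.

```python
import logging
import os
import signal

class FailFastMixin:
    """Hypothetical mixin: treat every exception that reaches on_failure() as fatal."""

    def _terminate(self):
        # Assumption: ending the whole process is preferable to running with a dead actor.
        os.kill(os.getpid(), signal.SIGTERM)

    def on_failure(self, exception_type, exception_value, traceback):
        logging.getLogger(type(self).__name__).critical(
            "unhandled %s: %s; terminating",
            exception_type.__name__, exception_value)
        self._terminate()

class RecordingActor(FailFastMixin):
    """Test double that records the termination request instead of signalling."""

    def __init__(self):
        self.terminated = False

    def _terminate(self):
        self.terminated = True

actor = RecordingActor()
try:
    raise AttributeError("'unicode' object has no attribute 'id'")
except AttributeError as exc:
    actor.on_failure(type(exc), exc, exc.__traceback__)
```

With a policy like this, a supervisor can restart the service promptly instead of leaving it wedged until someone notices.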
Updated by Peter Amstutz over 8 years ago
The fix in 8912-node-manager-patch-nodes-wip 8db9ad8 LGTM.
Updated by Brett Smith over 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:788b8d7247da8c4592b1f9d482fff4e1509f57f3.