Project

General

Profile

Actions

Bug #8205

closed

Node manager is not spinning up nodes.

Added by Nico César over 8 years ago. Updated almost 7 years ago.

Status:
Duplicate
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Story points:
-

Description

In a GCE cluster I got the following

2016-01-14_11:52:51.45821 2016-01-14 11:52:51 arvnodeman.jobqueue[20780] DEBUG: JobQueueMonitorActor (at 44990544) sending poll
2016-01-14_11:52:51.46085 2016-01-14 11:52:51 arvnodeman.arvados_nodes[20780] DEBUG: ArvadosNodeListMonitorActor (at 44613072) sending poll
2016-01-14_11:52:51.46445 2016-01-14 11:52:51 arvnodeman.cloud_nodes[20780] DEBUG: CloudNodeListMonitorActor (at 40368976) sending poll
2016-01-14_11:52:51.62923 2016-01-14 11:52:51 arvnodeman.jobqueue[20780] DEBUG: Sending server wishlist: n1-standard-16, n1-standard-16, n1-standard-1, n1-standard-1, n1-st
andard-1, n1-standard-1, n1-standard-1, n1-standard-1
2016-01-14_11:52:51.62925 2016-01-14 11:52:51 arvnodeman.jobqueue[20780] DEBUG: JobQueueMonitorActor (at 44990544) got response with 8 items
2016-01-14_11:52:51.83040 2016-01-14 11:52:51 arvnodeman.arvados_nodes[20780] DEBUG: ArvadosNodeListMonitorActor (at 44613072) got response with 40 items
2016-01-14_11:52:53.46915 2016-01-14 11:52:53 arvnodeman.nodeup[20780] INFO: Creating cloud node with size n1-standard-1.
2016-01-14_11:52:55.25560 2016-01-14 11:52:55 arvnodeman.cloud_nodes[20780] DEBUG: CloudNodeListMonitorActor (at 40368976) got response with 9 items
2016-01-14_11:52:57.57508 2016-01-14 11:52:57 arvnodeman.nodeup[20780] WARNING: Client error: u"The resource 'projects/curoverse-qr2hi/zones/europe-west1-d/instances/comput
e-mbgiqs508488t4x-qr2hi' already exists" - waiting 16 seconds
2016-01-14_11:52:59.95241 2016-01-14 11:52:59 arvnodeman.nodedown[20780] WARNING: Client error: <LibcloudError in None 'Job did not complete in 20 seconds'> - waiting 1 sec
onds
2016-01-14_11:52:59.95735 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-32: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.95876 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-16: idle nodes 3, wishlist size 2
2016-01-14_11:52:59.95927 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-8: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.95974 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-4: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.96017 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-2: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.96229 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-1: idle nodes 6, wishlist size 6
2016-01-14_11:52:59.97551 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-32: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.97693 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-16: idle nodes 3, wishlist size 2
2016-01-14_11:52:59.97733 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-8: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.97766 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-4: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.97798 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-2: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.97986 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-1: idle nodes 6, wishlist size 6
2016-01-14_11:52:59.99072 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-32: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.99196 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-16: idle nodes 3, wishlist size 2
2016-01-14_11:52:59.99233 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-8: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.99277 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-4: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.99319 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-2: idle nodes 0, wishlist size 0
2016-01-14_11:52:59.99523 2016-01-14 11:52:59 arvnodeman.daemon[20780] DEBUG: n1-standard-1: idle nodes 6, wishlist size 6
2016-01-14_11:53:00.49188 2016-01-14 11:53:00 arvnodeman.nodedown[20780] INFO: Cloud node 11421971399360328232 shutdown cancelled: shutdown window closed.
201

later in the log:

2016-01-14_11:53:06.67924 2016-01-14 11:53:06 pykka[20780] DEBUG: Exception returned from ComputeNodeUpdateActor (urn:uuid:73c0280a-a09b-46c7-9bbd-8b557fa12dea) to caller:
2016-01-14_11:53:06.67926 Traceback (most recent call last):
2016-01-14_11:53:06.67927   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 201, in _actor_loop
2016-01-14_11:53:06.67927     response = self._handle_receive(message)
2016-01-14_11:53:06.67927   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 295, in _handle_receive
2016-01-14_11:53:06.67928     return callee(*message['args'], **message['kwargs'])
2016-01-14_11:53:06.67928   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 287, in throttle_wrapper
2016-01-14_11:53:06.67929     result = orig_func(self, *args, **kwargs)
2016-01-14_11:53:06.67929   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 300, in sync_node
2016-01-14_11:53:06.67929     return self._cloud.sync_node(cloud_node, arvados_node)
2016-01-14_11:53:06.67929   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/gce.py", line 152, in sync_node
2016-01-14_11:53:06.67930     method='POST', data=metadata_req)
2016-01-14_11:53:06.67930   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py", line 937, in async_request
2016-01-14_11:53:06.67931     response = request(**kwargs)
2016-01-14_11:53:06.67931   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 120, in request
2016-01-14_11:53:06.67931     response = super(GCEConnection, self).request(*args, **kwargs)
2016-01-14_11:53:06.67932   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py", line 692, in request
2016-01-14_11:53:06.67932     *args, **kwargs)
2016-01-14_11:53:06.67932   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py", line 799, in request
2016-01-14_11:53:06.67932     response = responseCls(**kwargs)
2016-01-14_11:53:06.67933   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py", line 145, in __init__
2016-01-14_11:53:06.67933     self.object = self.parse_body()
2016-01-14_11:53:06.67933   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py", line 253, in parse_body
2016-01-14_11:53:06.67933     raise GoogleBaseError(message, self.status, code)
2016-01-14_11:53:06.67934 GoogleBaseError: u'Invalid fingerprint.'
Actions #1

Updated by Nico César over 8 years ago

According to #6666 it's a harmless bug, but I don't see the wishlist filling up and there are multiple PRECONDITION FAILED errors in the operations tab of GCE

{ "kind": "compute#operation", "id": "4374536936630954981", "name": "operation-1452665616574-52931120da630-930cd9a9-111cdb9d", 
"zone": "https://www.googleapis.com/compute/v1/projects/curoverse-qr2hi/zones/europe-west1-d",
 "operationType": "setMetadata", 
"targetLink": "https://www.googleapis.com/compute/v1/projects/curoverse-qr2hi/zones/europe-west1-d/instances/compute-igmykwyi9aphw62-qr2hi", "targetId": "13830135811226236730", 
"status": "DONE", "user": "node-manager@curoverse-qr2hi.iam.gserviceaccount.com", "progress": 100, "insertTime": "2016-01-12T22:13:36.885-08:00", 
"startTime": "2016-01-12T22:13:37.243-08:00", "endTime": "2016-01-12T22:13:37.541-08:00", "error": { "errors": [ { "code": "CONDITION_NOT_MET", 
"message": "Invalid fingerprint." } ] }, 
"httpErrorStatusCode": 412, "httpErrorMessage": "PRECONDITION FAILED", 
"selfLink": "https://www.googleapis.com/compute/v1/projects/curoverse-qr2hi/zones/europe-west1-d/operations/operation-1452665616574-52931120da630-930cd9a9-111cdb9d" }
Actions #2

Updated by Nico César over 8 years ago

  • Subject changed from "invalid fingerprint" to Node manager is not spinning up nodes.

Actually the fact that nodes arent coming up could be for many reasons.

This is another stacktrace:


2016-01-14_12:32:30.13301 2016-01-14 12:32:30 arvnodeman.nodedown[1599] INFO: Cloud node 2959009426029382928 shut down.
2016-01-14_12:32:30.13567 2016-01-14 12:32:30 pykka[1599] DEBUG: Unregistered ComputeNodeMonitorActor (urn:uuid:7556104b-d8b0-4c53-aa59-5659b4aafd75)
2016-01-14_12:32:30.13575 2016-01-14 12:32:30 pykka[1599] DEBUG: Stopped ComputeNodeMonitorActor (urn:uuid:7556104b-d8b0-4c53-aa59-5659b4aafd75)
2016-01-14_12:32:30.19862 2016-01-14 12:32:30 pykka[1599] DEBUG: Unregistered ComputeNodeShutdownActor (urn:uuid:d3f6ec70-5b19-4bc1-9baf-9c22114e331d)
2016-01-14_12:32:30.19874 2016-01-14 12:32:30 pykka[1599] DEBUG: Stopped ComputeNodeShutdownActor (urn:uuid:d3f6ec70-5b19-4bc1-9baf-9c22114e331d)
2016-01-14_12:32:30.19973 2016-01-14 12:32:30 pykka[1599] DEBUG: Exception returned from NodeManagerDaemonActor (urn:uuid:d211d590-d02f-4a94-b45e-6bd6000a0897) to caller:
2016-01-14_12:32:30.19974 Traceback (most recent call last):
2016-01-14_12:32:30.19974   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 201, in _actor_loop
2016-01-14_12:32:30.19974     response = self._handle_receive(message)
2016-01-14_12:32:30.19975   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 295, in _handle_receive
2016-01-14_12:32:30.19975     return callee(*message['args'], **message['kwargs'])
2016-01-14_12:32:30.19975   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 400, in node_finished_shutdown
2016-01-14_12:32:30.19976     shutdown_actor, 'cloud_node', 'success', 'cancel_reason')
2016-01-14_12:32:30.19977   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 347, in _get_actor_attrs
2016-01-14_12:32:30.19978     return pykka.get_all([getattr(actor, name) for name in attr_names])
2016-01-14_12:32:30.19978   File "/usr/local/lib/python2.7/dist-packages/pykka/future.py", line 273, in get_all
2016-01-14_12:32:30.19978     return [future.get(timeout=timeout) for future in futures]
2016-01-14_12:32:30.19979   File "/usr/local/lib/python2.7/dist-packages/pykka/threading.py", line 52, in get
2016-01-14_12:32:30.19979     compat.reraise(*self._data['exc_info'])
2016-01-14_12:32:30.19979   File "/usr/local/lib/python2.7/dist-packages/pykka/compat.py", line 12, in reraise
2016-01-14_12:32:30.19980     exec('raise tp, value, tb')
2016-01-14_12:32:30.19980   File "<string>", line 1, in <module>
2016-01-14_12:32:30.19980 ActorDeadError: ComputeNodeShutdownActor (urn:uuid:d3f6ec70-5b19-4bc1-9baf-9c22114e331d) stopped before handling the message

Actions #3

Updated by Ward Vandewege over 8 years ago

I think this is a duplicate of #8174

Actions #4

Updated by Tom Clegg almost 7 years ago

  • Status changed from New to Duplicate
Actions

Also available in: Atom PDF