Bug #8206

closed

[Node Manager] GCE compute node driver needs to retry I/O errors initializing libcloud driver

Added by Nico César about 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Start date: 01/14/2016
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: 1.0

Description

An SSL error during the call ending in max_total_price=config.getfloat('Daemon', 'max_total_price')).proxy() (launcher.py, line 125) stops the node manager.

Here is the stack trace:


2016-01-14_12:42:41.57638 Traceback (most recent call last):
2016-01-14_12:42:41.57641   File "/usr/local/bin/arvados-node-manager", line 6, in <module>
2016-01-14_12:42:41.57643     main()
2016-01-14_12:42:41.57643   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 125, in main
2016-01-14_12:42:41.59825     max_total_price=config.getfloat('Daemon', 'max_total_price')).proxy()
2016-01-14_12:42:41.59827   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 94, in start
2016-01-14_12:42:41.59827     obj = cls(*args, **kwargs)
2016-01-14_12:42:41.59829   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 123, in __init__
2016-01-14_12:42:41.59844     self._cloud_driver = self._new_cloud()
2016-01-14_12:42:41.59846   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/config.py", line 105, in new_cloud_client
2016-01-14_12:42:41.59846     self.get_section('Cloud Create'))
2016-01-14_12:42:41.59847   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/gce.py", line 36, in __init__
2016-01-14_12:42:41.59847     driver_class)
2016-01-14_12:42:41.59847   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 40, in __init__
2016-01-14_12:42:41.59848     self.real = driver_class(**auth_kwargs)
2016-01-14_12:42:41.59848   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 1053, in __init__
2016-01-14_12:42:41.59862     self.zone_list = self.ex_list_zones()
2016-01-14_12:42:41.59863   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 1785, in ex_list_zones
2016-01-14_12:42:41.59881     response = self.connection.request(request, method='GET').object
2016-01-14_12:42:41.59883   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 120, in request
2016-01-14_12:42:41.59889     response = super(GCEConnection, self).request(*args, **kwargs)
2016-01-14_12:42:41.59889   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py", line 698, in request
2016-01-14_12:42:41.59895     raise e
2016-01-14_12:42:41.59895 ssl.SSLError: The read operation timed out

After that, the node manager appears to be stuck. It would be better to retry, or at least to die gracefully.

Steps to fix:

Put the self.real initialization into a retry loop that retries on cloud errors.

Log the error backtrace.
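
The two steps above could be sketched roughly as follows. This is only an illustration, not the code from the 8206-gce-retry-init branch: the helper name init_with_retry, the RETRYABLE_ERRORS tuple, and the retry and backoff parameters are all hypothetical, and the sketch assumes Python 3, whereas the traceback above comes from Python 2.7.

```python
import logging
import ssl
import time

logger = logging.getLogger("retry_init_sketch")

# Assumption: which exceptions count as transient cloud errors.
# In Python 3, ssl.SSLError is already a subclass of OSError; it is
# listed separately only to make the intent explicit.
RETRYABLE_ERRORS = (ssl.SSLError, OSError)


def init_with_retry(driver_class, auth_kwargs, max_tries=5,
                    initial_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Build driver_class(**auth_kwargs), retrying on transient errors.

    Hypothetical helper. It stands in for the line
    `self.real = driver_class(**auth_kwargs)` seen in the traceback.
    """
    wait = initial_wait
    for attempt in range(1, max_tries + 1):
        try:
            return driver_class(**auth_kwargs)
        except RETRYABLE_ERRORS:
            # logger.exception records the message plus the full backtrace.
            logger.exception("cloud driver init failed (attempt %d of %d)",
                             attempt, max_tries)
            if attempt == max_tries:
                # Out of attempts: re-raise so the caller can exit cleanly
                # instead of hanging.
                raise
            sleep(wait)
            # Exponential backoff, capped at max_wait seconds.
            wait = min(wait * 2, max_wait)
```

With a helper like this, the driver wrapper would call self.real = init_with_retry(driver_class, auth_kwargs) rather than constructing the libcloud driver directly. Re-raising after the last attempt keeps the failure visible to the caller rather than retrying forever.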


Subtasks 3 (0 open, 3 closed)

Task #8262: Review 8206-gce-retry-init, Resolved, Peter Amstutz, 01/14/2016
Task #8264: Add retry logic, Resolved, Peter Amstutz, 01/14/2016
Task #8265: Add test, Resolved, Peter Amstutz, 01/14/2016
