Bug #8206 (closed)

[Node Manager] GCE compute node driver needs to retry I/O errors initializing libcloud driver

Added by Nico César about 8 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: Node Manager
Target version: 2016-02-03 Sprint
Story points: 1.0

Description

An SSL error raised while evaluating max_total_price=config.getfloat('Daemon', 'max_total_price')).proxy() will stop the node manager.

Here is the stack trace:


2016-01-14_12:42:41.57638 Traceback (most recent call last):
2016-01-14_12:42:41.57641   File "/usr/local/bin/arvados-node-manager", line 6, in <module>
2016-01-14_12:42:41.57643     main()
2016-01-14_12:42:41.57643   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/launcher.py", line 125, in main
2016-01-14_12:42:41.59825     max_total_price=config.getfloat('Daemon', 'max_total_price')).proxy()
2016-01-14_12:42:41.59827   File "/usr/local/lib/python2.7/dist-packages/pykka/actor.py", line 94, in start
2016-01-14_12:42:41.59827     obj = cls(*args, **kwargs)
2016-01-14_12:42:41.59829   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 123, in __init__
2016-01-14_12:42:41.59844     self._cloud_driver = self._new_cloud()
2016-01-14_12:42:41.59846   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/config.py", line 105, in new_cloud_client
2016-01-14_12:42:41.59846     self.get_section('Cloud Create'))
2016-01-14_12:42:41.59847   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/gce.py", line 36, in __init__
2016-01-14_12:42:41.59847     driver_class)
2016-01-14_12:42:41.59847   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/driver/__init__.py", line 40, in __init__
2016-01-14_12:42:41.59848     self.real = driver_class(**auth_kwargs)
2016-01-14_12:42:41.59848   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 1053, in __init__
2016-01-14_12:42:41.59862     self.zone_list = self.ex_list_zones()
2016-01-14_12:42:41.59863   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 1785, in ex_list_zones
2016-01-14_12:42:41.59881     response = self.connection.request(request, method='GET').object
2016-01-14_12:42:41.59883   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/gce.py", line 120, in request
2016-01-14_12:42:41.59889     response = super(GCEConnection, self).request(*args, **kwargs)
2016-01-14_12:42:41.59889   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/google.py", line 698, in request
2016-01-14_12:42:41.59895     raise e
2016-01-14_12:42:41.59895 ssl.SSLError: The read operation timed out

It seems that the node manager is stuck after that. It would be good to retry, or at least die gracefully.

Steps to fix:

  • Put the self.real initialization into a retry loop on cloud errors (see the sketch below).
  • Log the error backtrace.
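
A minimal sketch of what that retry loop could look like (the function name, error classes caught, and wait bounds below are illustrative assumptions, not the actual arvnodeman code):

import logging
import ssl
import time

from libcloud.common.types import LibcloudError

# Illustrative bounds; the real node manager reads its retry settings from config.
MIN_RETRY_WAIT = 1
MAX_RETRY_WAIT = 180

def init_driver_with_retry(driver_class, auth_kwargs, logger=None):
    """Keep constructing the libcloud driver until it succeeds.

    I/O and SSL errors raised during driver __init__ (e.g. from
    ex_list_zones) are logged with a backtrace and retried with an
    increasing delay instead of crashing the daemon.
    """
    logger = logger or logging.getLogger(__name__)
    wait = MIN_RETRY_WAIT
    while True:
        try:
            return driver_class(**auth_kwargs)
        except (ssl.SSLError, IOError, LibcloudError):
            logger.exception("error initializing cloud driver, retrying in %ds", wait)
            time.sleep(wait)
            wait = min(wait * 2, MAX_RETRY_WAIT)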


Subtasks 3 (0 open, 3 closed)

Task #8262: Review 8206-gce-retry-init | Resolved | Peter Amstutz | 01/14/2016
Task #8264: Add retry logic | Resolved | Peter Amstutz | 01/14/2016
Task #8265: Add test | Resolved | Peter Amstutz | 01/14/2016
#1

Updated by Brett Smith about 8 years ago

  • Subject changed from [NODEMANAGER] ssl.SSLError when reading config will get nodemanager stuck to [Node Manager] GCE compute node driver needs to retry I/O errors initializing libcloud driver
  • Category set to Node Manager

#8226 saw the same thing from another part of the code. The network call happens inside the libcloud driver's init method, so we need to be able to retry that initialization. We need to check whether that's safe. If not, maybe we can override __new__ to get what we need?

#2

Updated by Brett Smith about 8 years ago

  • Target version set to Arvados Future Sprints
#3

Updated by Peter Amstutz about 8 years ago

  • Description updated (diff)
  • Story points set to 1.0
#4

Updated by Peter Amstutz about 8 years ago

  • Target version changed from Arvados Future Sprints to 2016-02-03 Sprint
#5

Updated by Peter Amstutz about 8 years ago

  • Assigned To set to Peter Amstutz
#6

Updated by Tom Clegg about 8 years ago

It seems a bit odd at this point to maintain the pretense that _retry() is "private" enough to have a leading underscore.

Also, now that it's divorced from the ComputeNodeStateChangeBase class, its docstring should mention a full list of assumptions it makes about the calling class (self.retry_wait, self.max_retry_wait, self.min_retry_wait, self._cloud, self._logger, ...?)

Alternatively, this seems like it could be refactored as a mixin, so those things can be passed to a constructor, they can have defaults, it's more obvious what's happening, etc.
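
For illustration, a constructor-driven mixin along those lines might look roughly like this (the names and defaults are assumptions, not the RetryMixin that was actually committed):

import logging

class RetryMixin(object):
    """Sketch only: make the retry state explicit in the constructor instead
    of assuming the calling class already defines retry_wait, _logger, etc."""
    def __init__(self, min_retry_wait=1, max_retry_wait=180,
                 logger=None, cloud=None, timer=None):
        self.min_retry_wait = min_retry_wait
        self.max_retry_wait = max_retry_wait
        self.retry_wait = min_retry_wait
        self._logger = logger or logging.getLogger(__name__)
        self._cloud = cloud
        self._timer = timer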

Seems like self.retry_wait should get reset to self.min_retry_wait when re-raising an unhandled exception. (OTOH the previous version seems to have survived without this.)

The delay behavior seems to differ depending on whether there is a self._timer:
  • with a self._timer, attempt N+1 happens self._retry_wait seconds after the start time of attempt N
  • without a self._timer, attempt N+1 happens self._retry_wait seconds after the start time of the first attempt (this is probably unintentional; it seems to mean once we reach max_retry_wait we will stop delaying entirely)

How about having attempt N+1 happen self._retry_wait seconds after the end time of attempt N in both cases?
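
A hypothetical sketch of that behavior (the timer call shown is an assumed interface, not the actual timer actor API):

import time

def schedule_next_attempt(self, attempt):
    # Called when attempt N has just failed: attempt N+1 starts
    # self.retry_wait seconds after the *end* of attempt N, with or
    # without a timer actor.
    if self._timer is not None:
        self._timer.schedule(time.time() + self.retry_wait, attempt)  # assumed interface
    else:
        time.sleep(self.retry_wait)
        attempt()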

#7

Updated by Peter Amstutz about 8 years ago

Tom Clegg wrote:

> Alternatively, this seems like it could be refactored as a mixin, so those things can be passed to a constructor, they can have defaults, it's more obvious what's happening, etc.

Refactored to RetryMixin.

> Seems like self.retry_wait should get reset to self.min_retry_wait when re-raising an unhandled exception. (OTOH the previous version seems to have survived without this.)

Fixed.

> The delay behavior seems to differ depending on whether there is a self._timer:
>   • with a self._timer, attempt N+1 happens self._retry_wait seconds after the start time of attempt N
>   • without a self._timer, attempt N+1 happens self._retry_wait seconds after the start time of the first attempt (this is probably unintentional; it seems to mean once we reach max_retry_wait we will stop delaying entirely)
>
> How about having attempt N+1 happen self._retry_wait seconds after the end time of attempt N in both cases?

Yep, that makes sense. Done.

#8

Updated by Tom Clegg about 8 years ago

LGTM, thanks.

#9

Updated by Tom Clegg about 8 years ago

Oh. But could you mock sleep() for the test? Otherwise this is node manager's first slow test. (You could assert sleep() was called as expected, too.)
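
A self-contained sketch of the idea (the retry helper here is a toy stand-in, not the node manager code under test):

import time
import unittest

import mock  # unittest.mock on Python 3


def call_with_retry(attempt, wait=1):
    # Toy retry loop used only by this sketch.
    while True:
        try:
            return attempt()
        except IOError:
            time.sleep(wait)


def flaky(outcomes):
    # Raises each exception in the list in turn, then returns the final value.
    def attempt():
        result = outcomes.pop(0)
        if isinstance(result, Exception):
            raise result
        return result
    return attempt


class RetrySleepTestCase(unittest.TestCase):
    @mock.patch('time.sleep')  # patched so the retries do not actually wait
    def test_sleep_called_between_attempts(self, sleep_mock):
        attempt = flaky([IOError("boom"), IOError("boom"), "ok"])
        self.assertEqual(call_with_retry(attempt), "ok")
        self.assertEqual(sleep_mock.call_count, 2)
        sleep_mock.assert_called_with(1)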

#10

Updated by Peter Amstutz about 8 years ago

  • Status changed from New to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:40103b455d0df2bc47eaba422a72bf6ec4fd6808.
