Story #8894

[Node Manager] Less logging of tracebacks for routine exceptions

Added by Nico César over 3 years ago. Updated over 3 years ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
Start date:
04/06/2016
Due date:
% Done:

0%

Estimated time:
Story points:
-

Description

It would be useful for ops if it were clearer that exception tracebacks represent unhandled errors in Node Manager. Exceptions that are handled by the code, like those retried in the RetryMixin, should be logged less verbosely. Here is an example of an exception traceback that was in fact handled fine:

2016-04-02_12:03:45.58984 2016-04-02 12:03:44 ComputeNodeShutdownActor.feddab943501.compute-xue76h3ns5mmkty-qr1hi[55464] WARNING: Client error: <LibcloudError in <libcloud.common.azure.AzureResponse object at 0x7f7c008ab250> 'Unknown error Status code: 500.'> - waiting 1 seconds
2016-04-02_12:03:45.58986 Traceback (most recent call last):
2016-04-02_12:03:45.58987   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/__init__.py", line 74, in retry_wrapper
2016-04-02_12:03:45.58988     ret = orig_func(self, *args, **kwargs)
2016-04-02_12:03:45.58988   File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 227, in shutdown_node
2016-04-02_12:03:45.58989     if not self._cloud.destroy_node(self.cloud_node):
2016-04-02_12:03:45.58989   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/azure_arm.py", line 654, in destroy_node
2016-04-02_12:03:45.58990     node.extra["properties"]["storageProfile"]["osDisk"]["vhd"]["uri"])
2016-04-02_12:03:45.58991   File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/azure_arm.py", line 1108, in _ex_delete_old_vhd
2016-04-02_12:03:45.58991     blob))
2016-04-02_12:03:45.58992   File "/usr/local/lib/python2.7/dist-packages/libcloud/storage/drivers/azure_blobs.py", line 453, in get_object
2016-04-02_12:03:45.58993     response = self.connection.request(object_path, method='HEAD')
2016-04-02_12:03:45.58994   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py", line 799, in request
2016-04-02_12:03:45.58995     response = responseCls(**kwargs)
2016-04-02_12:03:45.58995   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py", line 142, in __init__
2016-04-02_12:03:45.58996     message=self.parse_error(),
2016-04-02_12:03:45.58996   File "/usr/local/lib/python2.7/dist-packages/libcloud/common/azure.py", line 91, in parse_error
2016-04-02_12:03:45.58997     driver=self
2016-04-02_12:03:45.58998 LibcloudError: <LibcloudError in <libcloud.common.azure.AzureResponse object at 0x7f7c008ab250> 'Unknown error Status code: 500.'>

Maybe we could log these tracebacks at a lower logging level (e.g. DEBUG), so that with the default configuration ops would not see them unless they opt in to more verbose logging?
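One way to implement this suggestion, as a minimal sketch using Python's standard logging module (the helper and logger names here are illustrative, not Node Manager's actual code): keep the one-line summary at WARNING, and attach the full traceback only at DEBUG, so it stays out of the log at the default level.

```python
import logging

# Illustrative logger name; Node Manager's real loggers are named differently.
logger = logging.getLogger("arvnodeman.retry")


def log_retried_exception(error, wait_seconds):
    """Log a handled, retryable exception (hypothetical helper).

    Must be called while the exception is being handled, so that
    exc_info=True can pick up the active traceback.
    """
    # One-line summary remains visible at the default logging level.
    logger.warning("Client error: %s - waiting %d seconds",
                   error, wait_seconds)
    # Full traceback appears only if ops lower the level to DEBUG.
    logger.debug("Traceback for retried exception:", exc_info=True)
```

With this split, the default configuration shows only the one-line WARNING, and ops who need the traceback can set the logger (or a handler) to DEBUG.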

History

#1 Updated by Brett Smith over 3 years ago

Note that Node Manager's retry code logs the full traceback for every exception it handles. The fact that a traceback appeared, by itself, doesn't mean that the exception was unhandled.
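To illustrate why that is: a retry wrapper in the spirit of RetryMixin's retry_wrapper can log the full traceback on every attempt it handles, even when a later attempt succeeds. The decorator below is a hypothetical sketch (names, structure, and retry policy are assumptions, not Node Manager's code):

```python
import functools
import logging

logger = logging.getLogger("arvnodeman.retry")  # illustrative name


def retrying(max_attempts=3):
    """Hypothetical retry decorator (not Node Manager's actual code).

    Every handled exception is logged with exc_info=True, so a full
    traceback appears in the log even when a later attempt succeeds.
    """
    def decorator(orig_func):
        @functools.wraps(orig_func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return orig_func(*args, **kwargs)
                except Exception as error:
                    if attempt == max_attempts:
                        raise  # out of retries: genuinely unhandled
                    # Handled and retried, yet the traceback is still logged.
                    logger.warning("Client error: %s - retrying", error,
                                   exc_info=True)
        return wrapper
    return decorator
```

So a traceback in the log may simply mean "attempt 1 of N failed", not that anything went unrecovered.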

#2 Updated by Brett Smith over 3 years ago

The ShutdownActor and MonitorActor were stopped immediately after the exception:

2016-04-02_12:03:45.59123 2016-04-02 12:03:45 pykka[55464] DEBUG: Unregistered ComputeNodeMonitorActor (urn:uuid:ed769f41-f414-4db9-bafb-2826815a445e)
2016-04-02_12:03:45.59125 2016-04-02 12:03:45 pykka[55464] DEBUG: Stopped ComputeNodeMonitorActor (urn:uuid:ed769f41-f414-4db9-bafb-2826815a445e)
2016-04-02_12:03:45.59142 2016-04-02 12:03:45 pykka[55464] DEBUG: Unregistered ComputeNodeShutdownActor (urn:uuid:2c32c6e7-80a6-4d92-8270-feddab943501)
2016-04-02_12:03:45.59157 2016-04-02 12:03:45 pykka[55464] DEBUG: Stopped ComputeNodeShutdownActor (urn:uuid:2c32c6e7-80a6-4d92-8270-feddab943501)

#3 Updated by Nico César over 3 years ago

Brett Smith wrote:

Note that Node Manager's retry code logs the full traceback for every exception it handles. The fact that a traceback appeared, by itself, doesn't mean that the exception was unhandled.

Yes. You're right about this. Actually in #8896 we can see that at 2016-04-06 03:40:23 nodes were coming up as expected.

I think we can close this ticket, since it's a false positive for the symptoms of "nodemanager being wedged".

Personally, I would prefer not to see tracebacks in the log file, but that is a much lower priority issue.

#4 Updated by Brett Smith over 3 years ago

Nico César wrote:

Brett Smith wrote:

Note that Node Manager's retry code logs the full traceback for every exception it handles. The fact that a traceback appeared, by itself, doesn't mean that the exception was unhandled.

Yes. You're right about this. Actually in #8896 we can see that at 2016-04-06 03:40:23 nodes were coming up as expected.

Yeah. It looks like what happened is that Libcloud timed out waiting for a response to its shutdown request, but Azure did receive and act on the request, and Node Manager noticed this through the cloud node listings.

Personally, I would prefer not to see tracebacks in the log file, but that is a much lower priority issue.

We can make the ticket about that, sure.

#5 Updated by Brett Smith over 3 years ago

  • Tracker changed from Bug to Story
  • Subject changed from [NODEMANAGER] libcloud.common.azure.AzureResponse was a LibcloudError. Nodemanager wedged to [Node Manager] Less logging of tracebacks for routine exceptions
  • Description updated (diff)

#6 Updated by Nico César over 3 years ago

See #8996 for another warning that shouldn't appear.

#7 Updated by Brett Smith over 3 years ago

  • Target version set to Arvados Future Sprints
