Idea #8894
closed[Node Manager] Less logging of tracebacks for routine exceptions
Description
It would be useful for ops if it were clearer that exception tracebacks represented unhandled errors in Node Manager. For exceptions that are being handled by the code, like exceptions being retried in the RetryMixin, it would help if they were logged less verbosely. For an example of an exception traceback that actually got handled fine:
2016-04-02_12:03:45.58984 2016-04-02 12:03:44 ComputeNodeShutdownActor.feddab943501.compute-xue76h3ns5mmkty-qr1hi[55464] WARNING: Client error: <LibcloudError in <libcloud. common.azure.AzureResponse object at 0x7f7c008ab250> 'Unknown error Status code: 500.'> - waiting 1 seconds 2016-04-02_12:03:45.58986 Traceback (most recent call last): 2016-04-02_12:03:45.58987 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/__init__.py", line 74, in retry_wrapper 2016-04-02_12:03:45.58988 ret = orig_func(self, *args, **kwargs) 2016-04-02_12:03:45.58988 File "/usr/local/lib/python2.7/dist-packages/arvnodeman/computenode/dispatch/__init__.py", line 227, in shutdown_node 2016-04-02_12:03:45.58989 if not self._cloud.destroy_node(self.cloud_node): 2016-04-02_12:03:45.58989 File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/azure_arm.py", line 654, in destroy_node 2016-04-02_12:03:45.58990 node.extra["properties"]["storageProfile"]["osDisk"]["vhd"]["uri"]) 2016-04-02_12:03:45.58991 File "/usr/local/lib/python2.7/dist-packages/libcloud/compute/drivers/azure_arm.py", line 1108, in _ex_delete_old_vhd 2016-04-02_12:03:45.58991 blob)) 2016-04-02_12:03:45.58992 File "/usr/local/lib/python2.7/dist-packages/libcloud/storage/drivers/azure_blobs.py", line 453, in get_object 2016-04-02_12:03:45.58993 response = self.connection.request(object_path, method='HEAD') 2016-04-02_12:03:45.58994 File "/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py", line 799, in request 2016-04-02_12:03:45.58995 response = responseCls(**kwargs) 2016-04-02_12:03:45.58995 File "/usr/local/lib/python2.7/dist-packages/libcloud/common/base.py", line 142, in __init__ 2016-04-02_12:03:45.58996 message=self.parse_error(), 2016-04-02_12:03:45.58996 File "/usr/local/lib/python2.7/dist-packages/libcloud/common/azure.py", line 91, in parse_error 2016-04-02_12:03:45.58997 driver=self 2016-04-02_12:03:45.58998 LibcloudError: <LibcloudError in <libcloud.common.azure.AzureResponse object at 0x7f7c008ab250> 'Unknown error Status code: 500.'>
Maybe we could log these tracebacks at a lower logging level, and ops could configure a higher logging level by default?
Updated by Brett Smith about 8 years ago
Note that Node Manager's retry code logs the full traceback for every exception it handles. The fact that a traceback appeared, by itself, doesn't mean that the exception was unhandled.
Updated by Brett Smith about 8 years ago
The ShutdownActor and MonitorActor were stopped immediately after the exception:
2016-04-02_12:03:45.59123 2016-04-02 12:03:45 pykka[55464] DEBUG: Unregistered ComputeNodeMonitorActor (urn:uuid:ed769f41-f414-4db9-bafb-2826815a445e) 2016-04-02_12:03:45.59125 2016-04-02 12:03:45 pykka[55464] DEBUG: Stopped ComputeNodeMonitorActor (urn:uuid:ed769f41-f414-4db9-bafb-2826815a445e) 2016-04-02_12:03:45.59142 2016-04-02 12:03:45 pykka[55464] DEBUG: Unregistered ComputeNodeShutdownActor (urn:uuid:2c32c6e7-80a6-4d92-8270-feddab943501) 2016-04-02_12:03:45.59157 2016-04-02 12:03:45 pykka[55464] DEBUG: Stopped ComputeNodeShutdownActor (urn:uuid:2c32c6e7-80a6-4d92-8270-feddab943501)
Updated by Nico César about 8 years ago
Brett Smith wrote:
Note that Node Manager's retry code logs the full traceback for every exception it handles. The fact that a traceback appeared, by itself, doesn't mean that the exception was unhandled.
Yes. You're right about this. Actually in #8896 we can see that at 2016-04-06 03:40:23 nodes were coming up as expected.
I think we can close this ticket since it's a false positive for the symptoms of "nodemanager being wedged",
Personally I would like not to see tracebacks on the log file. But it's a much lower priority issue regarding that.
Updated by Brett Smith about 8 years ago
Nico Cesar wrote:
Brett Smith wrote:
Note that Node Manager's retry code logs the full traceback for every exception it handles. The fact that a traceback appeared, by itself, doesn't mean that the exception was unhandled.
Yes. You're right about this. Actually in #8896 we can see that at 2016-04-06 03:40:23 nodes were coming up as expected.
Yeah. It looks like what happened is that Libcloud timed out waiting for a response to its shutdown request, but Azure did receive and act on the request, and Node Manager noticed this through the cloud node listings.
Personally I would like not to see tracebacks on the log file. But it's a much lower priority issue regarding that.
We can make the ticket about that, sure.
Updated by Brett Smith about 8 years ago
- Tracker changed from Bug to Idea
- Subject changed from [NODEMANAGER] libcloud.common.azure.AzureResponse was a LibcloudError. Nodemanager wedged to [Node Manager] Less logging of tracebacks for routine exceptions
- Description updated (diff)
Updated by Nico César about 8 years ago
see #8996 for another warning that shouldn't appear
Updated by Brett Smith about 8 years ago
- Target version set to Arvados Future Sprints
Updated by Ward Vandewege over 3 years ago
- Target version deleted (
Arvados Future Sprints)