Bug #7667

[NodeManager] CloudNodeListMonitorActor stopped reporting, and logs are not helpful enough to diagnose

Added by Peter Amstutz over 8 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: 1.0

Description

CloudNodeListMonitorActor got stuck:

2015-10-28_00:19:16.88046 2015-10-28 00:19:16 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll
2015-10-28_00:21:18.80563 2015-10-28 00:21:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) got response with 43 items
2015-10-28_00:22:18.86194 2015-10-28 00:22:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll

After this, there were no further log messages from CloudNodeListMonitorActor until we restarted the daemon. Then we saw a timeout:

2015-10-28_14:06:49.26831 Stopping arvados-node-manager
2015-10-28_14:06:49.27028 Starting arvados-node-manager from /etc/sv/arvados-node-manager
2015-10-28_14:06:53.75688 2015-10-28 14:06:53 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) sending poll
2015-10-28_14:08:19.41714 2015-10-28 14:08:19 arvnodeman.cloud_nodes[47139] WARNING: CloudNodeListMonitorActor (at 140587341204048) got error: The read operation timed out - waiting 120 seconds

After the timeout, it started working again:

2015-10-28_14:08:53.76068 2015-10-28 14:08:53 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) sending poll
2015-10-28_14:10:51.05526 2015-10-28 14:10:51 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) got response with 37 items

There are no stack traces in the log, which suggests it didn't crash but instead got into a state where it stopped reporting.
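
One way to make this kind of stall visible in the logs would be a watchdog that warns when a poll has been outstanding for longer than expected. The following is a minimal sketch under stated assumptions, not Node Manager's actual code: `PollWatchdog`, `max_wait`, the `clock` parameter, and the logger name are all hypothetical names introduced for illustration.

```python
import logging
import time


class PollWatchdog:
    """Tracks one outstanding poll and reports when it is overdue.

    Hypothetical helper for illustration; not part of arvnodeman.
    """

    def __init__(self, name, max_wait, clock=time.monotonic, logger=None):
        self.name = name
        self.max_wait = max_wait          # seconds before a poll counts as overdue
        self._clock = clock               # injectable so the logic can be tested
        self._logger = logger or logging.getLogger("watchdog_sketch")
        self._sent_at = None              # None means no poll is outstanding

    def poll_sent(self):
        """Record that a poll was just sent."""
        self._sent_at = self._clock()
        self._logger.debug("%s sending poll", self.name)

    def poll_finished(self):
        """Record that the poll completed; return elapsed seconds, or None."""
        if self._sent_at is None:
            return None
        elapsed = self._clock() - self._sent_at
        self._sent_at = None
        self._logger.debug("%s got response after %.1f seconds",
                           self.name, elapsed)
        return elapsed

    def check(self):
        """Return True and log a warning if the outstanding poll is overdue."""
        if self._sent_at is None:
            return False
        waited = self._clock() - self._sent_at
        if waited > self.max_wait:
            self._logger.warning(
                "%s: no response %.1f seconds after sending poll",
                self.name, waited)
            return True
        return False
```

If something like `check()` were called periodically from a separate timer, a poll that never returns would produce repeated warnings instead of silence.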


Subtasks: 2 (0 open, 2 closed)

Task #8360: Review 7667-node-manager-logging (Resolved, Nico César, 02/10/2016)
Task #8359: Improve node manager logging (Resolved, Peter Amstutz, 02/05/2016)