Bug #7667
Updated by Peter Amstutz over 8 years ago
CloudNodeListMonitorActor got stuck:
<pre>
2015-10-28_00:19:16.88046 2015-10-28 00:19:16 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll
2015-10-28_00:21:18.80563 2015-10-28 00:21:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) got response with 43 items
2015-10-28_00:22:18.86194 2015-10-28 00:22:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll
</pre>
After this, there were no further log messages from CloudNodeListMonitorActor until we restarted the daemon nearly 14 hours later. Then we saw a timeout:
<pre>
2015-10-28_14:06:49.26831 Stopping arvados-node-manager
2015-10-28_14:06:49.27028 Starting arvados-node-manager from /etc/sv/arvados-node-manager
2015-10-28_14:06:53.75688 2015-10-28 14:06:53 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) sending poll
2015-10-28_14:08:19.41714 2015-10-28 14:08:19 arvnodeman.cloud_nodes[47139] WARNING: CloudNodeListMonitorActor (at 140587341204048) got error: The read operation timed out - waiting 120 seconds
</pre>
After the timeout, it started working again:
<pre>
2015-10-28_14:08:53.76068 2015-10-28 14:08:53 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) sending poll
2015-10-28_14:10:51.05526 2015-10-28 14:10:51 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) got response with 37 items
</pre>
There are no stack traces in the log, which suggests it didn't crash but instead got into a state where it stopped reporting.
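To spot this kind of stall in other logs, a rough sketch like the following might help. It is not part of node manager: the function name and regex are hypothetical, and it assumes the line format shown in the excerpts above (runit timestamp, log timestamp, logger name with PID, level, message). It reports every "sending poll" that was never followed by a "got response" or "got error" from the same process.

```python
import re
from datetime import datetime

# Assumed line layout, taken from the log excerpts in this report:
# <runit ts> <date> <time> <logger>[<pid>] <LEVEL>: CloudNodeListMonitorActor (at <id>) <message>
LINE_RE = re.compile(
    r"^\S+ (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\S+\[(?P<pid>\d+)\] \w+: CloudNodeListMonitorActor \(at \d+\) (?P<msg>.*)$"
)


def find_unanswered_polls(lines):
    """Return (pid, timestamp) for each poll that never got a response or error."""
    pending = {}      # pid -> timestamp of the poll still waiting for a reply
    unanswered = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip unrelated lines such as "Stopping arvados-node-manager"
        pid = m.group("pid")
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        msg = m.group("msg")
        if msg.startswith("sending poll"):
            if pid in pending:
                # A new poll was sent before the previous one was answered.
                unanswered.append((pid, pending[pid]))
            pending[pid] = ts
        elif msg.startswith(("got response", "got error")):
            pending.pop(pid, None)
    # Polls still pending at the end of the log were never answered.
    unanswered.extend(sorted(pending.items()))
    return unanswered
```

Run against the excerpts above, this should flag only the 00:22:18 poll from PID 65149; both polls from PID 47139 are matched by the timeout error and the later response.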