Bug #7667

Updated by Peter Amstutz over 8 years ago

CloudNodeListMonitorActor got stuck: 

 <pre> 
 2015-10-28_00:19:16.88046 2015-10-28 00:19:16 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll 
 2015-10-28_00:21:18.80563 2015-10-28 00:21:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) got response with 43 items 
 2015-10-28_00:22:18.86194 2015-10-28 00:22:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll 
 </pre> 

 After this, there were no log messages from CloudNodeListMonitorActor until we restarted the daemon roughly 14 hours later. Then we saw a timeout: 

 <pre> 
 2015-10-28_14:06:49.26831 Stopping arvados-node-manager 
 2015-10-28_14:06:49.27028 Starting arvados-node-manager from /etc/sv/arvados-node-manager 
 2015-10-28_14:06:53.75688 2015-10-28 14:06:53 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) sending poll 
 2015-10-28_14:08:19.41714 2015-10-28 14:08:19 arvnodeman.cloud_nodes[47139] WARNING: CloudNodeListMonitorActor (at 140587341204048) got error: The read operation timed out - waiting 120 seconds 
 </pre> 

 After the timeout, it started working again: 

 <pre> 
 2015-10-28_14:08:53.76068 2015-10-28 14:08:53 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) sending poll 
 2015-10-28_14:10:51.05526 2015-10-28 14:10:51 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) got response with 37 items 
 </pre> 

 There are no stack traces in the log, which suggests it didn't crash but did get into a state where it stopped reporting. 
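
 To check other logs for the same symptom, here is a rough sketch of a script that flags long silences between CloudNodeListMonitorActor messages. It is not part of node manager; the name @find_stalls@, the regular expression, and the 10-minute threshold are assumptions based only on the log format shown above.

```python
import re
from datetime import datetime, timedelta

# Matches the second (logger) timestamp and the actor message in lines like
# the ones quoted in this ticket. The pattern is an assumption derived from
# those samples, not from the node manager source.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"arvnodeman\.cloud_nodes\[\d+\] \w+: "
    r"CloudNodeListMonitorActor \(at \d+\) (?P<msg>.*)"
)


def find_stalls(lines, max_gap=timedelta(minutes=10)):
    """Return (last_seen, next_seen) pairs where the actor was silent
    for longer than max_gap (hypothetical threshold)."""
    stalls = []
    prev = None
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a CloudNodeListMonitorActor line
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        if prev is not None and ts - prev > max_gap:
            stalls.append((prev, ts))
        prev = ts
    return stalls


# Log lines copied from this ticket.
SAMPLE = [
    "2015-10-28_00:19:16.88046 2015-10-28 00:19:16 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll",
    "2015-10-28_00:21:18.80563 2015-10-28 00:21:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) got response with 43 items",
    "2015-10-28_00:22:18.86194 2015-10-28 00:22:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll",
    "2015-10-28_14:06:49.26831 Stopping arvados-node-manager",
    "2015-10-28_14:06:53.75688 2015-10-28 14:06:53 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) sending poll",
]

if __name__ == "__main__":
    for last_seen, next_seen in find_stalls(SAMPLE):
        print("silent from", last_seen, "to", next_seen)
```

 On the lines quoted in this ticket it reports a single gap, from the 00:22:18 poll to the first poll after the restart at 14:06:53. Since the gap is only measured between two logged messages, a stall that is still ongoing at the end of the log would not show up.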

