Bug #7667
closed[NodeManager] CloudNodeListMonitorActor stopped reporting, and logs are not helpful enough to diagnose
Description
CloudNodeListMonitorActor got stuck:
2015-10-28_00:19:16.88046 2015-10-28 00:19:16 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll 2015-10-28_00:21:18.80563 2015-10-28 00:21:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) got response with 43 items 2015-10-28_00:22:18.86194 2015-10-28 00:22:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll
After this, there are no log messages from CloudNodeListMonitorActor until we restarted the daemon. Then we saw a timeout:
2015-10-28_14:06:49.26831 Stopping arvados-node-manager 2015-10-28_14:06:49.27028 Starting arvados-node-manager from /etc/sv/arvados-node-manager 2015-10-28_14:06:53.75688 2015-10-28 14:06:53 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) sending poll 2015-10-28_14:08:19.41714 2015-10-28 14:08:19 arvnodeman.cloud_nodes[47139] WARNING: CloudNodeListMonitorActor (at 140587341204048) got error: The read operation timed out - waiting 120 seconds
After the timeout, it started working again:
2015-10-28_14:08:53.76068 2015-10-28 14:08:53 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) sending poll 2015-10-28_14:10:51.05526 2015-10-28 14:10:51 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) got response with 37 items
There are no stack traces, in the log which suggests it didn't crash, but did get into a state where it stopped reporting.
Updated by Peter Amstutz almost 9 years ago
- Target version set to 2015-11-11 sprint
Updated by Brett Smith almost 9 years ago
- Target version changed from 2015-11-11 sprint to Arvados Future Sprints
Updated by Brett Smith almost 9 years ago
Both Peter and I separately went over the code for the libcloud Azure ARM driver, and the Node Manager Azure driver, and convinced ourselves that the timeout is correctly being passed through from there the base libcloud connection class. A timeout is properly specified in our Azure cluster configuration files, too. So if the issue here is timeout-related, it's something more nuanced than just "not setting it."
Updated by Brett Smith almost 9 years ago
- Target version deleted (
Arvados Future Sprints)
Updated by Brett Smith over 8 years ago
- Target version set to Arvados Future Sprints
Updated by Tom Clegg over 8 years ago
Peter suggested a good first step would be adding/improving logging statements, so next time this happens we get more clues.
Updated by Tom Clegg over 8 years ago
- Subject changed from [NodeManager] CloudNodeListMonitorActor stopped reporting to [NodeManager] CloudNodeListMonitorActor stopped reporting, and logs are not helpful enough to diagnose
Updated by Peter Amstutz over 8 years ago
- Target version changed from Arvados Future Sprints to 2016-02-17 Sprint
Updated by Peter Amstutz over 8 years ago
Sample log output with the node manager on this branch:
2016-02-05 11:17:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.49520611763 seconds, next poll at 2016-02-05 11:18:57 2016-02-05 11:17:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-02-05 11:17:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.24660396576 seconds, next poll at 2016-02-05 11:18:57 2016-02-05 11:18:17 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 0 items in 20.3000109196 seconds, next poll at 2016-02-05 11:18:57 2016-02-05 11:18:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:18:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:18:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 1 items in 0.10978603363 seconds, next poll at 2016-02-05 11:19:57 2016-02-05 11:18:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 1, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-02-05 11:18:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 1, up 0 (booting 0, idle 0, busy 0), shutting down 0 2016-02-05 11:18:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Want 1 more Standard_D1 nodes. Booting a node. 2016-02-05 11:18:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 0.970148086548 seconds, next poll at 2016-02-05 11:19:57 2016-02-05 11:18:58 googleapiclient.discovery[1402] INFO: URL being requested: PUT https://c97qk.arvadosapi.com/arvados/v1/nodes/c97qk-7ekkf-b4ksh6y4v1lyy5i?alt=json 2016-02-05 11:18:59 ComputeNodeSetupActor.35f2bfc2f2c0[1402] INFO: Sending create_node request for node size Standard_D1. 2016-02-05 11:19:06 ComputeNodeSetupActor.35f2bfc2f2c0[1402] INFO: Cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-b4ksh6y4v1lyy5i-c97qk created. 2016-02-05 11:19:06 googleapiclient.discovery[1402] INFO: URL being requested: PUT https://c97qk.arvadosapi.com/arvados/v1/nodes/c97qk-7ekkf-b4ksh6y4v1lyy5i?alt=json 2016-02-05 11:19:06 ComputeNodeSetupActor.35f2bfc2f2c0[1402] INFO: c97qk-7ekkf-b4ksh6y4v1lyy5i updated properties. 2016-02-05 11:19:06 ComputeNodeSetupActor.35f2bfc2f2c0[1402] INFO: /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-b4ksh6y4v1lyy5i-c97qk post-create work done. 2016-02-05 11:19:06 ComputeNodeSetupActor.35f2bfc2f2c0[1402] INFO: finished 2016-02-05 11:19:16 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 0 items in 19.2054839134 seconds, next poll at 2016-02-05 11:19:57 2016-02-05 11:19:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:19:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:19:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 1 items in 0.118844985962 seconds, next poll at 2016-02-05 11:20:57 2016-02-05 11:19:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 1, up 1 (booting 1, idle 0, busy 0), shutting down 0 2016-02-05 11:19:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.10505390167 seconds, next poll at 2016-02-05 11:20:57 2016-02-05 11:20:22 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 25.246958971 seconds, next poll at 2016-02-05 11:20:57 2016-02-05 11:20:22 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Registering new cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-b4ksh6y4v1lyy5i-c97qk 2016-02-05 11:20:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:20:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:20:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 1 items in 0.153210163116 seconds, next poll at 2016-02-05 11:21:57 2016-02-05 11:20:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 1, up 1 (booting 1, idle 0, busy 0), shutting down 0 2016-02-05 11:20:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.10489988327 seconds, next poll at 2016-02-05 11:21:57 2016-02-05 11:21:21 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 23.8909299374 seconds, next poll at 2016-02-05 11:21:57 2016-02-05 11:21:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:21:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:21:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.164697170258 seconds, next poll at 2016-02-05 11:22:57 2016-02-05 11:21:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 1, idle 0, busy 0), shutting down 0 2016-02-05 11:21:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.03316402435 seconds, next poll at 2016-02-05 11:22:57 2016-02-05 11:21:58 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Cloud node compute-b4ksh6y4v1lyy5i-c97qk is now paired with Arvados node c97qk-7ekkf-b4ksh6y4v1lyy5i 2016-02-05 11:22:19 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 22.2098510265 seconds, next poll at 2016-02-05 11:22:57 2016-02-05 11:22:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:22:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:22:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.264312028885 seconds, next poll at 2016-02-05 11:23:57 2016-02-05 11:22:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0 2016-02-05 11:22:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.10327005386 seconds, next poll at 2016-02-05 11:23:57 2016-02-05 11:23:26 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 29.5173959732 seconds, next poll at 2016-02-05 11:23:57 2016-02-05 11:23:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:23:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:23:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.414124965668 seconds, next poll at 2016-02-05 11:24:57 2016-02-05 11:23:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0 2016-02-05 11:23:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.3468849659 seconds, next poll at 2016-02-05 11:24:57 2016-02-05 11:24:23 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 26.1978158951 seconds, next poll at 2016-02-05 11:24:57 2016-02-05 11:24:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:24:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:24:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.101679086685 seconds, next poll at 2016-02-05 11:25:57 2016-02-05 11:24:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0 2016-02-05 11:24:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 0.968893051147 seconds, next poll at 2016-02-05 11:25:57 2016-02-05 11:25:22 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 24.9726760387 seconds, next poll at 2016-02-05 11:25:57 2016-02-05 11:25:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:25:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:25:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.121600151062 seconds, next poll at 2016-02-05 11:26:57 2016-02-05 11:25:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 0, idle 1, busy 0), shutting down 0 2016-02-05 11:25:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 0.876556873322 seconds, next poll at 2016-02-05 11:26:57 2016-02-05 11:26:22 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 25.2926030159 seconds, next poll at 2016-02-05 11:26:57 2016-02-05 11:26:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:26:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:26:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.129631996155 seconds, next poll at 2016-02-05 11:27:57 2016-02-05 11:26:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 0, idle 1, busy 0), shutting down 0 2016-02-05 11:26:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.01245999336 seconds, next poll at 2016-02-05 11:27:57 2016-02-05 11:27:00 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/discovery/v1/apis/arvados/v1/rest 2016-02-05 11:27:01 ComputeNodeShutdownActor.77891d9d8642.compute-b4ksh6y4v1lyy5i-c97qk[1402] INFO: Starting shutdown 2016-02-05 11:27:20 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 23.3203639984 seconds, next poll at 2016-02-05 11:27:57 2016-02-05 11:27:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:27:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:27:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.0978491306305 seconds, next poll at 2016-02-05 11:28:57 2016-02-05 11:27:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 0.996257066727 seconds, next poll at 2016-02-05 11:28:57 2016-02-05 11:28:21 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 24.1674129963 seconds, next poll at 2016-02-05 11:28:57 2016-02-05 11:28:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:28:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:28:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.149247169495 seconds, next poll at 2016-02-05 11:29:57 2016-02-05 11:28:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.09818291664 seconds, next poll at 2016-02-05 11:29:57 2016-02-05 11:29:04 ComputeNodeShutdownActor.77891d9d8642.compute-b4ksh6y4v1lyy5i-c97qk[1402] INFO: Shutdown success 2016-02-05 11:29:04 googleapiclient.discovery[1402] INFO: URL being requested: PUT https://c97qk.arvadosapi.com/arvados/v1/nodes/c97qk-7ekkf-b4ksh6y4v1lyy5i?alt=json 2016-02-05 11:29:05 ComputeNodeShutdownActor.77891d9d8642.compute-b4ksh6y4v1lyy5i-c97qk[1402] INFO: finished 2016-02-05 11:29:05 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 1 2016-02-05 11:29:05 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 1 2016-02-05 11:29:22 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 0 items in 25.129185915 seconds, next poll at 2016-02-05 11:29:57 2016-02-05 11:29:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json 2016-02-05 11:29:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000 2016-02-05 11:29:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.185049057007 seconds, next poll at 2016-02-05 11:30:57 2016-02-05 11:29:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
Updated by Peter Amstutz over 8 years ago
Bugs fixed:
- ComputeNodeSetupActor and ComputeNodeShutdownActor would get stuck forever if an unhandled exception happened during node creation/shutdown. This will now result in the actor signaling an early finish to its subscribers.
- ComputeNodeMonitorActor will consider shutdown after
cloud_node_start_time + boot_fail_after
, so failed nodes that were not booted in the current process still have a chance to be shut down.
- Added exception handlers in other key places
Updated by Brett Smith over 8 years ago
Is this just open to go back and investgiate the original reported issue of not responding? That makes sense and it's fine if so, just checking for sprint planning purposes.
Updated by Brett Smith over 8 years ago
- Assigned To deleted (
Peter Amstutz) - Target version changed from 2016-02-17 Sprint to Arvados Future Sprints
Updated by Brett Smith over 8 years ago
This happened again as of Node Manager 0.1.20160315133517. There was no additional information in the logs. The situation was exactly the same: CloudNodeListMonitorActor logged that it sent a request, and then was never heard from again.
Updated by Brett Smith over 8 years ago
We expect this to be mitigated by #8236.
Updated by Peter Amstutz about 6 years ago
- Status changed from In Progress to Closed
Closing this since it is out of date.
Updated by Tom Morris almost 6 years ago
- Target version deleted (
Arvados Future Sprints)