Bug #7667

[NodeManager] CloudNodeListMonitorActor stopped reporting, and logs are not helpful enough to diagnose

Added by Peter Amstutz about 6 years ago. Updated about 3 years ago.

Status:
Closed
Priority:
Normal
Assigned To:
-
Category:
-
Target version:
-
Start date:
02/05/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
1.0

Description

CloudNodeListMonitorActor got stuck:

2015-10-28_00:19:16.88046 2015-10-28 00:19:16 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll
2015-10-28_00:21:18.80563 2015-10-28 00:21:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) got response with 43 items
2015-10-28_00:22:18.86194 2015-10-28 00:22:18 arvnodeman.cloud_nodes[65149] DEBUG: CloudNodeListMonitorActor (at 140224248482384) sending poll

After this, there are no log messages from CloudNodeListMonitorActor until we restarted the daemon. Then we saw a timeout:

2015-10-28_14:06:49.26831 Stopping arvados-node-manager
2015-10-28_14:06:49.27028 Starting arvados-node-manager from /etc/sv/arvados-node-manager
2015-10-28_14:06:53.75688 2015-10-28 14:06:53 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) sending poll
2015-10-28_14:08:19.41714 2015-10-28 14:08:19 arvnodeman.cloud_nodes[47139] WARNING: CloudNodeListMonitorActor (at 140587341204048) got error: The read operation timed out - waiting 120 seconds

After the timeout, it started working again:

2015-10-28_14:08:53.76068 2015-10-28 14:08:53 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) sending poll
2015-10-28_14:10:51.05526 2015-10-28 14:10:51 arvnodeman.cloud_nodes[47139] DEBUG: CloudNodeListMonitorActor (at 140587341204048) got response with 37 items

There are no stack traces, in the log which suggests it didn't crash, but did get into a state where it stopped reporting.


Subtasks

Task #8360: Review 7667-node-manager-loggingResolvedNico C├ęsar

Task #8359: Improve node manager loggingResolvedPeter Amstutz

Associated revisions

Revision 4186924d
Added by Peter Amstutz almost 6 years ago

Merge branch '7667-node-manager-logging' refs #7667

History

#1 Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)

#2 Updated by Peter Amstutz about 6 years ago

  • Description updated (diff)

#3 Updated by Peter Amstutz about 6 years ago

  • Target version set to 2015-11-11 sprint

#4 Updated by Brett Smith about 6 years ago

  • Target version changed from 2015-11-11 sprint to Arvados Future Sprints

#5 Updated by Brett Smith about 6 years ago

Both Peter and I separately went over the code for the libcloud Azure ARM driver, and the Node Manager Azure driver, and convinced ourselves that the timeout is correctly being passed through from there the base libcloud connection class. A timeout is properly specified in our Azure cluster configuration files, too. So if the issue here is timeout-related, it's something more nuanced than just "not setting it."

#6 Updated by Brett Smith about 6 years ago

  • Target version deleted (Arvados Future Sprints)

#7 Updated by Brett Smith almost 6 years ago

  • Target version set to Arvados Future Sprints

#8 Updated by Tom Clegg almost 6 years ago

Peter suggested a good first step would be adding/improving logging statements, so next time this happens we get more clues.

#9 Updated by Tom Clegg almost 6 years ago

  • Subject changed from [NodeManager] CloudNodeListMonitorActor stopped reporting to [NodeManager] CloudNodeListMonitorActor stopped reporting, and logs are not helpful enough to diagnose

#10 Updated by Tom Clegg almost 6 years ago

  • Story points set to 1.0

#11 Updated by Peter Amstutz almost 6 years ago

  • Target version changed from Arvados Future Sprints to 2016-02-17 Sprint

#12 Updated by Peter Amstutz almost 6 years ago

  • Assigned To set to Peter Amstutz

#13 Updated by Peter Amstutz almost 6 years ago

Sample log output with the node manager on this branch:

2016-02-05 11:17:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.49520611763 seconds, next poll at 2016-02-05 11:18:57
2016-02-05 11:17:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-02-05 11:17:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.24660396576 seconds, next poll at 2016-02-05 11:18:57
2016-02-05 11:18:17 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 0 items in 20.3000109196 seconds, next poll at 2016-02-05 11:18:57
2016-02-05 11:18:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:18:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:18:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 1 items in 0.10978603363 seconds, next poll at 2016-02-05 11:19:57
2016-02-05 11:18:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 1, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-02-05 11:18:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 1, up 0 (booting 0, idle 0, busy 0), shutting down 0
2016-02-05 11:18:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Want 1 more Standard_D1 nodes.  Booting a node.
2016-02-05 11:18:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 0.970148086548 seconds, next poll at 2016-02-05 11:19:57
2016-02-05 11:18:58 googleapiclient.discovery[1402] INFO: URL being requested: PUT https://c97qk.arvadosapi.com/arvados/v1/nodes/c97qk-7ekkf-b4ksh6y4v1lyy5i?alt=json
2016-02-05 11:18:59 ComputeNodeSetupActor.35f2bfc2f2c0[1402] INFO: Sending create_node request for node size Standard_D1.
2016-02-05 11:19:06 ComputeNodeSetupActor.35f2bfc2f2c0[1402] INFO: Cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-b4ksh6y4v1lyy5i-c97qk created.
2016-02-05 11:19:06 googleapiclient.discovery[1402] INFO: URL being requested: PUT https://c97qk.arvadosapi.com/arvados/v1/nodes/c97qk-7ekkf-b4ksh6y4v1lyy5i?alt=json
2016-02-05 11:19:06 ComputeNodeSetupActor.35f2bfc2f2c0[1402] INFO: c97qk-7ekkf-b4ksh6y4v1lyy5i updated properties.
2016-02-05 11:19:06 ComputeNodeSetupActor.35f2bfc2f2c0[1402] INFO: /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-b4ksh6y4v1lyy5i-c97qk post-create work done.
2016-02-05 11:19:06 ComputeNodeSetupActor.35f2bfc2f2c0[1402] INFO: finished
2016-02-05 11:19:16 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 0 items in 19.2054839134 seconds, next poll at 2016-02-05 11:19:57
2016-02-05 11:19:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:19:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:19:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 1 items in 0.118844985962 seconds, next poll at 2016-02-05 11:20:57
2016-02-05 11:19:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 1, up 1 (booting 1, idle 0, busy 0), shutting down 0
2016-02-05 11:19:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.10505390167 seconds, next poll at 2016-02-05 11:20:57
2016-02-05 11:20:22 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 25.246958971 seconds, next poll at 2016-02-05 11:20:57
2016-02-05 11:20:22 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Registering new cloud node /subscriptions/a731f419-596b-4b64-a278-364e76506b06/resourceGroups/c97qk/providers/Microsoft.Compute/virtualMachines/compute-b4ksh6y4v1lyy5i-c97qk
2016-02-05 11:20:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:20:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:20:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 1 items in 0.153210163116 seconds, next poll at 2016-02-05 11:21:57
2016-02-05 11:20:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 1, up 1 (booting 1, idle 0, busy 0), shutting down 0
2016-02-05 11:20:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.10489988327 seconds, next poll at 2016-02-05 11:21:57
2016-02-05 11:21:21 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 23.8909299374 seconds, next poll at 2016-02-05 11:21:57
2016-02-05 11:21:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:21:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:21:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.164697170258 seconds, next poll at 2016-02-05 11:22:57
2016-02-05 11:21:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 1, idle 0, busy 0), shutting down 0
2016-02-05 11:21:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.03316402435 seconds, next poll at 2016-02-05 11:22:57
2016-02-05 11:21:58 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Cloud node compute-b4ksh6y4v1lyy5i-c97qk is now paired with Arvados node c97qk-7ekkf-b4ksh6y4v1lyy5i
2016-02-05 11:22:19 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 22.2098510265 seconds, next poll at 2016-02-05 11:22:57
2016-02-05 11:22:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:22:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:22:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.264312028885 seconds, next poll at 2016-02-05 11:23:57
2016-02-05 11:22:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0
2016-02-05 11:22:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.10327005386 seconds, next poll at 2016-02-05 11:23:57
2016-02-05 11:23:26 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 29.5173959732 seconds, next poll at 2016-02-05 11:23:57
2016-02-05 11:23:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:23:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:23:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.414124965668 seconds, next poll at 2016-02-05 11:24:57
2016-02-05 11:23:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0
2016-02-05 11:23:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.3468849659 seconds, next poll at 2016-02-05 11:24:57
2016-02-05 11:24:23 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 26.1978158951 seconds, next poll at 2016-02-05 11:24:57
2016-02-05 11:24:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:24:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:24:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.101679086685 seconds, next poll at 2016-02-05 11:25:57
2016-02-05 11:24:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 0, idle 0, busy 1), shutting down 0
2016-02-05 11:24:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 0.968893051147 seconds, next poll at 2016-02-05 11:25:57
2016-02-05 11:25:22 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 24.9726760387 seconds, next poll at 2016-02-05 11:25:57
2016-02-05 11:25:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:25:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:25:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.121600151062 seconds, next poll at 2016-02-05 11:26:57
2016-02-05 11:25:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 0, idle 1, busy 0), shutting down 0
2016-02-05 11:25:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 0.876556873322 seconds, next poll at 2016-02-05 11:26:57
2016-02-05 11:26:22 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 25.2926030159 seconds, next poll at 2016-02-05 11:26:57
2016-02-05 11:26:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:26:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:26:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.129631996155 seconds, next poll at 2016-02-05 11:27:57
2016-02-05 11:26:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 1 (booting 0, idle 1, busy 0), shutting down 0
2016-02-05 11:26:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.01245999336 seconds, next poll at 2016-02-05 11:27:57
2016-02-05 11:27:00 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/discovery/v1/apis/arvados/v1/rest
2016-02-05 11:27:01 ComputeNodeShutdownActor.77891d9d8642.compute-b4ksh6y4v1lyy5i-c97qk[1402] INFO: Starting shutdown
2016-02-05 11:27:20 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 23.3203639984 seconds, next poll at 2016-02-05 11:27:57
2016-02-05 11:27:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:27:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:27:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.0978491306305 seconds, next poll at 2016-02-05 11:28:57
2016-02-05 11:27:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 0.996257066727 seconds, next poll at 2016-02-05 11:28:57
2016-02-05 11:28:21 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 1 items in 24.1674129963 seconds, next poll at 2016-02-05 11:28:57
2016-02-05 11:28:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:28:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:28:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.149247169495 seconds, next poll at 2016-02-05 11:29:57
2016-02-05 11:28:58 ArvadosNodeListMonitorActor.139670005887584[1402] INFO: got response with 258 items in 1.09818291664 seconds, next poll at 2016-02-05 11:29:57
2016-02-05 11:29:04 ComputeNodeShutdownActor.77891d9d8642.compute-b4ksh6y4v1lyy5i-c97qk[1402] INFO: Shutdown success
2016-02-05 11:29:04 googleapiclient.discovery[1402] INFO: URL being requested: PUT https://c97qk.arvadosapi.com/arvados/v1/nodes/c97qk-7ekkf-b4ksh6y4v1lyy5i?alt=json
2016-02-05 11:29:05 ComputeNodeShutdownActor.77891d9d8642.compute-b4ksh6y4v1lyy5i-c97qk[1402] INFO: finished
2016-02-05 11:29:05 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 1
2016-02-05 11:29:05 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 1
2016-02-05 11:29:22 CloudNodeListMonitorActor.139670033016992[1402] INFO: got response with 0 items in 25.129185915 seconds, next poll at 2016-02-05 11:29:57
2016-02-05 11:29:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/jobs/queue?alt=json
2016-02-05 11:29:57 googleapiclient.discovery[1402] INFO: URL being requested: GET https://c97qk.arvadosapi.com/arvados/v1/nodes?alt=json&limit=10000
2016-02-05 11:29:57 JobQueueMonitorActor.139669996440272[1402] INFO: got response with 0 items in 0.185049057007 seconds, next poll at 2016-02-05 11:30:57
2016-02-05 11:29:57 NodeManagerDaemonActor.1defb2a0d389[1402] INFO: Standard_D1: wishlist 0, up 0 (booting 0, idle 0, busy 0), shutting down 0

#14 Updated by Peter Amstutz almost 6 years ago

Bugs fixed:

  • ComputeNodeSetupActor and ComputeNodeShutdownActor would get stuck forever if an unhandled exception happened during node creation/shutdown. This will now result in the actor signaling an early finish to its subscribers.
  • ComputeNodeMonitorActor will consider shutdown after cloud_node_start_time + boot_fail_after, so failed nodes that were not booted in the current process still have a chance to be shut down.
  • Added exception handlers in other key places

#15 Updated by Brett Smith almost 6 years ago

Is this just open to go back and investgiate the original reported issue of not responding? That makes sense and it's fine if so, just checking for sprint planning purposes.

#16 Updated by Brett Smith almost 6 years ago

  • Status changed from New to In Progress

#17 Updated by Brett Smith almost 6 years ago

  • Assigned To deleted (Peter Amstutz)
  • Target version changed from 2016-02-17 Sprint to Arvados Future Sprints

#18 Updated by Brett Smith over 5 years ago

This happened again as of Node Manager 0.1.20160315133517. There was no additional information in the logs. The situation was exactly the same: CloudNodeListMonitorActor logged that it sent a request, and then was never heard from again.

#19 Updated by Brett Smith over 5 years ago

We expect this to be mitigated by #8236.

#20 Updated by Peter Amstutz over 3 years ago

  • Status changed from In Progress to Closed

Closing this since it is out of date.

#21 Updated by Tom Morris about 3 years ago

  • Target version deleted (Arvados Future Sprints)

Also available in: Atom PDF