Bug #5842

[Node Manager] Too many nodes being started

Added by Peter Amstutz over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Brett Smith
Category:
Node Manager
Target version:
Start date:
05/06/2015
Due date:
% Done:

100%

Estimated time:
(Total: 1.00 h)
Story points:
-

Description

Some time around 4:07 PM 4/28/2015 EDT, GATK Queue job qr1hi-8i9sb-e4ggp059unk0tjg started and queued 5 additional jobs. For some reason, NodeManager ended up starting 10 nodes.


Subtasks

Task #5931: Review 5842-node-manager-tighten-successful-boot-criteria-wip (Resolved, Tom Clegg)


Related issues

Related to Arvados - Bug #5917: Node Manager did not spin up enough nodes when I set min_nodes = 20 (Closed, 05/06/2015)

Associated revisions

Revision 88b02042
Added by Brett Smith over 6 years ago

Merge branch '5842-node-manager-tighten-successful-boot-criteria-wip'

Closes #5842, #5931.

History

#1 Updated by Peter Amstutz over 6 years ago

  • Description updated (diff)

#2 Updated by Brett Smith over 6 years ago

  • Subject changed from [NodeManager] Too many nodes being started to [Node Manager] Too many nodes being started
  • Category set to Node Manager

Theory: this is caused or exacerbated by a gap in time between when nodes ping Arvados and when they're resumed in SLURM.

Previously there was very little time between when a compute node sent its first ping to Arvados and when it was resumed in SLURM. (In fact, if I remember right, it would happen the other way around: it would be resumed, then ping Arvados.) Due to a combination of changes to the boot process generally, to the ping script specifically, and to details of how Puppet applies configuration to compute nodes on GCE, it now happens that a node comes up, pings Arvados, does a couple of Puppet runs that don't completely succeed, then finally does a Puppet run that does succeed, and resumes itself in SLURM. This means there's a 2-3 minute gap between the first ping and the time the node is usable.

I'm not sure what Node Manager makes of the state during that gap of time. Without checking the code, I'm guessing that the API server reports the node down, and so Node Manager writes it off (effectively considering it to be in that "I don't know what's going on with this so I'm not going to touch it in case it's doing something" state), and boots replacements.

If I'm right, we might consider making changes in the API server rather than in Node Manager. The crunch_worker_state attribute involves some arbitrary calculations that predate Node Manager. It might make sense to revisit it and return values that are more in line with the thinking about nodes reflected in Node Manager.
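One way to picture the API-server idea: report a distinct transitional state during the gap between the first ping and the SLURM resume, instead of lumping that window in with "down". This is only a sketch of the proposal; the field names (slurm_state, first_ping_at) and the grace period are assumptions, not the real API server code.

```python
# Hypothetical sketch: compute crunch_worker_state so a recently-pinged node
# that hasn't yet resumed in SLURM reports "startup" rather than "down".
# All names and the grace period here are assumptions for illustration.
from datetime import datetime, timedelta

BOOT_GRACE = timedelta(minutes=5)  # assumed allowance for the Puppet runs

def crunch_worker_state(node, now):
    # If SLURM already reports a live state, pass it through.
    if node.get('slurm_state') in ('idle', 'alloc'):
        return node['slurm_state']
    # If the node pinged Arvados recently, it's probably still booting.
    first_ping = node.get('first_ping_at')
    if first_ping is not None and now - first_ping < BOOT_GRACE:
        return 'startup'
    return 'down'
```

With a state like this, Node Manager could treat "startup" as "leave it alone, don't boot a replacement" rather than writing the node off.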

#3 Updated by Brett Smith over 6 years ago

Here's a log snippet that just happened that I think supports my hypothesis. We start just as a new cloud node comes up:

2015-05-05_17:46:57.54359 2015-05-05 17:46:57 arvnodeman.cloud_nodes[21918] DEBUG: CloudNodeListMonitorActor (at 35220240) got response with 2 items
2015-05-05_17:47:18.84474 2015-05-05 17:47:18 arvnodeman.nodeup[21918] INFO: Cloud node 9049066606946556894 created.
2015-05-05_17:47:18.84502 2015-05-05 17:47:18 arvnodeman.nodeup[21918] INFO: 9049066606946556894 post-create work done.
2015-05-05_17:47:18.84618 2015-05-05 17:47:18 pykka[21918] DEBUG: Unregistered ComputeNodeSetupActor (urn:uuid:36d81cfd-dff5-4408-bff5-0630ebe28ebd)
2015-05-05_17:47:18.84699 2015-05-05 17:47:18 pykka[21918] DEBUG: Stopped ComputeNodeSetupActor (urn:uuid:36d81cfd-dff5-4408-bff5-0630ebe28ebd)
2015-05-05_17:47:18.84730 2015-05-05 17:47:18 pykka[21918] DEBUG: Registered ComputeNodeMonitorActor (urn:uuid:20e98cfb-8cb1-41e9-b17b-612beb16f461)
2015-05-05_17:47:18.84739 2015-05-05 17:47:18 pykka[21918] DEBUG: Starting ComputeNodeMonitorActor (urn:uuid:20e98cfb-8cb1-41e9-b17b-612beb16f461)
2015-05-05_17:47:18.84770 2015-05-05 17:47:18 arvnodeman.computenode[21918] DEBUG: Node 9049066606946556894 shutdown window closed.  Next at Tue May  5 18:07:52 2015.
2015-05-05_17:47:18.84873 2015-05-05 17:47:18 arvnodeman.cloud_nodes[21918] DEBUG: <pykka.proxy._CallableProxy object at 0x327a850> subscribed to events for '9049066606946556894'
2015-05-05_17:47:51.55591 2015-05-05 17:47:51 arvnodeman.cloud_nodes[21918] DEBUG: CloudNodeListMonitorActor (at 35220240) sending poll
2015-05-05_17:47:51.55800 2015-05-05 17:47:51 arvnodeman.arvados_nodes[21918] DEBUG: ArvadosNodeListMonitorActor (at 40244048) sending poll
2015-05-05_17:47:51.56557 2015-05-05 17:47:51 arvnodeman.jobqueue[21918] DEBUG: JobQueueMonitorActor (at 40337680) sending poll
2015-05-05_17:47:51.71964 2015-05-05 17:47:51 arvnodeman.jobqueue[21918] DEBUG: Sending server wishlist: n1-highmem-16
2015-05-05_17:47:51.71973 2015-05-05 17:47:51 arvnodeman.jobqueue[21918] DEBUG: JobQueueMonitorActor (at 40337680) got response with 1 items
2015-05-05_17:47:52.16805 2015-05-05 17:47:52 arvnodeman.arvados_nodes[21918] DEBUG: ArvadosNodeListMonitorActor (at 40244048) got response with 140 items
2015-05-05_17:47:57.19377 2015-05-05 17:47:57 arvnodeman.cloud_nodes[21918] DEBUG: CloudNodeListMonitorActor (at 35220240) got response with 3 items

Note that when the daemon got the server wishlist here, it didn't act. That's because the daemon knew that it was already booting a node to fulfill the request. But sure enough, it appears in the cloud node listings, already paired, pretty promptly:

2015-05-05_17:47:57.19593 2015-05-05 17:47:57 arvnodeman.daemon[21918] INFO: Registering new cloud node 9049066606946556894
2015-05-05_17:47:57.20170 2015-05-05 17:47:57 arvnodeman.daemon[21918] INFO: Cloud node 9049066606946556894 has associated with Arvados node qr1hi-7ekkf-7m4c9o5d6cznak9
2015-05-05_17:47:57.20231 2015-05-05 17:47:57 arvnodeman.arvados_nodes[21918] DEBUG: <pykka.proxy._CallableProxy object at 0x25d7210> subscribed to events for 'qr1hi-7ekkf-7m4c9o5d6cznak9'
2015-05-05_17:48:51.55562 2015-05-05 17:48:51 arvnodeman.arvados_nodes[21918] DEBUG: ArvadosNodeListMonitorActor (at 40244048) sending poll
2015-05-05_17:48:51.55824 2015-05-05 17:48:51 arvnodeman.cloud_nodes[21918] DEBUG: CloudNodeListMonitorActor (at 35220240) sending poll
2015-05-05_17:48:51.56175 2015-05-05 17:48:51 arvnodeman.jobqueue[21918] DEBUG: JobQueueMonitorActor (at 40337680) sending poll
2015-05-05_17:48:51.72234 2015-05-05 17:48:51 arvnodeman.jobqueue[21918] DEBUG: Sending server wishlist: n1-highmem-16
2015-05-05_17:48:51.72254 2015-05-05 17:48:51 arvnodeman.jobqueue[21918] DEBUG: JobQueueMonitorActor (at 40337680) got response with 1 items
2015-05-05_17:48:51.72564 2015-05-05 17:48:51 arvnodeman.daemon[21918] INFO: Want 1 more nodes.  Booting a n1-highmem-16 node.

Now all of a sudden the daemon decides to boot a node—even though there hasn't been any change to the number of cloud nodes, or to the number of servers in the wishlist. I think this is because the daemon now considers the booting node booted—but also considers it unusable, presumably because its crunch_worker_state is not idle.

Proposed fix: the daemon should not consider a node fully up (in current parlance, moved from self.booted to self.cloud_nodes) until it pairs and has a crunch_worker_state that isn't down. Failure to move to a non-down state should be considered a bootstrap failure, just like failure to pair.
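The proposed criteria can be sketched as a small decision function. This is a simplified illustration of the rule described above, not the actual Node Manager code; the function names and the deadline_passed parameter are assumptions.

```python
# Hypothetical sketch of the tightened boot criteria: a node only graduates
# from "booting" once it has paired AND reports a non-down worker state;
# missing the deadline before then counts as a bootstrap failure.

def node_fully_up(arvados_node):
    """True once the cloud node has paired with an Arvados node record
    whose crunch_worker_state is something other than down."""
    if arvados_node is None:
        return False  # not yet paired
    return arvados_node.get('crunch_worker_state') not in (None, 'down')

def check_boot(arvados_node, deadline_passed):
    if node_fully_up(arvados_node):
        return 'booted'       # move from self.booted to self.cloud_nodes
    if deadline_passed:
        return 'boot_failed'  # treat like a pairing failure: shut it down
    return 'booting'          # keep waiting; don't boot a replacement
```

Under this rule, the node in the log above would still count as booting during the 17:47-17:48 window, so the daemon would not have started a second n1-highmem-16.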

#4 Updated by Brett Smith over 6 years ago

  • Assigned To set to Brett Smith
  • Target version changed from Bug Triage to 2015-05-20 sprint

Tackling now per discussion with Ward.

#5 Updated by Brett Smith over 6 years ago

  • Status changed from New to In Progress

#6 Updated by Tom Clegg over 6 years ago

_nodes_excess() also returns different (and now correct) numbers now that booting nodes are not counted in _nodes_busy(), right? E.g., if we have 5 booting + 5 idle, and len(wishlist)=5, _nodes_excess() used to return 10-5-5=0 but will now return 10-0-5=5.
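The arithmetic above, as a worked example (a simplified stand-in for _nodes_excess(), not the real implementation):

```python
# Simplified model of the excess-node calculation:
# excess = nodes up - nodes busy - nodes still wanted by the wishlist.

def nodes_excess(up, busy, wishlist_len):
    return up - busy - wishlist_len

booting, idle, wishlist_len = 5, 5, 5
up = booting + idle  # 10 nodes exist

# Old behavior: booting nodes counted in _nodes_busy()
old = nodes_excess(up, busy=booting, wishlist_len=wishlist_len)  # 10 - 5 - 5 = 0

# New behavior: booting nodes no longer counted in _nodes_busy()
new = nodes_excess(up, busy=0, wishlist_len=wishlist_len)        # 10 - 0 - 5 = 5
```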

LGTM, thanks.

#7 Updated by Brett Smith over 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:88b0204247e448ba5f715cb01adba4597bb6aae7.
