Project

General

Profile

Actions

Idea #4294

closed

[Node Manager] Add support for min_nodes configuration value

Added by Brett Smith over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Tim Pierce
Category:
Node Manager
Target version:
Story points:
1.0

Description

When the daemon gets a new list of cloud nodes, if the total count of nodes is less than min_nodes, it should start one up.

The daemon should decline any shutdown offers from nodes when the total count of nodes is <= min_nodes.


Subtasks 4 (0 open4 closed)

Task #4355: min_nodes testsResolvedTim Pierce10/23/2014Actions
Task #4483: Review 4294-node-manager-min-nodesResolvedTim Pierce10/23/2014Actions
Task #4354: decline node shutdown if it brings the number of nodes lower than min_nodesResolvedTim Pierce10/23/2014Actions
Task #4353: spin up a new node if node list is less than min_nodesResolvedTim Pierce10/23/2014Actions
Actions #1

Updated by Brett Smith over 9 years ago

  • Category set to Node Manager
  • Target version set to Bug Triage
Actions #2

Updated by Brett Smith over 9 years ago

  • Tracker changed from Bug to Idea
Actions #3

Updated by Ward Vandewege over 9 years ago

  • Target version changed from Bug Triage to Arvados Future Sprints
Actions #4

Updated by Ward Vandewege over 9 years ago

  • Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Actions #5

Updated by Tim Pierce over 9 years ago

  • Assigned To set to Tim Pierce
Actions #6

Updated by Tim Pierce over 9 years ago

  • Status changed from New to In Progress
Actions #7

Updated by Tim Pierce over 9 years ago

The tricky part with this feature is that it puts NodeManager in the position of needing to spin up new nodes without knowing how big they should be. When creating new nodes to satisfy min_nodes, do we want to do anything other than "create the smallest/cheapest node possible"?

Actions #8

Updated by Tim Pierce over 9 years ago

Ready for review at 3ee90fd

Because of the way the job queue is implemented, the min_nodes constraint is enforced in two places:
  • to make sure that at least min_nodes are brought up when the server gets a new wishlist, the jobqueue.ServerCalculator class ensures that the wishlist always has at least min_nodes elements in it.
  • to make sure that nodes are not spun down, the NodeManagerDaemonActor factors the min_nodes setting into the _nodes_excess() call.
Actions #9

Updated by Brett Smith over 9 years ago

Reviewing 3ee90fd

  • I know the story doesn't spec this, but it occurred to me relatively late: the daemon should spin up servers when the current number of cloud nodes is below min_nodes. Would it be a big deal to add that? (I think it's just a matter of a test and a change to _nodes_wanted…)
  • Please add the min_nodes setting to the example configuration files in doc/. Those are the best documentation Node Manager has right now, so it's valuable for it to have a complete catalog of available options.
  • Please add a ServerCalculator test that ensures it only goes up to min_nodes when there's a job in the queue (e.g., we don't always extend the server list by min_nodes).
  • Just as a readability thing, idle_nodes in _nodes_excess seems like the wrong name for the variable, since it starts with, well, the number of nodes that aren't idle. :)

Thanks.

Actions #10

Updated by Tim Pierce over 9 years ago

Brett Smith wrote:

Reviewing 3ee90fd

  • I know the story doesn't spec this, but it occurred to me relatively late: the daemon should spin up servers when the current number of cloud nodes is below min_nodes. Would it be a big deal to add that? (I think it's just a matter of a test and a change to _nodes_wanted…)

So, that's basically what I tried to do first, only I ran into trouble when start_node() tried to spin up a node when there weren't enough nodes in the wishlist, and that's when I talked to you.

Can you walk me through how this happens? Under what circumstances would the current number of cloud nodes be fewer than the current wishlist specifies?

  • Please add the min_nodes setting to the example configuration files in doc/. Those are the best documentation Node Manager has right now, so it's valuable for it to have a complete catalog of available options.

Good thought, done (ca136c5)

  • Please add a ServerCalculator test that ensures it only goes up to min_nodes when there's a job in the queue (e.g., we don't always extend the server list by min_nodes).

Hm. I thought that's actually what we wanted to do. In fact, I think test_server_calc_returns_at_least_min_nodes confirms that we do always spin up at least min_nodes even if there isn't a job in the queue. Sounds like I need more clarity on the existing and desired behavior here?

  • Just as a readability thing, idle_nodes in _nodes_excess seems like the wrong name for the variable, since it starts with, well, the number of nodes that aren't idle. :)

Ah, my fault for making rapid changes to the logic and not updating method/variable names to reflect those changes.

Actions #11

Updated by Brett Smith over 9 years ago

Tim Pierce wrote:

Brett Smith wrote:

Reviewing 3ee90fd

  • I know the story doesn't spec this, but it occurred to me relatively late: the daemon should spin up servers when the current number of cloud nodes is below min_nodes. Would it be a big deal to add that? (I think it's just a matter of a test and a change to _nodes_wanted…)

So, that's basically what I tried to do first, only I ran into trouble when start_node() tried to spin up a node when there weren't enough nodes in the wishlist, and that's when I talked to you.

Can you walk me through how this happens? Under what circumstances would the current number of cloud nodes be fewer than the current wishlist specifies?

Ohh, no, you're fine. I remembered that conversation, and I figured the changes to ServerCalculator were the way to deal with that, but I didn't think it through the rest of the way to realize that no change to _nodes_wanted would be necessary. What you have now is fine, thanks.

  • Please add a ServerCalculator test that ensures it only goes up to min_nodes when there's a job in the queue (e.g., we don't always extend the server list by min_nodes).

Hm. I thought that's actually what we wanted to do. In fact, I think test_server_calc_returns_at_least_min_nodes confirms that we do always spin up at least min_nodes even if there isn't a job in the queue. Sounds like I need more clarity on the existing and desired behavior here?

Sorry, unclear language on my part. Imagine a horribly buggy implementation that unconditionally did server_list.extend([cheapest_size.real] * self.min_nodes). The current test would pass because that code looks like it's right for the specific case where the job queue is empty. I'm asking for a second test to show that it handles both the "job queue has jobs" and "job queue is empty" cases correctly.

Actions #12

Updated by Tim Pierce over 9 years ago

Brett Smith wrote:

Sorry, unclear language on my part. Imagine a horribly buggy implementation that unconditionally did server_list.extend([cheapest_size.real] * self.min_nodes). The current test would pass because that code looks like it's right for the specific case where the job queue is empty. I'm asking for a second test to show that it handles both the "job queue has jobs" and "job queue is empty" cases correctly.

Oh, that makes sense. 7e8b026 has two new test cases:
  • test_server_calc_min_nodes_1_job
    • ServerCalculator returns a wishlist with min_nodes when there is one job in the queue
  • test_server_calc_more_jobs_than_min_nodes
    • ServerCalculator returns a wishlist exactly as long as the job queue, when the queue is between min_nodes and max_nodes
Actions #13

Updated by Brett Smith over 9 years ago

7e8b0267 is good to merge, thanks.

Actions #14

Updated by Tim Pierce over 9 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:e82cddfdbd763db27df889a6e316623df7b9c2a8.

Actions

Also available in: Atom PDF