Idea #4294
closed[Node Manager] Add support for min_nodes configuration value
Description
When the daemon gets a new list of cloud nodes, if the total count of nodes is less than min_nodes, it should start one up.
The daemon should decline any shutdown offers from nodes when the total count of nodes is <= min_nodes.
Updated by Brett Smith about 10 years ago
- Category set to Node Manager
- Target version set to Bug Triage
Updated by Ward Vandewege about 10 years ago
- Target version changed from Bug Triage to Arvados Future Sprints
Updated by Ward Vandewege about 10 years ago
- Target version changed from Arvados Future Sprints to 2014-11-19 sprint
Updated by Tim Pierce about 10 years ago
The tricky part with this feature is that it puts NodeManager in the position of needing to spin up new nodes without knowing how big they should be. When creating new nodes to satisfy min_nodes, do we want to do anything other than "create the smallest/cheapest node possible"?
Updated by Tim Pierce about 10 years ago
Ready for review at 3ee90fd
Because of the way the job queue is implemented, the min_nodes constraint is enforced in two places:- to make sure that at least min_nodes are brought up when the server gets a new wishlist, the jobqueue.ServerCalculator class ensures that the wishlist always has at least min_nodes elements in it.
- to make sure that nodes are not spun down, the NodeManagerDaemonActor factors the min_nodes setting into the
_nodes_excess()
call.
Updated by Brett Smith about 10 years ago
Reviewing 3ee90fd
- I know the story doesn't spec this, but it occurred to me relatively late: the daemon should spin up servers when the current number of cloud nodes is below min_nodes. Would it be a big deal to add that? (I think it's just a matter of a test and a change to
_nodes_wanted
…) - Please add the min_nodes setting to the example configuration files in doc/. Those are the best documentation Node Manager has right now, so it's valuable for it to have a complete catalog of available options.
- Please add a ServerCalculator test that ensures it only goes up to min_nodes when there's a job in the queue (e.g., we don't always extend the server list by min_nodes).
- Just as a readability thing,
idle_nodes
in_nodes_excess
seems like the wrong name for the variable, since it starts with, well, the number of nodes that aren't idle. :)
Thanks.
Updated by Tim Pierce about 10 years ago
Brett Smith wrote:
Reviewing 3ee90fd
- I know the story doesn't spec this, but it occurred to me relatively late: the daemon should spin up servers when the current number of cloud nodes is below min_nodes. Would it be a big deal to add that? (I think it's just a matter of a test and a change to
_nodes_wanted
…)
So, that's basically what I tried to do first, only I ran into trouble when start_node() tried to spin up a node when there weren't enough nodes in the wishlist, and that's when I talked to you.
Can you walk me through how this happens? Under what circumstances would the current number of cloud nodes be fewer than the current wishlist specifies?
- Please add the min_nodes setting to the example configuration files in doc/. Those are the best documentation Node Manager has right now, so it's valuable for it to have a complete catalog of available options.
Good thought, done (ca136c5)
- Please add a ServerCalculator test that ensures it only goes up to min_nodes when there's a job in the queue (e.g., we don't always extend the server list by min_nodes).
Hm. I thought that's actually what we wanted to do. In fact, I think test_server_calc_returns_at_least_min_nodes
confirms that we do always spin up at least min_nodes even if there isn't a job in the queue. Sounds like I need more clarity on the existing and desired behavior here?
- Just as a readability thing,
idle_nodes
in_nodes_excess
seems like the wrong name for the variable, since it starts with, well, the number of nodes that aren't idle. :)
Ah, my fault for making rapid changes to the logic and not updating method/variable names to reflect those changes.
Updated by Brett Smith about 10 years ago
Tim Pierce wrote:
Brett Smith wrote:
Reviewing 3ee90fd
- I know the story doesn't spec this, but it occurred to me relatively late: the daemon should spin up servers when the current number of cloud nodes is below min_nodes. Would it be a big deal to add that? (I think it's just a matter of a test and a change to
_nodes_wanted
…)So, that's basically what I tried to do first, only I ran into trouble when start_node() tried to spin up a node when there weren't enough nodes in the wishlist, and that's when I talked to you.
Can you walk me through how this happens? Under what circumstances would the current number of cloud nodes be fewer than the current wishlist specifies?
Ohh, no, you're fine. I remembered that conversation, and I figured the changes to ServerCalculator were the way to deal with that, but I didn't think it through the rest of the way to realize that no change to _nodes_wanted would be necessary. What you have now is fine, thanks.
- Please add a ServerCalculator test that ensures it only goes up to min_nodes when there's a job in the queue (e.g., we don't always extend the server list by min_nodes).
Hm. I thought that's actually what we wanted to do. In fact, I think
test_server_calc_returns_at_least_min_nodes
confirms that we do always spin up at least min_nodes even if there isn't a job in the queue. Sounds like I need more clarity on the existing and desired behavior here?
Sorry, unclear language on my part. Imagine a horribly buggy implementation that unconditionally did server_list.extend([cheapest_size.real] * self.min_nodes)
. The current test would pass because that code looks like it's right for the specific case where the job queue is empty. I'm asking for a second test to show that it handles both the "job queue has jobs" and "job queue is empty" cases correctly.
Updated by Tim Pierce about 10 years ago
Brett Smith wrote:
Oh, that makes sense. 7e8b026 has two new test cases:Sorry, unclear language on my part. Imagine a horribly buggy implementation that unconditionally did
server_list.extend([cheapest_size.real] * self.min_nodes)
. The current test would pass because that code looks like it's right for the specific case where the job queue is empty. I'm asking for a second test to show that it handles both the "job queue has jobs" and "job queue is empty" cases correctly.
- test_server_calc_min_nodes_1_job
- ServerCalculator returns a wishlist with min_nodes when there is one job in the queue
- test_server_calc_more_jobs_than_min_nodes
- ServerCalculator returns a wishlist exactly as long as the job queue, when the queue is between min_nodes and max_nodes
Updated by Tim Pierce about 10 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:e82cddfdbd763db27df889a6e316623df7b9c2a8.