Story #5353

[Node Manager] Support multiple node sizes and boot new nodes correctly from them

Added by Bryan Cosca over 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Node Manager
Target version:
Start date:
03/02/2015
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
3.0

Description

Correct me if I'm wrong but if we're in the cloud, we're able to pick out the specs that we want on each node, in order to save compute costs. Because I'm betting that more RAM costs more money. I doubt this could be dynamically allocated, but with trial and error, a bioinformatician should know how much they need to allocate.

for example:
assume job 1 requires 1 node with 50GB of ram, 2 cores, 100GB local space.
assume job 2 requires 2 nodes with 10GB of ram, 5 cores, 500GB local space.

Implementation

The Node Manager daemon currently treats the node size wishlist as homogeneous. For this change, it effectively needs to consider each size to be a separate wishlist, and make boot/shutdown decisions accordingly.

For each size S:

  • If there are more S nodes in the wishlist than S idle nodes running in the cloud, make sure a new S is booting.
  • If an S node is eligible for shutdown, and there are more S idle nodes running in the cloud than there are in the wishlist, start shutting down the node.
  • I'm not sure how often this will come up, but if it ever makes sense: it would generally be better to act on requests for smaller sizes before larger ones. This will help ensure that jobs that can fit in smaller nodes are dispatched to them, helping keep larger nodes available for jobs that actually require them. We understand that, due to limitations in Crunch, we won't always get the most cost-effective match, and that's fine. This change to Node Manager will make it easier for us to improve Crunch later.

Whenever the daemon currently accounts for booting or shutting down nodes in its math, you're going to have to do the same, but filtering the results out by size. This might be a reasonable time to refactor the daemon's internal data structures to make this easier.


Subtasks

Task #7678: Write testsResolvedPeter Amstutz

Task #7681: Update documentationResolvedPeter Amstutz

Task #7841: Test pipeline with multiple node sizesResolvedPeter Amstutz

Task #7679: Review 5353-node-sizesResolvedPeter Amstutz

Task #7677: Refactor to support multiple node typesResolvedPeter Amstutz

Associated revisions

Revision 58692c91
Added by Peter Amstutz almost 4 years ago

Merge branch '5353-node-sizes' closes #5353

Revision 76683cdf
Added by Peter Amstutz almost 4 years ago

Merge branch '5353-set-node-size' refs #5353

Revision d14d34b5 (diff)
Added by Peter Amstutz almost 4 years ago

Fix for _size_shutdowns and node prices in node manager refs #5353

Revision 8b3478bd
Added by Peter Amstutz almost 4 years ago

Merge branch '5353-booted-size' refs #5353

Revision 856c4e26 (diff)
Added by Peter Amstutz almost 4 years ago

Fix Azure node listing in node manager. refs #5353

History

#1 Updated by Brett Smith over 4 years ago

  • Subject changed from Allocating the correct amount of resources for each job in a pipeline to [Node Manager] Support multiple node sizes and boot new nodes correctly from them
  • Category set to Node Manager
  • Target version set to Arvados Future Sprints

#2 Updated by Brett Smith almost 4 years ago

The only thing we know we need to do here is that the daemon needs to count nodes by size, and make boot/shutdown decisions accordingly. Right now it just does a naive count of all nodes, and that won't be sufficient with multiple sizes.

The Azure ARM driver doesn't include cost information for node sizes, but that's not a blocker because we can fill in that information in the Node Manager configuration under each size.

#3 Updated by Peter Amstutz almost 4 years ago

Azure ARM billing API provides price information:

https://msdn.microsoft.com/en-us/library/azure/mt219004.aspx

#4 Updated by Brett Smith almost 4 years ago

  • Description updated (diff)
  • Target version changed from Arvados Future Sprints to 2015-11-11 sprint

#5 Updated by Brett Smith almost 4 years ago

  • Story points set to 3.0

#6 Updated by Peter Amstutz almost 4 years ago

  • Assigned To set to Peter Amstutz

#7 Updated by Peter Amstutz almost 4 years ago

- idea : wishlist entry can be list of several acceptable sizes instead of exactly one size

Algorithm:

inputs: wishlist, idle node list (including booting/booted) sorted by price
lowest-highest

While both lists still have items:
- Get front of wishlist
- Get 1st idle node that fit wishlist item
- Pop wishlist node & idle node

If idle node list is empty & wishlist isn't:
- Take 1st item in wishlist & boot cheapest node
- Schedule new run of algorithm

If wishlist is empty & idle node list isn't:
- Shut down most expensive idle node
- Schedule new run of algorithm

Notes:

- To schedule a smaller job on a bigger node immediately (instead of waiting for
smaller node to boot up) return multiple acceptable sizes in wishlist.

- If we want to run jobs on exactly the cheapest node size, then return exactly
one node size for each item in the wishlist; this won't assign wishlist items
to bigger nodes and instead will boot more nodes of the desired size. Once
idle list fufills wishlist, will start shutting down larger nodes that are
unused.

Brett, does this make sense?

#8 Updated by Brett Smith almost 4 years ago

Peter Amstutz wrote:

While both lists still have items:
- Get front of wishlist
- Get 1st idle node that fit wishlist item
- Pop wishlist node & idle node

Just a heads up: list.pop(0) is shockingly expensive in Python. You'll either want to sort most-to-least expensive so you can just do list.pop() (cheap), or use collections.deque.

If idle node list is empty & wishlist isn't:
- Take 1st item in wishlist & boot cheapest node
- Schedule new run of algorithm

For what it's worth, the current code does its "compare the wishlist vs. current nodes state, boot node if needed" logic each time it gets an updated wishlist. Since that happens regularly, it converges on the desired state pretty quickly, without acting overly fast. I believe you could follow that same basic strategy, rather than having separate scheduling.

If wishlist is empty & idle node list isn't:
- Shut down most expensive idle node
- Schedule new run of algorithm

This doesn't take cloud node billing cycles into account. We only make shutdown decisions when the node reports that it's eligible for shutdown (which considers both idleness and the billing cycle), and I think we should continue to follow that. You may need to store the manipulated wishlist+node list, and then use that information to decide whether or not to go ahead with a shutdown when a ComputeNodeMonitorActor sends the daemon node_can_shutdown.

Notes:

- To schedule a smaller job on a bigger node immediately (instead of waiting for
smaller node to boot up) return multiple acceptable sizes in wishlist.

You know that, fundamentally, the dispatch decision is not Node Manager's, but crunch-dispatch's. If idle nodes are available that satisfy the job, crunch-dispatch will use them, no matter what. Node Manager only has this level of control when there are N large nodes up, and a job needs M > N smaller nodes to run. Node Manager then has to decide whether to shut down the larger nodes (assuming that makes sense in the billing cycle), and boot M small nodes, or just leave the large nodes up and boot M - N small nodes.

That decision is fundamentally one of cost optimization that different administrators will decide differently. It's also one where the outcomes change as other components (crunch-dispatch, Crunch v2) change.

- If we want to run jobs on exactly the cheapest node size, then return exactly
one node size for each item in the wishlist; this won't assign wishlist items
to bigger nodes and instead will boot more nodes of the desired size. Once
idle list fufills wishlist, will start shutting down larger nodes that are
unused.

I think this will meet most of our users' needs for a while, and it will better suit Crunch v2 where there are only containers and they can't request multiple nodes. In the interest of keeping the story narrow, can we just plan on implementing only this strategy? This way we can avoid making larger changes to the wishlist actor, and we don't have to expose configuration knobs to the administrator to let them declare what they're willing to optimize for and how.

Brett, does this make sense?

Yes.

#9 Updated by Peter Amstutz almost 4 years ago

Brett Smith wrote:

If idle node list is empty & wishlist isn't:
- Take 1st item in wishlist & boot cheapest node
- Schedule new run of algorithm

For what it's worth, the current code does its "compare the wishlist vs. current nodes state, boot node if needed" logic each time it gets an updated wishlist. Since that happens regularly, it converges on the desired state pretty quickly, without acting overly fast. I believe you could follow that same basic strategy, rather than having separate scheduling.

I think you're misremembering. In the existing code, start_node() computes nodes_wanted(), boots one node if wanted > 0, and then reschedules start_node() if wanted > 1. I intend to preserve that behavior, except that nodes_wanted() will be returning a list of sizes rather than an integer.

If wishlist is empty & idle node list isn't:
- Shut down most expensive idle node
- Schedule new run of algorithm

This doesn't take cloud node billing cycles into account. We only make shutdown decisions when the node reports that it's eligible for shutdown (which considers both idleness and the billing cycle), and I think we should continue to follow that. You may need to store the manipulated wishlist+node list, and then use that information to decide whether or not to go ahead with a shutdown when a ComputeNodeMonitorActor sends the daemon node_can_shutdown.

To clarify, I meant "shut down the most expensive node that is eligible for shutdown" by first checking shutdown_eligible() (which takes billing cycles into account). However I see now that the decision is made on a ping from the monitor actor in node_can_shutdown().

Notes:

- To schedule a smaller job on a bigger node immediately (instead of waiting for
smaller node to boot up) return multiple acceptable sizes in wishlist.

You know that, fundamentally, the dispatch decision is not Node Manager's, but crunch-dispatch's. If idle nodes are available that satisfy the job, crunch-dispatch will use them, no matter what. Node Manager only has this level of control when there are N large nodes up, and a job needs M > N smaller nodes to run. Node Manager then has to decide whether to shut down the larger nodes (assuming that makes sense in the billing cycle), and boot M small nodes, or just leave the large nodes up and boot M - N small nodes.

That decision is fundamentally one of cost optimization that different administrators will decide differently. It's also one where the outcomes change as other components (crunch-dispatch, Crunch v2) change.

- If we want to run jobs on exactly the cheapest node size, then return exactly
one node size for each item in the wishlist; this won't assign wishlist items
to bigger nodes and instead will boot more nodes of the desired size. Once
idle list fufills wishlist, will start shutting down larger nodes that are
unused.

I think this will meet most of our users' needs for a while, and it will better suit Crunch v2 where there are only containers and they can't request multiple nodes. In the interest of keeping the story narrow, can we just plan on implementing only this strategy? This way we can avoid making larger changes to the wishlist actor, and we don't have to expose configuration knobs to the administrator to let them declare what they're willing to optimize for and how.

Ok, here's how I see it working:

  1. Let's say that job A uses 5 big nodes and job B uses 20 small nodes
  2. Job A completes, there are 5 nodes running
  3. The wishlist returns small node * 20
  4. The 5 big nodes, despite being idle, can't fulfill the wishlist request (not an exact match), so it boots 20 small nodes
  5. Ideally, the 5 big nodes are shut down before 15 small nodes are available (so that the big nodes are not allocated for the next job); if that doesn't happen, then some of the big nodes will get allocated to the job, but there will still be some small nodes booting. In that case we can try to cancel the booting nodes.

#10 Updated by Peter Amstutz almost 4 years ago

New strategy:

  1. Parameterize _nodes_up(), _nodes_missing(), _nodes_busy(), _nodes_wanted(), _nodes_excess(), start_node(), stop_booting_node() on node size.
  2. Modify update_server_wishlist() to iterate over each unique node size in the wishlist.

#11 Updated by Brett Smith almost 4 years ago

That sounds good to me.

#12 Updated by Peter Amstutz almost 4 years ago

  • Status changed from New to In Progress

#13 Updated by Brett Smith almost 4 years ago

  • Target version changed from 2015-11-11 sprint to 2015-12-02 sprint

#14 Updated by Tom Clegg almost 4 years ago

{azure,gce,ec2}.example.cfg should have the comments updated, and perhaps list two sizes each, to make it obvious how to specify them.

I assume the rule is "you'd better specify prices if you list more than one node size" -- that (or whatever the rule really is) should be noted in the comments, and the example(s) should have price(s) listed.

Should we also remove "(N.B.: defining more than one size has not been tested yet)."? Or is it too early...

I wonder if we should even sanity-check the size list, and error out if there are multiple node sizes listed but some have no prices? It seems like that could lead to "seems to be working fine, but then nodemanager restarts with a different random seed and sorts the "None" values differently and now it's spending lots of money"...

"max_nodes" seems to mean "max nodes of each type" now. I'm inclined to say it should mean max nodes total instead. If we want a per-node-type limit it would be better if you could say "max 24 cheap nodes, max 4 expensive nodes", e.g., a separate max_nodes in each "size" section, in addition to the existing overall max_nodes. max_total_price might be a useful config, too. I think any one of these features is fine for a starting point, we don't need all of them.

Whichever way we do it, we should have a test case for max_nodes and multiple node types.

The way "min_nodes" is implemented, it's effectively a per-size minimum that can only be applied to the cheapest size (i.e., if min_nodes==1 and there is 1 node running but it's not the cheapest size, we start a new cheap node). This seems reasonable enough, but it should be documented. Probably defer: As with max_nodes, it might make more sense to move the config knob into the [Size X] section. That way it would be obvious what it does, and the feature would be more flexible.

Is find_size(self, sz) useful? AFAICT it isn't used. (It seems to assure you that the given size object has a recognized id, but I'm not sure where that would be uncertain...)

Seems like ServerCalculatorTestCase should have at least one test that ends up with a heterogeneous wish list, to make sure the job queue gets correctly translated to the cheapest qualifying wish list.

The rest LGTM.

#15 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

{azure,gce,ec2}.example.cfg should have the comments updated, and perhaps list two sizes each, to make it obvious how to specify them.

I assume the rule is "you'd better specify prices if you list more than one node size" -- that (or whatever the rule really is) should be noted in the comments, and the example(s) should have price(s) listed.

Should we also remove "(N.B.: defining more than one size has not been tested yet)."? Or is it too early...

I wonder if we should even sanity-check the size list, and error out if there are multiple node sizes listed but some have no prices? It seems like that could lead to "seems to be working fine, but then nodemanager restarts with a different random seed and sorts the "None" values differently and now it's spending lots of money"...

"max_nodes" seems to mean "max nodes of each type" now. I'm inclined to say it should mean max nodes total instead.

It already means max total nodes: total_up_count = self._nodes_up(None) (where "None" means "all sizes").

If we want a per-node-type limit it would be better if you could say "max 24 cheap nodes, max 4 expensive nodes", e.g., a separate max_nodes in each "size" section, in addition to the existing overall max_nodes. max_total_price might be a useful config, too. I think any one of these features is fine for a starting point, we don't need all of them.

Setting a maximum nodes per size seems less than ideal. For example, because it the wishlist provides exactly one size per job (already discussed above) if there were already 24 nodes booted, and there was a request for 4 more cheap nodes, it would not boot be able to boot 4 more cheap nodes, but it wouldn't boot 4 expensive nodes either, the jobs would just idle in the queue.

However I really like the idea of setting "max_total_price" (spending per hour) since that's what we're actually trying to control by reasoning about node sizes and limits.

Whichever way we do it, we should have a test case for max_nodes and multiple node types.

The way "min_nodes" is implemented, it's effectively a per-size minimum that can only be applied to the cheapest size (i.e., if min_nodes==1 and there is 1 node running but it's not the cheapest size, we start a new cheap node). This seems reasonable enough, but it should be documented. Probably defer: As with max_nodes, it might make more sense to move the config knob into the [Size X] section. That way it would be obvious what it does, and the feature would be more flexible.

I can document it to the comments in the example configurations, not sure where else. We don't have a setup guide for Node manager.

Is find_size(self, sz) useful? AFAICT it isn't used. (It seems to assure you that the given size object has a recognized id, but I'm not sure where that would be uncertain...)

The purpose was to map from a libcloud NodeSize object to our internal CloudSizeWrapper object. However, I decided to tackle the problem differently but forgot to remove it.

Seems like ServerCalculatorTestCase should have at least one test that ends up with a heterogeneous wish list, to make sure the job queue gets correctly translated to the cheapest qualifying wish list.

Good idea.

The rest LGTM.

#16 Updated by Brett Smith almost 4 years ago

Tom Clegg wrote:

I assume the rule is "you'd better specify prices if you list more than one node size" -- that (or whatever the rule really is) should be noted in the comments, and the example(s) should have price(s) listed.

Most cloud drivers provide cost information through the API. The Azure driver currently doesn't support this, but the others do.

I wonder if we should even sanity-check the size list, and error out if there are multiple node sizes listed but some have no prices?

Assuming the code uses the cost information provided by drivers above, I'm inclined to say this is more trouble than it's worth. It would basically be a temporary workaround until the Azure driver learns to query cost information.

"max_nodes" seems to mean "max nodes of each type" now. I'm inclined to say it should mean max nodes total instead.

I think "total maximum" better matches the original intent of the knob, but the opinion I'm most interested in here is ops'.

If we want a per-node-type limit it would be better if you could say "max 24 cheap nodes, max 4 expensive nodes", e.g., a separate max_nodes in each "size" section, in addition to the existing overall max_nodes.

This can be a separate story.

#17 Updated by Peter Amstutz almost 4 years ago

  • Added max_total_price
  • Updated example configs.
  • Max nodes is already maximum total nodes, behavior unchanged
  • Documented behavior of min_nodes (default to smallest) in comment of example configs.
  • Added more tests

Now at f16505d

#18 Updated by Peter Amstutz almost 4 years ago

Bumped to ba9ea75 with fixes from manual integration testing with dummy driver.

#19 Updated by Tom Clegg almost 4 years ago

at ba9ea75

Documented behavior of min_nodes (default to smallest) in comment of example configs

I think the phrase "By default, this will be the cheapest node size" doesn't explain the behavior very well: that sounds to me like "if you don't specify otherwise, your min_nodes will be cheap nodes". I think we should state outright that: if we need to start new idle nodes for the purpose of satisfying min_nodes, we will use the cheapest node type; but, depending on usage patterns, we will also sometimes satisfy min_nodes by keeping alive some more-expensive nodes. (Assuming I'm reading the code correctly this time, of course.)

Rest of the above fixes LGTM, thanks. Some problems introduced with the total price limit, though.

In the following block:
  • should be (total_price + size.price * wanted) ...?
  • should return wanted after the condition (I'm getting lots of test failures, I assume this is related)?
  • should return the maximum number of nodes in 0 <= N <= wanted without exceeding max_price, rather than "all or nothing"? (I'm imagining two 7-node jobs in the queue, we have 6 now, and the max we can afford is 10: it would be better to start 4 new nodes and make progress than to stay stuck at 6 forever.)
        wanted = self._size_wishlist(size) - up_count
        if wanted > 0 and self.max_total_price and ((total_price + size.price) > self.max_total_price):
                self._logger.info("Not booting %s (price %s) because with it would exceed max_total_price of %s (current total_price is %s)",
                                  size.name, size.price, self.max_total_price, total_price)
                return 0

        return

ServerCalculator.servers_for_queue checks want_count <= self.max_nodes, and I'm assuming this is how we avoid the situation where a job wants 12, max_nodes is 10, and we leave 10 running and spend forever wishing for 2 more. How can we avoid the analogous "expensive deadlock" situation with max_price?

Let's add a test case along the lines of ServerCalculatorTestCase.test_ignore_unsatisfiable_jobs to ensure we don't do anything horrible when a job costs more than max_total_price.

Given the busywait, is the assertEqual here superfluous, or is there something more subtle happening? I suppose stop_proxy here is where we would say "wait for all messages to be processed" if we had such a thing. Does this code really assure us node_setup.start won't get called a fourth time, or would a "max_price ignored" bug make this test outcome start depending on a race? (If so, it's worth a comment to help future debugging/understanding.)

        self.busywait(lambda: self.node_setup.start.call_count == 3)
        booting = self.daemon.booting.get()
        self.stop_proxy(self.daemon)
        self.assertEqual(3, self.node_setup.start.call_count)

This looks like a stray debug printf in test_node_max_price:

logging.info(sizecounts)

This comment means "there are multiple correct answers, but this is the one the current code gives us", right?

        # The way the update_server_wishlist() works effectively results in a
        # round-robin creation of one node of each size in the wishlist
        self.assertEqual(2, sizecounts[small.id])
        self.assertEqual(1, sizecounts[big.id])

#20 Updated by Peter Amstutz almost 4 years ago

Tom Clegg wrote:

at ba9ea75

Documented behavior of min_nodes (default to smallest) in comment of example configs

I think the phrase "By default, this will be the cheapest node size" doesn't explain the behavior very well: that sounds to me like "if you don't specify otherwise, your min_nodes will be cheap nodes". I think we should state outright that: if we need to start new idle nodes for the purpose of satisfying min_nodes, we will use the cheapest node type; but, depending on usage patterns, we will also sometimes satisfy min_nodes by keeping alive some more-expensive nodes. (Assuming I'm reading the code correctly this time, of course.)

Updated example configs with a more explicit comment.

Rest of the above fixes LGTM, thanks. Some problems introduced with the total price limit, though.

In the following block:
  • should be (total_price + size.price * wanted) ...?

Because start_node() only boots one node at a time and then rechecks _nodes_wanted(), checking the price of adding one additional node at a time achieves the desired behavior. But I agree that it is a little confusing (fixed as noted below).

  • should return wanted after the condition (I'm getting lots of test failures, I assume this is related)?

You're right, that's a typo. Shame on me for not running the tests again.

  • should return the maximum number of nodes in 0 <= N <= wanted without exceeding max_price, rather than "all or nothing"? (I'm imagining two 7-node jobs in the queue, we have 6 now, and the max we can afford is 10: it would be better to start 4 new nodes and make progress than to stay stuck at 6 forever.)

Now it adjusts the nodes wanted based on the price cap (so if it wants up to 6 but can only afford 4, it will return 4).

ServerCalculator.servers_for_queue checks want_count <= self.max_nodes, and I'm assuming this is how we avoid the situation where a job wants 12, max_nodes is 10, and we leave 10 running and spend forever wishing for 2 more. How can we avoid the analogous "expensive deadlock" situation with max_price?

Addressed by the taking the same approach as max_nodes, checking (want_count * price <= max_price).

Let's add a test case along the lines of ServerCalculatorTestCase.test_ignore_unsatisfiable_jobs to ensure we don't do anything horrible when a job costs more than max_total_price.

Done.

Given the busywait, is the assertEqual here superfluous, or is there something more subtle happening? I suppose stop_proxy here is where we would say "wait for all messages to be processed" if we had such a thing. Does this code really assure us node_setup.start won't get called a fourth time, or would a "max_price ignored" bug make this test outcome start depending on a race? (If so, it's worth a comment to help future debugging/understanding.)

So the busywait already asserts the expected condition when it comes out of the loop, so the additional assertion is unnecessary. A "max_price ignored" bug resulting in an addition node created will result in test failure (just tried it). It's potentially still a little racy but I think it's the best we can do without more help from Pykka (Brett and I discussed this, unfortunately there's no "wait until actor is quiescent and then stop" method).

This looks like a stray debug printf in test_node_max_price:

logging.info(sizecounts)

Fixed.

This comment means "there are multiple correct answers, but this is the one the current code gives us", right?

Yes, I expanded the comment to explain that better.

Now at b2ca3a0

#21 Updated by Tom Clegg almost 4 years ago

LGTM, thanks.

#22 Updated by Peter Amstutz almost 4 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados|commit:58692c916bb6dfe2997838ca4147109d9410c86a.

#23 Updated by Peter Amstutz almost 4 years ago

  • Status changed from Resolved to In Progress

Reopening because it turns out that ec2 and azure don't populate the libcloud Node.size field, but the code was written on the assumption that it would be there. It turns out the information is available in "extra", we just need to get it out manually.

#24 Updated by Brett Smith over 3 years ago

  • Status changed from In Progress to Resolved

The bug noted in Peter's last comment has since been fixed.

The GCE driver code has not been tested but we'll handle that separately. Right now we don't even have a place to test it.

Also available in: Atom PDF