Arvados - Idea #5353: [Node Manager] Support multiple node sizes and boot new nodes correctly from them
https://dev.arvados.org/issues/5353

Updated by Brett Smith on 2015-04-03T21:35:27Z
<ul><li><strong>Subject</strong> changed from <i>Allocating the correct amount of resources for each job in a pipeline</i> to <i>[Node Manager] Support multiple node sizes and boot new nodes correctly from them</i></li><li><strong>Category</strong> set to <i>Node Manager</i></li><li><strong>Target version</strong> set to <i>Arvados Future Sprints</i></li></ul>

Updated by Brett Smith on 2015-10-27T18:25:02Z
<ul></ul><p>The only thing we <strong>know</strong> we need to do here is make the daemon count nodes by size and make boot/shutdown decisions accordingly. Right now it just does a naive count of all nodes, and that won't be sufficient with multiple sizes.</p>
<p>The Azure ARM driver doesn't include cost information for node sizes, but that's not a blocker because we can fill in that information in the Node Manager configuration under each size.</p>

Updated by Peter Amstutz on 2015-10-27T18:27:46Z
<ul></ul><p>Azure ARM billing API provides price information:</p>
<p><a class="external" href="https://msdn.microsoft.com/en-us/library/azure/mt219004.aspx">https://msdn.microsoft.com/en-us/library/azure/mt219004.aspx</a></p> Arvados - Idea #5353: [Node Manager] Support multiple node sizes and boot new nodes correctly from themhttps://dev.arvados.org/issues/5353?journal_id=318482015-10-28T14:50:37ZBrett Smithbrett.smith@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/31848/diff?detail_id=31263">diff</a>)</li><li><strong>Target version</strong> changed from <i>Arvados Future Sprints</i> to <i>2015-11-11 sprint</i></li></ul>

Updated by Brett Smith on 2015-10-28T19:27:04Z
<ul><li><strong>Story points</strong> set to <i>3.0</i></li></ul>

Updated by Peter Amstutz on 2015-10-28T19:38:13Z
<ul><li><strong>Assigned To</strong> set to <i>Peter Amstutz</i></li></ul>

Updated by Peter Amstutz on 2015-11-03T01:22:47Z
<ul></ul><p>- Idea: a wishlist entry can be a list of several acceptable sizes instead of exactly one size.</p>
<p>Algorithm:</p>
<p>Inputs: wishlist; idle node list (including booting/booted nodes), sorted by price from lowest to highest.</p>
<p>While both lists still have items (see the sketch after the notes below):<br /> - Get the front item of the wishlist<br /> - Get the 1st idle node that fits that wishlist item<br /> - Pop both the wishlist item & the idle node</p>
<p>If idle node list is empty & wishlist isn't:<br />- Take 1st item in wishlist & boot cheapest node<br />- Schedule new run of algorithm</p>
<p>If wishlist is empty & idle node list isn't:<br />- Shut down most expensive idle node<br />- Schedule new run of algorithm</p>
<p>Notes:</p>
<p>- To schedule a smaller job on a bigger node immediately (instead of waiting for<br /> smaller node to boot up) return multiple acceptable sizes in wishlist.</p>
<p>- If we want to run jobs on exactly the cheapest node size, then return exactly<br /> one node size for each item in the wishlist; this won't assign wishlist items<br /> to bigger nodes and instead will boot more nodes of the desired size. Once the<br /> idle list fulfills the wishlist, it will start shutting down larger nodes that are<br /> unused.</p>
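<p>A minimal sketch of the matching loop described above. Names like <code>match_wishlist</code>, the <code>.size</code> attribute, and the shape of the wishlist entries are illustrative assumptions, not the actual Node Manager API:</p>
<pre><code class="python">from collections import deque

def match_wishlist(wishlist, idle_nodes):
    """Pair each wishlist entry with the first idle node that satisfies it.

    wishlist: one entry per wanted slot; each entry is a collection of
        acceptable sizes, cheapest first.
    idle_nodes: node objects (each with a .size attribute), sorted by price
        from lowest to highest.
    Returns the unmatched remainder of both lists.
    """
    wishes = deque(wishlist)   # deque: popleft() is O(1), unlike list.pop(0)
    nodes = deque(idle_nodes)
    while wishes and nodes:
        acceptable = wishes[0]
        match = next((n for n in nodes if n.size in acceptable), None)
        if match is None:
            break              # front wish can't be satisfied by any idle node
        wishes.popleft()
        nodes.remove(match)
    # Leftover wishes: boot the cheapest acceptable size for each.
    # Leftover nodes: shutdown candidates, most expensive first.
    return wishes, nodes
</code></pre>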
<p>Brett, does this make sense?</p>

Updated by Brett Smith on 2015-11-03T16:06:25Z
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<p>While both lists still have items:<br />- Get front of wishlist<br />- Get 1st idle node that fit wishlist item<br />- Pop wishlist node & idle node</p>
</blockquote>
<p>Just a heads up: list.pop(0) is shockingly expensive in Python. You'll either want to sort most-to-least expensive so you can just do list.pop() (cheap), or use collections.deque.</p>
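<p>For illustration, both alternatives look like this (the <code>Item</code> tuple is a stand-in for wishlist entries, not the actual code):</p>
<pre><code class="python">from collections import deque, namedtuple

Item = namedtuple('Item', ['name', 'price'])   # stand-in for wishlist entries
items = [Item('big', 0.48), Item('small', 0.12), Item('medium', 0.24)]

# Option 1: keep the list cheapest-first and pop from the front of a deque, O(1).
wishlist = deque(sorted(items, key=lambda i: i.price))
first = wishlist.popleft()

# Option 2: sort most-to-least expensive and pop from the end of a plain list,
# which is also O(1); list.pop(0) would be O(n) on every call.
wishlist = sorted(items, key=lambda i: i.price, reverse=True)
cheapest = wishlist.pop()
</code></pre>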
<blockquote>
<p>If idle node list is empty & wishlist isn't:<br />- Take 1st item in wishlist & boot cheapest node<br />- Schedule new run of algorithm</p>
</blockquote>
<p>For what it's worth, the current code does its "compare the wishlist vs. current nodes state, boot node if needed" logic each time it gets an updated wishlist. Since that happens regularly, it converges on the desired state pretty quickly, without acting overly fast. I believe you could follow that same basic strategy, rather than having separate scheduling.</p>
<blockquote>
<p>If wishlist is empty & idle node list isn't:<br />- Shut down most expensive idle node<br />- Schedule new run of algorithm</p>
</blockquote>
<p>This doesn't take cloud node billing cycles into account. We only make shutdown decisions when the node reports that it's eligible for shutdown (which considers both idleness and the billing cycle), and I think we should continue to follow that. You may need to store the manipulated wishlist+node list, and then use that information to decide whether or not to go ahead with a shutdown when a ComputeNodeMonitorActor sends the daemon node_can_shutdown.</p>
<blockquote>
<p>Notes:</p>
<p>- To schedule a smaller job on a bigger node immediately (instead of waiting for<br />smaller node to boot up) return multiple acceptable sizes in wishlist.</p>
</blockquote>
<p>You know that, fundamentally, the dispatch decision is not Node Manager's, but crunch-dispatch's. If idle nodes are available that satisfy the job, crunch-dispatch will use them, no matter what. Node Manager only has this level of control when there are N large nodes up, and a job needs M > N smaller nodes to run. Node Manager then has to decide whether to shut down the larger nodes (assuming that makes sense in the billing cycle), and boot M small nodes, or just leave the large nodes up and boot M - N small nodes.</p>
<p>That decision is fundamentally one of cost optimization that different administrators will decide differently. It's also one where the outcomes change as other components (crunch-dispatch, Crunch v2) change.</p>
<blockquote>
<p>- If we want to run jobs on exactly the cheapest node size, then return exactly<br />one node size for each item in the wishlist; this won't assign wishlist items<br />to bigger nodes and instead will boot more nodes of the desired size. Once the<br />idle list fulfills the wishlist, it will start shutting down larger nodes that are<br />unused.</p>
</blockquote>
<p>I think this will meet most of our users' needs for a while, and it will better suit Crunch v2 where there are only containers and they can't request multiple nodes. In the interest of keeping the story narrow, can we just plan on implementing only this strategy? This way we can avoid making larger changes to the wishlist actor, and we don't have to expose configuration knobs to the administrator to let them declare what they're willing to optimize for and how.</p>
<blockquote>
<p>Brett, does this make sense?</p>
</blockquote>
<p>Yes.</p> Arvados - Idea #5353: [Node Manager] Support multiple node sizes and boot new nodes correctly from themhttps://dev.arvados.org/issues/5353?journal_id=321282015-11-05T18:12:51ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Brett Smith wrote:</p>
<blockquote><blockquote>
<p>If idle node list is empty & wishlist isn't:<br />- Take 1st item in wishlist & boot cheapest node<br />- Schedule new run of algorithm</p>
</blockquote>
<p>For what it's worth, the current code does its "compare the wishlist vs. current nodes state, boot node if needed" logic each time it gets an updated wishlist. Since that happens regularly, it converges on the desired state pretty quickly, without acting overly fast. I believe you could follow that same basic strategy, rather than having separate scheduling.</p>
</blockquote>
<p>I think you're misremembering. In the existing code, <code>start_node()</code> computes <code>nodes_wanted()</code>, boots one node if <code>wanted > 0</code>, and then reschedules <code>start_node()</code> if <code>wanted > 1</code>. I intend to preserve that behavior, except that <code>nodes_wanted()</code> will be returning a list of sizes rather than an integer.</p>
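<p>A rough sketch of that adaptation. The helpers <code>_idle_node_available</code> and <code>_begin_node_boot</code> and the <code>self._later</code> rescheduling proxy are illustrative assumptions, not the actual daemon code:</p>
<pre><code class="python">def _nodes_wanted(self):
    # Return a list of sizes still needed, one entry per node to boot,
    # instead of the old integer count.
    return [wish.size for wish in self.last_wishlist
            if not self._idle_node_available(wish)]

def start_node(self):
    wanted = self._nodes_wanted()
    if not wanted:
        return
    self._begin_node_boot(wanted[0])   # boot one node of the first wanted size
    if len(wanted) > 1:
        self._later.start_node()       # reschedule, as the existing code does
</code></pre>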
<blockquote><blockquote>
<p>If wishlist is empty & idle node list isn't:<br />- Shut down most expensive idle node<br />- Schedule new run of algorithm</p>
</blockquote>
<p>This doesn't take cloud node billing cycles into account. We only make shutdown decisions when the node reports that it's eligible for shutdown (which considers both idleness and the billing cycle), and I think we should continue to follow that. You may need to store the manipulated wishlist+node list, and then use that information to decide whether or not to go ahead with a shutdown when a ComputeNodeMonitorActor sends the daemon node_can_shutdown.</p>
</blockquote>
<p>To clarify, I meant "shut down the most expensive node that is eligible for shutdown" by first checking <code>shutdown_eligible()</code> (which takes billing cycles into account). However, I now see that the decision is made on a ping from the monitor actor in <code>node_can_shutdown()</code>.</p>
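<p>A sketch of what that decision point could look like: when a monitor actor reports its node is eligible, only shut it down if nodes of that size exceed what the current wishlist needs. Method and attribute names here are illustrative, not the actual daemon API:</p>
<pre><code class="python">def node_can_shutdown(self, node_actor):
    # Called when a ComputeNodeMonitorActor reports its node is idle and at a
    # suitable point in its billing cycle.
    size = node_actor.cloud_node.get().size     # hypothetical accessor
    if self._nodes_excess(size) > 0:            # more of this size than the wishlist needs
        self._begin_node_shutdown(node_actor)   # hypothetical helper
</code></pre>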
<blockquote><blockquote>
<p>Notes:</p>
<p>- To schedule a smaller job on a bigger node immediately (instead of waiting for<br />smaller node to boot up) return multiple acceptable sizes in wishlist.</p>
</blockquote>
<p>You know that, fundamentally, the dispatch decision is not Node Manager's, but crunch-dispatch's. If idle nodes are available that satisfy the job, crunch-dispatch will use them, no matter what. Node Manager only has this level of control when there are N large nodes up, and a job needs M > N smaller nodes to run. Node Manager then has to decide whether to shut down the larger nodes (assuming that makes sense in the billing cycle), and boot M small nodes, or just leave the large nodes up and boot M - N small nodes.</p>
<p>That decision is fundamentally one of cost optimization that different administrators will decide differently. It's also one where the outcomes change as other components (crunch-dispatch, Crunch v2) change.</p>
<blockquote>
<p>- If we want to run jobs on exactly the cheapest node size, then return exactly<br />one node size for each item in the wishlist; this won't assign wishlist items<br />to bigger nodes and instead will boot more nodes of the desired size. Once the<br />idle list fulfills the wishlist, it will start shutting down larger nodes that are<br />unused.</p>
</blockquote>
<p>I think this will meet most of our users' needs for a while, and it will better suit Crunch v2 where there are only containers and they can't request multiple nodes. In the interest of keeping the story narrow, can we just plan on implementing only this strategy? This way we can avoid making larger changes to the wishlist actor, and we don't have to expose configuration knobs to the administrator to let them declare what they're willing to optimize for and how.</p>
</blockquote>
<p>Ok, here's how I see it working:</p>
<ol>
<li>Let's say that job A uses 5 big nodes and job B uses 20 small nodes </li>
<li>Job A completes, leaving the 5 big nodes running but idle</li>
<li>The wishlist returns small node * 20</li>
<li>The 5 big nodes, despite being idle, can't fulfill the wishlist request (not an exact match), so Node Manager boots 20 small nodes</li>
<li>Ideally, the 5 big nodes are shut down before 15 small nodes are available (so that the big nodes are not allocated for the next job); if that doesn't happen, then some of the big nodes will get allocated to the job, but there will still be some small nodes booting. In that case we can try to cancel the booting nodes.</li>
</ol>

Updated by Peter Amstutz on 2015-11-05T18:31:56Z
<ul></ul><p>New strategy:</p>
<ol>
<li>Parameterize <code>_nodes_up()</code>, <code>_nodes_missing()</code>, <code>_nodes_busy()</code>, <code>_nodes_wanted()</code>, <code>_nodes_excess()</code>, <code>start_node()</code>, <code>stop_booting_node()</code> on node size.</li>
<li>Modify <code>update_server_wishlist()</code> to iterate over each unique node size in the wishlist (a sketch follows below).</li>
</ol>
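<p>A minimal sketch of that strategy, assuming a <code>size</code> attribute on wishlist entries, per-size variants of the daemon helpers, and the <code>self._later</code> proxy for deferred calls; this is illustrative, not the actual method bodies:</p>
<pre><code class="python">def update_server_wishlist(self, wishlist):
    self.last_wishlist = wishlist
    # Treat each unique size as its own mini-wishlist and make boot
    # decisions per size.
    sizes = {wish.size.id: wish.size for wish in wishlist}
    for size in sizes.values():
        if self._nodes_wanted(size) > 0:
            self._later.start_node(size)
</code></pre>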
Updated by Brett Smith on 2015-11-06T14:06:31Z

<ul></ul><p>That sounds good to me.</p>

Updated by Peter Amstutz on 2015-11-09T15:20:03Z
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul>

Updated by Brett Smith on 2015-11-12T13:51:37Z
<ul><li><strong>Target version</strong> changed from <i>2015-11-11 sprint</i> to <i>2015-12-02 sprint</i></li></ul>

Updated by Tom Clegg on 2015-11-12T21:39:31Z
<ul></ul><p>{azure,gce,ec2}.example.cfg should have the comments updated, and perhaps list two sizes each, to make it obvious how to specify them.</p>
<p>I assume the rule is "you'd better specify prices if you list more than one node size" -- that (or whatever the rule really is) should be noted in the comments, and the example(s) should have price(s) listed.</p>
<p>Should we also remove "(N.B.: defining more than one size has not been tested yet)."? Or is it too early...</p>
<p>I wonder if we should even sanity-check the size list, and error out if there are multiple node sizes listed but some have no prices? It seems like that could lead to "seems to be working fine, but then nodemanager restarts with a different random seed and sorts the "None" values differently and now it's spending lots of money"...</p>
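<p>A sketch of such a check at configuration-load time (the <code>sizes</code> argument and its <code>.id</code>/<code>.price</code> attributes are assumptions about the config objects, not the actual code):</p>
<pre><code class="python">def check_size_prices(sizes):
    # Fail fast if more than one size is configured but some have no price,
    # so node ordering never depends on how None prices happen to sort.
    if len(sizes) > 1:
        missing = [s.id for s in sizes if s.price is None]
        if missing:
            raise ValueError("node sizes configured without a price: "
                             + ", ".join(missing))
</code></pre>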
<p>"max_nodes" seems to mean "max nodes of each type" now. I'm inclined to say it should mean max nodes total instead. If we want a per-node-type limit it would be better if you could say "max 24 cheap nodes, max 4 expensive nodes", e.g., a separate max_nodes in each "size" section, in addition to the existing overall max_nodes. max_total_price might be a useful config, too. I think any one of these features is fine for a starting point, we don't need all of them.</p>
<p>Whichever way we do it, we should have a test case for max_nodes and multiple node types.</p>
<p>The way "min_nodes" is implemented, it's effectively a per-size minimum that can only be applied to the cheapest size (i.e., if min_nodes==1 and there is 1 node running but it's not the cheapest size, we start a new cheap node). This seems reasonable enough, but it should be documented. <em>Probably defer: As with max_nodes, it might make more sense to move the config knob into the <code>[Size X]</code> section. That way it would be obvious what it does, and the feature would be more flexible.</em></p>
<p>Is <code>find_size(self, sz)</code> useful? AFAICT it isn't used. (It seems to assure you that the given size object has a recognized id, but I'm not sure where that would be uncertain...)</p>
<p>Seems like ServerCalculatorTestCase should have at least one test that ends up with a heterogeneous wish list, to make sure the job queue gets correctly translated to the cheapest qualifying wish list.</p>
<p>The rest LGTM.</p>

Updated by Peter Amstutz on 2015-11-16T15:14:11Z
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
<p>{azure,gce,ec2}.example.cfg should have the comments updated, and perhaps list two sizes each, to make it obvious how to specify them.</p>
<p>I assume the rule is "you'd better specify prices if you list more than one node size" -- that (or whatever the rule really is) should be noted in the comments, and the example(s) should have price(s) listed.</p>
<p>Should we also remove "(N.B.: defining more than one size has not been tested yet)."? Or is it too early...</p>
<p>I wonder if we should even sanity-check the size list, and error out if there are multiple node sizes listed but some have no prices? It seems like that could lead to "seems to be working fine, but then nodemanager restarts with a different random seed and sorts the "None" values differently and now it's spending lots of money"...</p>
<p>"max_nodes" seems to mean "max nodes of each type" now. I'm inclined to say it should mean max nodes total instead.</p>
</blockquote>
<p>It already means max total nodes: <code>total_up_count = self._nodes_up(None)</code> (where "None" means "all sizes").</p>
<blockquote>
<p>If we want a per-node-type limit it would be better if you could say "max 24 cheap nodes, max 4 expensive nodes", e.g., a separate max_nodes in each "size" section, in addition to the existing overall max_nodes. max_total_price might be a useful config, too. I think any one of these features is fine for a starting point, we don't need all of them.</p>
</blockquote>
<p>Setting a maximum number of nodes per size seems less than ideal. For example, because the wishlist provides exactly one size per job (already discussed above), if 24 cheap nodes were already booted and a job asked for 4 more cheap nodes, Node Manager would not be able to boot 4 more cheap nodes, but it wouldn't boot 4 expensive nodes either; the jobs would just idle in the queue.</p>
<p>However, I really like the idea of setting "max_total_price" (spending per hour), since that's what we're actually trying to control by reasoning about node sizes and limits.</p>
<blockquote>
<p>Whichever way we do it, we should have a test case for max_nodes and multiple node types.</p>
<p>The way "min_nodes" is implemented, it's effectively a per-size minimum that can only be applied to the cheapest size (i.e., if min_nodes==1 and there is 1 node running but it's not the cheapest size, we start a new cheap node). This seems reasonable enough, but it should be documented. <em>Probably defer: As with max_nodes, it might make more sense to move the config knob into the <code>[Size X]</code> section. That way it would be obvious what it does, and the feature would be more flexible.</em></p>
</blockquote>
<p>I can document it in the comments in the example configurations; I'm not sure where else. We don't have a setup guide for Node Manager.</p>
<blockquote>
<p>Is <code>find_size(self, sz)</code> useful? AFAICT it isn't used. (It seems to assure you that the given size object has a recognized id, but I'm not sure where that would be uncertain...)</p>
</blockquote>
<p>The purpose was to map from a libcloud NodeSize object to our internal CloudSizeWrapper object. However, I decided to tackle the problem differently but forgot to remove it.</p>
<blockquote>
<p>Seems like ServerCalculatorTestCase should have at least one test that ends up with a heterogeneous wish list, to make sure the job queue gets correctly translated to the cheapest qualifying wish list.</p>
</blockquote>
<p>Good idea.</p>
<blockquote>
<p>The rest LGTM.</p>
</blockquote>

Updated by Brett Smith on 2015-11-16T15:14:22Z
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
<p>I assume the rule is "you'd better specify prices if you list more than one node size" -- that (or whatever the rule really is) should be noted in the comments, and the example(s) should have price(s) listed.</p>
</blockquote>
<p>Most cloud drivers provide cost information through the API. The Azure driver currently doesn't support this, but the others do.</p>
<blockquote>
<p>I wonder if we should even sanity-check the size list, and error out if there are multiple node sizes listed but some have no prices?</p>
</blockquote>
<p>Assuming the code uses the cost information provided by drivers above, I'm inclined to say this is more trouble than it's worth. It would basically be a temporary workaround until the Azure driver learns to query cost information.</p>
<blockquote>
<p>"max_nodes" seems to mean "max nodes of each type" now. I'm inclined to say it should mean max nodes total instead.</p>
</blockquote>
<p>I think "total maximum" better matches the original intent of the knob, but the opinion I'm most interested in here is ops'.</p>
<blockquote>
<p>If we want a per-node-type limit it would be better if you could say "max 24 cheap nodes, max 4 expensive nodes", e.g., a separate max_nodes in each "size" section, in addition to the existing overall max_nodes.</p>
</blockquote>
<p>This can be a separate story.</p>

Updated by Peter Amstutz on 2015-11-16T21:28:33Z
<ul></ul><ul>
<li>Added max_total_price</li>
<li>Updated example configs.</li>
<li>Max nodes is already maximum total nodes, behavior unchanged</li>
<li>Documented behavior of min_nodes (default to smallest) in comment of example configs.</li>
<li>Added more tests</li>
</ul>
<p>Now at <a class="changeset" title="5353: Added max_total_price. Added more tests for multiple node sizes. Updated config file examp..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f16505d89d60d3bd5abc04762712f4ddf78e39fe">f16505d</a></p> Arvados - Idea #5353: [Node Manager] Support multiple node sizes and boot new nodes correctly from themhttps://dev.arvados.org/issues/5353?journal_id=324662015-11-16T22:01:51ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Bumped to <a class="changeset" title="5353: Fixes from testing with Dummy driver." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/ba9ea75ee6d6322f5d2a9da0b1c01c2738c0927d">ba9ea75</a> with fixes from manual integration testing with the dummy driver.</p>

Updated by Tom Clegg on 2015-11-17T17:16:50Z
<ul></ul><p>at <a class="changeset" title="5353: Fixes from testing with Dummy driver." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/ba9ea75ee6d6322f5d2a9da0b1c01c2738c0927d">ba9ea75</a></p>
<blockquote>
<p>Documented behavior of min_nodes (default to smallest) in comment of example configs</p>
</blockquote>
<p>I think the phrase "By default, this will be the cheapest node size" doesn't explain the behavior very well: that sounds to me like "if you don't specify otherwise, your min_nodes will be cheap nodes". I think we should state outright that: if we need to start new idle nodes for the purpose of satisfying min_nodes, we will use the cheapest node type; but, depending on usage patterns, we will also sometimes satisfy min_nodes by keeping alive some more-expensive nodes. (Assuming I'm reading the code correctly this time, of course.)</p>
<p>Rest of the above fixes LGTM, thanks. Some problems introduced with the total price limit, though.</p>
In the following block:
<ul>
<li>should be <code>(total_price + size.price * wanted)</code> ...?</li>
<li>should <code>return wanted</code> after the condition (I'm getting lots of test failures, I assume this is related)?</li>
<li>should return the maximum number of nodes in <code class="python syntaxhl"><span class="mi">0</span> <span class="o"><=</span> <span class="n">N</span> <span class="o"><=</span> <span class="n">wanted</span></code> without exceeding max_price, rather than "all or nothing"? (I'm imagining two 7-node jobs in the queue, we have 6 now, and the max we can afford is 10: it would be better to start 4 new nodes and make progress than to stay stuck at 6 forever.)</li>
</ul>
<pre><code class="python syntaxhl"> <span class="n">wanted</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_size_wishlist</span><span class="p">(</span><span class="n">size</span><span class="p">)</span> <span class="o">-</span> <span class="n">up_count</span>
<span class="k">if</span> <span class="n">wanted</span> <span class="o">></span> <span class="mi">0</span> <span class="ow">and</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_total_price</span> <span class="ow">and</span> <span class="p">((</span><span class="n">total_price</span> <span class="o">+</span> <span class="n">size</span><span class="p">.</span><span class="n">price</span><span class="p">)</span> <span class="o">></span> <span class="bp">self</span><span class="p">.</span><span class="n">max_total_price</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">_logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Not booting %s (price %s) because with it would exceed max_total_price of %s (current total_price is %s)"</span><span class="p">,</span>
<span class="n">size</span><span class="p">.</span><span class="n">name</span><span class="p">,</span> <span class="n">size</span><span class="p">.</span><span class="n">price</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_total_price</span><span class="p">,</span> <span class="n">total_price</span><span class="p">)</span>
<span class="k">return</span> <span class="mi">0</span>
<span class="k">return</span>
</code></pre>
<p>ServerCalculator.servers_for_queue checks <code class="python syntaxhl"><span class="n">want_count</span> <span class="o"><=</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_nodes</span></code>, and I'm assuming this is how we avoid the situation where a job wants 12, max_nodes is 10, and we leave 10 running and spend forever wishing for 2 more. How can we avoid the analogous "expensive deadlock" situation with max_price?</p>
<p>Let's add a test case along the lines of ServerCalculatorTestCase.test_ignore_unsatisfiable_jobs to ensure we don't do anything horrible when a job costs more than max_total_price.</p>
<p>Given the busywait, is the assertEqual here superfluous, or is there something more subtle happening? I suppose stop_proxy here is where we would say "wait for all messages to be processed" if we had such a thing. Does this code really assure us node_setup.start won't get called a fourth time, or would a "max_price ignored" bug make this test outcome start depending on a race? (If so, it's worth a comment to help future debugging/understanding.)</p>
<pre><code class="python syntaxhl"> <span class="bp">self</span><span class="p">.</span><span class="n">busywait</span><span class="p">(</span><span class="k">lambda</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">node_setup</span><span class="p">.</span><span class="n">start</span><span class="p">.</span><span class="n">call_count</span> <span class="o">==</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">booting</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">daemon</span><span class="p">.</span><span class="n">booting</span><span class="p">.</span><span class="n">get</span><span class="p">()</span>
<span class="bp">self</span><span class="p">.</span><span class="n">stop_proxy</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">daemon</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">node_setup</span><span class="p">.</span><span class="n">start</span><span class="p">.</span><span class="n">call_count</span><span class="p">)</span>
</code></pre>
<p>This looks like a stray debug printf in test_node_max_price:</p>
<pre><code>logging.info(sizecounts)</code></pre>
<p>This comment means "there are multiple correct answers, but this is the one the current code gives us", right?</p>
<pre><code class="python syntaxhl"> <span class="c1"># The way the update_server_wishlist() works effectively results in a
</span> <span class="c1"># round-robin creation of one node of each size in the wishlist
</span> <span class="bp">self</span><span class="p">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">sizecounts</span><span class="p">[</span><span class="n">small</span><span class="p">.</span><span class="nb">id</span><span class="p">])</span>
<span class="bp">self</span><span class="p">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">sizecounts</span><span class="p">[</span><span class="n">big</span><span class="p">.</span><span class="nb">id</span><span class="p">])</span>
</code></pre>

Updated by Peter Amstutz on 2015-11-18T15:23:46Z
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
<p>at <a class="changeset" title="5353: Fixes from testing with Dummy driver." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/ba9ea75ee6d6322f5d2a9da0b1c01c2738c0927d">ba9ea75</a></p>
<blockquote>
<p>Documented behavior of min_nodes (default to smallest) in comment of example configs</p>
</blockquote>
<p>I think the phrase "By default, this will be the cheapest node size" doesn't explain the behavior very well: that sounds to me like "if you don't specify otherwise, your min_nodes will be cheap nodes". I think we should state outright that: if we need to start new idle nodes for the purpose of satisfying min_nodes, we will use the cheapest node type; but, depending on usage patterns, we will also sometimes satisfy min_nodes by keeping alive some more-expensive nodes. (Assuming I'm reading the code correctly this time, of course.)</p>
</blockquote>
<p>Updated example configs with a more explicit comment.</p>
<blockquote>
<p>Rest of the above fixes LGTM, thanks. Some problems introduced with the total price limit, though.</p>
In the following block:
<ul>
<li>should be <code>(total_price + size.price * wanted)</code> ...?</li>
</ul>
</blockquote>
<p>Because <code>start_node()</code> only boots one node at a time and then rechecks <code>_nodes_wanted()</code>, checking the price of adding one additional node at a time achieves the desired behavior. But I agree that it is a little confusing (fixed as noted below).</p>
<blockquote>
<ul>
<li>should <code>return wanted</code> after the condition (I'm getting lots of test failures, I assume this is related)?</li>
</ul>
</blockquote>
<p>You're right, that's a typo. Shame on me for not running the tests again.</p>
<blockquote>
<ul>
<li>should return the maximum number of nodes in <code class="python syntaxhl"><span class="mi">0</span> <span class="o"><=</span> <span class="n">N</span> <span class="o"><=</span> <span class="n">wanted</span></code> without exceeding max_price, rather than "all or nothing"? (I'm imagining two 7-node jobs in the queue, we have 6 now, and the max we can afford is 10: it would be better to start 4 new nodes and make progress than to stay stuck at 6 forever.)</li>
</ul>
</blockquote>
<p>Now it adjusts the nodes wanted based on the price cap (so if it wants up to 6 but can only afford 4, it will return 4).</p>
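<p>A standalone sketch of that clamping behavior (the function name and arguments are hypothetical, not the actual commit):</p>
<pre><code class="python">def affordable_node_count(wanted, size_price, total_price, max_total_price):
    # Clamp the number of nodes to boot so projected hourly spend stays at
    # or under max_total_price; with no cap configured, boot everything wanted.
    if wanted <= 0 or not max_total_price:
        return max(wanted, 0)
    budget = max_total_price - total_price
    affordable = int(budget // size_price) if size_price else wanted
    return max(min(wanted, affordable), 0)

# e.g. want 6 more nodes at $0.50/hr, already spending $8/hr, cap $10/hr -> 4
assert affordable_node_count(6, 0.50, 8.0, 10.0) == 4
</code></pre>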
<blockquote>
<p>ServerCalculator.servers_for_queue checks <code class="python syntaxhl"><span class="n">want_count</span> <span class="o"><=</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_nodes</span></code>, and I'm assuming this is how we avoid the situation where a job wants 12, max_nodes is 10, and we leave 10 running and spend forever wishing for 2 more. How can we avoid the analogous "expensive deadlock" situation with max_price?</p>
</blockquote>
<p>Addressed by taking the same approach as max_nodes, checking <code>(want_count * price <= max_price)</code>.</p>
<blockquote>
<p>Let's add a test case along the lines of ServerCalculatorTestCase.test_ignore_unsatisfiable_jobs to ensure we don't do anything horrible when a job costs more than max_total_price.</p>
</blockquote>
<p>Done.</p>
<blockquote>
<p>Given the busywait, is the assertEqual here superfluous, or is there something more subtle happening? I suppose stop_proxy here is where we would say "wait for all messages to be processed" if we had such a thing. Does this code really assure us node_setup.start won't get called a fourth time, or would a "max_price ignored" bug make this test outcome start depending on a race? (If so, it's worth a comment to help future debugging/understanding.)</p>
</blockquote>
<p>The busywait already asserts the expected condition when it comes out of the loop, so the additional assertion is unnecessary. A "max_price ignored" bug resulting in an additional node being created will result in a test failure (just tried it). It's potentially still a little racy, but I think it's the best we can do without more help from Pykka (Brett and I discussed this; unfortunately there's no "wait until actor is quiescent and then stop" method).</p>
<blockquote>
<p>This looks like a stray debug printf in test_node_max_price:</p>
<p>logging.info(sizecounts)</p>
</blockquote>
<p>Fixed.</p>
<blockquote>
<p>This comment means "there are multiple correct answers, but this is the one the current code gives us", right?</p>
</blockquote>
<p>Yes, I expanded the comment to explain that better.</p>
<p>Now at <a class="changeset" title="5353: Remove extra assertion because busywait does it for us." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/b2ca3a093f36370719d4a6e89ba46a45111f4f63">b2ca3a0</a></p> Arvados - Idea #5353: [Node Manager] Support multiple node sizes and boot new nodes correctly from themhttps://dev.arvados.org/issues/5353?journal_id=325772015-11-18T17:07:09ZTom Cleggtom@curii.com
<ul></ul><p>LGTM, thanks.</p>

Updated by Peter Amstutz on 2015-11-18T17:15:09Z
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p>Applied in changeset arvados|commit:58692c916bb6dfe2997838ca4147109d9410c86a.</p>

Updated by Peter Amstutz on 2015-11-19T17:22:14Z
<ul><li><strong>Status</strong> changed from <i>Resolved</i> to <i>In Progress</i></li></ul><p>Reopening because it turns out that the EC2 and Azure drivers don't populate the libcloud <code>Node.size</code> field, but the code was written on the assumption that it would be there. The information is available in "extra"; we just need to get it out manually.</p>
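<p>A best-effort sketch of pulling the size out of <code>node.extra</code>. The exact keys vary by libcloud driver; the ones below are assumptions for illustration, not a tested mapping:</p>
<pre><code class="python">def node_size_id(node):
    # Prefer Node.size when the driver populates it; otherwise look in the
    # driver-specific extra dict.
    if node.size:
        return node.size
    extra = node.extra or {}
    for key in ('instance_type',      # EC2-style drivers (assumed key)
                'instance_size',      # some Azure drivers (assumed key)
                'machineType'):       # GCE-style drivers (assumed key)
        if key in extra:
            return extra[key]
    return None
</code></pre>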
Updated by Brett Smith on 2015-12-02T19:21:24Z

<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p>The bug noted in Peter's last comment has since been fixed.</p>
<p>The GCE driver code has not been tested but we'll handle that separately. Right now we don't even have a place to test it.</p>