Idea #5353

closed

[Node Manager] Support multiple node sizes and boot new nodes correctly from them

Added by Bryan Cosca almost 10 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Node Manager
Target version:
Start date: 03/02/2015
Due date:
Story points: 3.0

Description

Correct me if I'm wrong, but if we're in the cloud, we can pick the specs we want for each node to save on compute costs, because I'm betting that more RAM costs more money. I doubt this could be allocated dynamically, but through trial and error a bioinformatician should know how much they need to allocate.

For example:
Assume job 1 requires 1 node with 50 GB of RAM, 2 cores, and 100 GB of local space.
Assume job 2 requires 2 nodes with 10 GB of RAM, 5 cores, and 500 GB of local space.

Implementation

The Node Manager daemon currently treats the node size wishlist as homogeneous. For this change, it effectively needs to consider each size to be a separate wishlist, and make boot/shutdown decisions accordingly.

For each size S:

  • If there are more S nodes in the wishlist than S idle nodes running in the cloud, make sure a new S is booting.
  • If an S node is eligible for shutdown, and there are more S idle nodes running in the cloud than there are in the wishlist, start shutting down the node.
  • I'm not sure how often this will come up, but if it ever makes sense: it would generally be better to act on requests for smaller sizes before larger ones. This will help ensure that jobs that can fit in smaller nodes are dispatched to them, helping keep larger nodes available for jobs that actually require them. We understand that, due to limitations in Crunch, we won't always get the most cost-effective match, and that's fine. This change to Node Manager will make it easier for us to improve Crunch later.
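
The per-size rules above could be sketched roughly as follows. This is only an illustration, not the Node Manager code: `NodeSize`, `plan_actions`, and the shape of the inputs are hypothetical, and the sketch assumes sizes can be ordered by price so that requests for smaller sizes are handled first. It reads "make sure a new S is booting" literally, as "at least one S node is booting".

```python
from collections import Counter
from typing import NamedTuple


class NodeSize(NamedTuple):
    """Hypothetical size record; the real daemon's size objects may differ."""
    name: str
    price: float  # assumed hourly price, used only to order sizes


def plan_actions(sizes, wishlist, idle, booting, shutdown_eligible):
    """Return a list of (action, size_name) tuples, treating each size separately.

    wishlist, idle, booting, shutdown_eligible: Counters keyed by size name.
    """
    actions = []
    # Handle cheaper (assumed smaller) sizes before more expensive ones.
    for size in sorted(sizes, key=lambda s: s.price):
        wanted = wishlist[size.name]
        idle_now = idle[size.name]
        if wanted > idle_now:
            # More requested than idle: make sure one node of this size is booting.
            if booting[size.name] == 0:
                actions.append(("boot", size.name))
        else:
            # Shut down eligible nodes only while idle nodes exceed the wishlist.
            surplus = idle_now - wanted
            for _ in range(min(surplus, shutdown_eligible[size.name])):
                actions.append(("shutdown", size.name))
    return actions


sizes = [NodeSize("large", 0.80), NodeSize("small", 0.10)]
actions = plan_actions(
    sizes,
    wishlist=Counter({"small": 2}),
    idle=Counter({"small": 1, "large": 1}),
    booting=Counter(),
    shutdown_eligible=Counter({"large": 1}),
)
print(actions)
```

In this made-up scenario the small size is short by one idle node, so a boot is planned for it, while the large size has one idle node nobody asked for, so it is shut down.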

Wherever the daemon currently accounts for booting or shutting-down nodes in its math, you will need to do the same, but filtered by size. This might be a reasonable time to refactor the daemon's internal data structures to make this easier.
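
One way the size-filtered accounting might look is to keep counts keyed by size rather than single totals. The sketch below is an assumption about a possible refactoring, not the daemon's actual data structures: `CloudNode`, its fields, the state names, and `counts_by_size` are all hypothetical.

```python
from collections import Counter, namedtuple

# Hypothetical record; assumed fields, not the daemon's actual node object.
CloudNode = namedtuple("CloudNode", ["node_id", "size", "state"])


def counts_by_size(nodes, state):
    """Count the nodes in the given state, keyed by size name."""
    return Counter(n.size for n in nodes if n.state == state)


nodes = [
    CloudNode("n1", "small", "idle"),
    CloudNode("n2", "small", "booting"),
    CloudNode("n3", "large", "idle"),
    CloudNode("n4", "large", "shutting_down"),
]

booting = counts_by_size(nodes, "booting")
shutting_down = counts_by_size(nodes, "shutting_down")
idle = counts_by_size(nodes, "idle")

# A Counter returns 0 for sizes with no matching nodes, so per-size
# arithmetic needs no special case for missing keys.
print(booting["small"], booting["large"])
```

Keeping one counter per state means each existing calculation can stay the same shape and simply index by size.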


Subtasks 5 (0 open, 5 closed)

Task #7678: Write tests (Resolved, Peter Amstutz, 03/02/2015)
Task #7681: Update documentation (Resolved, Peter Amstutz, 03/02/2015)
Task #7841: Test pipeline with multiple node sizes (Resolved, Peter Amstutz, 03/02/2015)
Task #7679: Review 5353-node-sizes (Resolved, Peter Amstutz, 03/02/2015)
Task #7677: Refactor to support multiple node types (Resolved, Peter Amstutz, 03/02/2015)