Story #5353

Updated by Brett Smith over 4 years ago

Correct me if I'm wrong, but if we're in the cloud, we can pick the specs we want for each node in order to save on compute costs, since I'm betting that more RAM costs more money. I doubt this could be allocated dynamically, but with some trial and error, a bioinformatician should know how much they need to request.

For example:

* Job 1 requires 1 node with 50GB of RAM, 2 cores, and 100GB of local space.
* Job 2 requires 2 nodes with 10GB of RAM, 5 cores, and 500GB of local space.

h2. Implementation

The Node Manager daemon currently treats the node size wishlist as homogeneous. For this change, it effectively needs to consider each size to be a separate wishlist, and make boot/shutdown decisions accordingly.

For each size S:

* If there are more S nodes in the wishlist than S idle nodes running in the cloud, make sure a new S is booting.
* If an S node is eligible for shutdown, and there are more S idle nodes running in the cloud than there are in the wishlist, start shutting down the node.
* I'm not sure how often this will come up, but when there's a choice, it would generally be better to act on requests for smaller sizes before larger ones. This helps ensure that jobs that fit in smaller nodes are dispatched to them, keeping larger nodes available for jobs that actually require them. We understand that, due to limitations in Crunch, we won't always get the most cost-effective match, and that's fine. This change to Node Manager will make it easier for us to improve Crunch later.
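
The per-size rules above could be sketched roughly as follows. This is only an illustration, not the daemon's actual code: @Size@, @plan_actions@, and the argument names are hypothetical, sizes are ordered by RAM and then cores as a stand-in for "smaller", and the sketch assumes nodes that are already booting count toward satisfying the wishlist.

```python
from collections import Counter
from typing import NamedTuple


class Size(NamedTuple):
    """Hypothetical description of a cloud node size."""
    name: str
    cores: int
    ram_gb: int


def plan_actions(sizes, wishlist, idle, booting, shutdown_eligible):
    """Return a list of ("boot" | "shutdown", size_name) decisions.

    wishlist, idle, booting, and shutdown_eligible are lists of size
    names, one entry per node (all hypothetical inputs).
    """
    wanted = Counter(wishlist)
    idle_count = Counter(idle)
    booting_count = Counter(booting)
    eligible = Counter(shutdown_eligible)

    actions = []
    # Handle smaller sizes first so small requests are acted on
    # before large ones.
    for size in sorted(sizes, key=lambda s: (s.ram_gb, s.cores)):
        name = size.name
        # Assumption: a node that is already booting will soon be idle,
        # so it counts against the wishlist.
        if wanted[name] > idle_count[name] + booting_count[name]:
            actions.append(("boot", name))
        elif eligible[name] > 0 and idle_count[name] > wanted[name]:
            actions.append(("shutdown", name))
    return actions
```

Each call yields at most one action per size, on the assumption that the daemon re-evaluates its state repeatedly and converges over several passes.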

Wherever the daemon currently accounts for booting or shutting-down nodes in its math, you're going to have to do the same, but with the counts filtered by size. This might be a reasonable time to refactor the daemon's internal data structures to make this easier.
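
One possible shape for that refactoring, purely as a sketch: keep a tally keyed by size and then by state, so every count the daemon needs can be looked up per size. The @tally_by_size@ helper, the @(size, state)@ tuples, and the state names are assumptions made for illustration, not the daemon's existing structures.

```python
from collections import Counter, defaultdict


def tally_by_size(nodes):
    """Count nodes per size and state.

    nodes is an iterable of (size_name, state) pairs, where state is a
    hypothetical label such as "booting", "idle", or "shutting_down".
    """
    tally = defaultdict(Counter)
    for size, state in nodes:
        tally[size][state] += 1
    return tally


nodes = [
    ("small", "idle"),
    ("small", "booting"),
    ("large", "shutting_down"),
]
tally = tally_by_size(nodes)
# Sizes or states that were never seen simply count as zero.
small_up_soon = tally["small"]["idle"] + tally["small"]["booting"]
```

Because missing sizes and states read as zero, the per-size arithmetic does not need special cases for a size that currently has no nodes.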