Bug #13804

Updated by Peter Amstutz over 1 year ago

There seems to be unnecessary churn in Node Manager: nodes are shut down and new nodes are booted to replace them, even when the wishlist still contains requests for that node size.

Nico reports nodes in the "draining" state that are in the middle of running jobs. This is presumably a race caused by lag between three events: the node being reported as "idle", a job being scheduled on the node, and Node Manager initiating shutdown (which first puts the node into "draining").

Two ideas to reduce churn:

1) Add an "idle grace period" so that nodes have to be idle for a few minutes before they enter shutdown state. A variant of this would be a rule like "node must be reported as idle 2-3 reports in a row" before being eligible for shutdown.

2) After a node enters the "drain" state (meaning it isn't running anything, but won't take any new work), check back with the Daemon actor to see if nodes are wanted for that node size, and if so, cancel shutdown.
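
To make the two ideas concrete, here is a minimal sketch in Python. It is not Node Manager's actual code: the class IdleTracker, the function should_cancel_shutdown, the threshold of 3 reports, and the shape of the wishlist (a plain list of size names) are all assumptions made for illustration.

```python
from typing import Dict, Iterable


class IdleTracker:
    """Idea 1 (sketch): a node becomes eligible for shutdown only after
    it has been reported idle several times in a row.

    Hypothetical helper; the names and the default threshold are assumptions.
    """

    def __init__(self, required_idle_reports: int = 3) -> None:
        self.required_idle_reports = required_idle_reports
        self._streaks: Dict[str, int] = {}  # node id -> consecutive idle reports

    def record(self, node_id: str, state: str) -> None:
        """Record one state report; any non-idle report resets the streak."""
        if state == "idle":
            self._streaks[node_id] = self._streaks.get(node_id, 0) + 1
        else:
            self._streaks[node_id] = 0

    def eligible_for_shutdown(self, node_id: str) -> bool:
        return self._streaks.get(node_id, 0) >= self.required_idle_reports


def should_cancel_shutdown(node_size: str, wishlist: Iterable[str]) -> bool:
    """Idea 2 (sketch): once a node has drained, cancel the shutdown if the
    wishlist still asks for a node of the same size.

    `wishlist` is assumed to be an iterable of wanted size names.
    """
    return node_size in set(wishlist)
```

With this sketch, a node that reports idle, idle, busy, idle is not yet eligible for shutdown, because the busy report resets its streak. The report-count rule is used here instead of a wall-clock grace period only because it needs no timer; either variant of idea 1 would fit the same interface.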
