Bug #13804

Updated by Peter Amstutz about 2 years ago

There seems to be unnecessary churn in Node Manager: nodes are shut down and new nodes are booted to replace them, even when the wishlist still includes nodes of that size.

Nico reports nodes in the "draining" state that are in the middle of running jobs. This is presumably a race caused by lag between the node being reported as "idle", a job being scheduled on the node, and Node Manager initiating shutdown (which first puts the node into "draining").

Two ideas to reduce churn:

1) Add an "idle grace period" so that nodes have to be idle for a few minutes before they enter shutdown state.

2) After a node enters the "draining" state (meaning it isn't running anything and won't take any new work), check back with the Daemon actor to see whether nodes of that size are still wanted, and if so, cancel the shutdown.
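
A rough sketch of how the two ideas might fit together, not a patch against the actual Node Manager code. The class name, method names, the 300-second default and the `wanted_count` argument are all hypothetical; in practice the wanted count would come from whatever the Daemon actor exposes for that node size.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NodeShutdownPolicy:
    """Hypothetical helper, not part of Node Manager's real API."""

    # Assumed default; the ticket only says "a few minutes".
    idle_grace_seconds: float = 300.0
    # Injectable clock so the policy can be tested without sleeping.
    clock: Callable[[], float] = time.monotonic
    # Time at which the node was first seen idle, or None if not idle.
    idle_since: Optional[float] = None

    def observe(self, state: str) -> None:
        """Record the latest reported state of the node."""
        if state == "idle":
            # Start the timer only on the first idle report in a row.
            if self.idle_since is None:
                self.idle_since = self.clock()
        else:
            # Any non-idle report restarts the grace period.
            self.idle_since = None

    def may_begin_shutdown(self) -> bool:
        """Idea 1: allow shutdown only after a continuous idle period."""
        if self.idle_since is None:
            return False
        return self.clock() - self.idle_since >= self.idle_grace_seconds

    def should_cancel_shutdown(self, state: str, wanted_count: int) -> bool:
        """Idea 2: cancel if a draining node's size is still wanted."""
        return state == "draining" and wanted_count > 0
```

The idle timer resets on every non-idle report, so a node that picks up a job during the grace period is never drained. That addresses the lag described above only to the extent that the grace period is longer than the state-reporting delay.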