[Nodemanager] Explicit node record states
Proposed node record states
- Requested - create request for node size X will be sent
- Assigned - create request returned a cloud node id, waiting to pair
- Paired - cloud node has pinged the API server, indicating it has completed initialization, and is busy or idle (ready to accept work)
  - don't record every idle<-->busy transition (SLURM is the source of truth here)
  - from here, the transition table in nodemanager decides when to go into the Draining or Shutdown state, based on:
    - node status: busy/down/idle/unpaired
    - shutdown window open/closed (AWS billing optimization, could be removed)
    - boot wait or boot exceeded
    - idle wait or idle exceeded (how long to wait for more work, currently not implemented)
- Draining - will set "drain" state in SLURM, wait for work to complete
- Shutdown - shutdown request will be sent
- Gone - corresponding cloud node is no longer present in the cloud nodes table, record can be safely deleted.
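The states and the inputs to the transition table could be sketched roughly as follows. This is only an illustration under stated assumptions: the `NodeState` enum, the `next_state` function, and the specific rules (for example, that a down node skips draining, or that draining starts only while the shutdown window is open) are hypothetical and not taken from the existing nodemanager code.

```python
from enum import Enum


class NodeState(Enum):
    REQUESTED = "Requested"
    ASSIGNED = "Assigned"
    PAIRED = "Paired"
    DRAINING = "Draining"
    SHUTDOWN = "Shutdown"
    GONE = "Gone"


def next_state(state, status, window_open, boot_exceeded, idle_exceeded):
    """Return the state a node record should move to (unchanged if no rule fires).

    status is one of "busy", "down", "idle", "unpaired".
    """
    if state is NodeState.ASSIGNED:
        # Assumption: a node that never pairs within the boot wait is shut down.
        if status == "unpaired" and boot_exceeded:
            return NodeState.SHUTDOWN
    elif state is NodeState.PAIRED:
        # Assumption: a down node has no work to wait for, so it skips draining.
        if status == "down":
            return NodeState.SHUTDOWN
        # Assumption: drain only idle nodes past the idle wait, and only
        # while the shutdown window is open.
        if status == "idle" and idle_exceeded and window_open:
            return NodeState.DRAINING
    elif state is NodeState.DRAINING:
        # Draining waits for work to complete; "idle" stands in for that here.
        if status in ("idle", "down"):
            return NodeState.SHUTDOWN
    return state
```

Busy/idle changes alone never alter the recorded state in this sketch, which matches the intent of not recording every idle<-->busy transition.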
API server gets explicit "state", "node_size", and "cloud_node" columns in the nodes table.
Node manager determines its next action based on the state in the nodes table, and is responsive to external changes to that state. To create a new node, create a node record in the "Requested" state. To shut down a node, set its state to "Draining" or "Shutdown".
Wishlist items are fulfilled by creating a new node record in the "Requested" state.
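A minimal sketch of how records could drive the node manager, using a plain list of dicts as a stand-in for the nodes table. The column names come from the proposal; the action names in `ACTIONS` and the helper functions are hypothetical placeholders, not existing nodemanager or API server calls.

```python
# Hypothetical action names, one per proposed state.
ACTIONS = {
    "Requested": "send_create_request",
    "Assigned": "wait_for_pairing",
    "Paired": "consult_transition_table",
    "Draining": "set_slurm_drain_and_wait",
    "Shutdown": "send_shutdown_request",
    "Gone": "delete_record",
}


def fulfill_wishlist(nodes, wishlist):
    """Add one record in the "Requested" state per wishlist entry (a node size)."""
    for size in wishlist:
        nodes.append({"state": "Requested", "node_size": size, "cloud_node": None})
    return nodes


def request_shutdown(record, drain_first=True):
    """External shutdown request: just set the state and let the node manager react."""
    record["state"] = "Draining" if drain_first else "Shutdown"


def next_action(record):
    """Pick the node manager's next action purely from the recorded state."""
    return ACTIONS[record["state"]]


nodes = fulfill_wishlist([], ["small", "large"])
request_shutdown(nodes[1], drain_first=False)
actions = [next_action(n) for n in nodes]
```

Because the action depends only on the stored state, an operator or another service can change a record's state directly and the node manager picks it up on its next pass.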