Crunch v2 cloud scheduling
SLURM (no node sharing)
Don't try to share nodes; run one container per node.
Extend the existing "want list" logic in node manager to include queued/locked/running containers.
Tasks: update node manager; disable node sharing in slurm config.
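As a rough illustration of the extended "want list", the sketch below counts one node per container that is queued, locked, or running. The `Container` class and its `node_size` field are hypothetical stand-ins, not node manager's actual data model.

```python
from collections import Counter
from dataclasses import dataclass

# Container states that should each be backed by a node (one container per node).
ACTIVE_STATES = {"Queued", "Locked", "Running"}


@dataclass
class Container:
    uuid: str
    state: str
    node_size: str  # hypothetical: the node size selected for this container


def want_list(containers):
    """Return {node_size: count}, counting one node per active container."""
    counts = Counter(c.node_size for c in containers if c.state in ACTIVE_STATES)
    return dict(counts)
```

Containers in any other state (for example, completed ones) contribute nothing, so the want list shrinks as work finishes.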
SLURM (support node sharing)
For each node type, list a range of nodes in slurm.conf.
Nodes are in the "CLOUD" state, which hides them from sinfo.
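One way to produce the per-type node ranges is to generate the slurm.conf lines from a table of node types. This is only a sketch: the node type names, sizes, counts, and the "compute-" hostname prefix are made-up assumptions, while `NodeName`, `CPUs`, `RealMemory`, and `State=CLOUD` are standard slurm.conf node parameters.

```python
# Hypothetical node types: name -> (CPUs, RealMemory in MB, number of node slots)
NODE_TYPES = {
    "small": (2, 7500, 64),
    "large": (16, 60000, 16),
}


def slurm_node_lines(node_types, prefix="compute"):
    """Build one slurm.conf NodeName line per node type, covering a range of nodes."""
    lines = []
    for name, (cpus, mem_mb, count) in sorted(node_types.items()):
        lines.append(
            f"NodeName={prefix}-{name}[0-{count - 1}] "
            f"CPUs={cpus} RealMemory={mem_mb} State=CLOUD"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(slurm_node_lines(NODE_TYPES)))
```

Encoding the node type in the hostname keeps the later nodename-to-type mapping trivial.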
Slurm calls "ResumeProgram" and "SuspendProgram" with the node name (a hostlist expression when several nodes are involved) when it wants a node or is done with one. These programs are responsible for creating and destroying cloud nodes.
"ResumeProgram" maps the nodename to the node type and tells the cloud to create a new node (which must be assigned the provided hostname). This could involve communicating with node manager, or we could write new programs that do one-off node creation and deletion.
If we use node manager, it needs a mechanism for signaling that specific nodes should be up or down. The current "want list" only provides node sizes, so it would have to provide (hostname, nodesize) pairs instead.
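To make the (hostname, nodesize) idea concrete, here is a minimal sketch of how node manager might compare such a want list against the nodes that currently exist. The function name and data shapes are assumptions for illustration, not existing node manager code.

```python
def reconcile(want, running):
    """Compare a want list of (hostname, nodesize) pairs with running hostnames.

    want:    list of (hostname, nodesize) pairs that should be up
    running: set of hostnames that currently exist in the cloud
    Returns (to_create, to_destroy).
    """
    wanted_hosts = {hostname for hostname, _ in want}
    # Wanted but missing: create with the requested size and hostname.
    to_create = [(h, size) for h, size in want if h not in running]
    # Running but no longer wanted: shut down.
    to_destroy = sorted(running - wanted_hosts)
    return to_create, to_destroy
```

With a size-only want list the second step is ambiguous, because any node of the right size would do; keying on hostname removes that ambiguity.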
Tasks are either:
- Determine how to communicate desired node state to node manager; update node manager; ResumeProgram/SuspendProgram are simple clients that just set the "desired up state" flag.
- Write new ResumeProgram/SuspendProgram programs
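Either option needs a small ResumeProgram entry point. The sketch below assumes hostnames of the form "compute-<type><index>" and takes a hypothetical `create_node` callback, which could be a cloud API call or a request that sets the "desired up state" flag in node manager. The hostlist expander only handles a single bracket group and does not preserve zero padding; on a real cluster, `scontrol show hostnames` can do the expansion instead.

```python
import re
import sys


def expand_hostlist(expr):
    """Expand a simple hostlist such as 'compute-small[0-2,5]' into hostnames."""
    m = re.fullmatch(r"([^\[\]]+)\[([\d,\-]+)\]", expr)
    if not m:
        return [expr]  # a plain hostname with no bracket group
    prefix, ranges = m.groups()
    hosts = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        for i in range(int(lo), int(hi or lo) + 1):
            hosts.append(f"{prefix}{i}")
    return hosts


def node_type_of(hostname):
    """Map a hostname to its node type (assumed scheme: compute-<type><index>)."""
    m = re.fullmatch(r"compute-([a-z]+)\d+", hostname)
    if not m:
        raise ValueError(f"unrecognized hostname: {hostname}")
    return m.group(1)


def resume(expr, create_node):
    """Request one node per hostname in the expression passed by Slurm."""
    for host in expand_hostlist(expr):
        create_node(hostname=host, node_type=node_type_of(host))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Placeholder action: a real program would call the cloud or node manager here.
    def print_request(hostname, node_type):
        print(f"would create {node_type} node named {hostname}")

    resume(sys.argv[1], print_request)
```

A SuspendProgram would follow the same shape, with a destroy callback in place of `create_node`.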
Mesos, Kubernetes, OpenLava, etc.
Unknown amount of effort.