Crunch v2 cloud scheduling

Options:

SLURM (no node sharing)

Don't try to share nodes; run one container per node.

Extend existing "want list" logic in node manager to include queued/locked/running containers.
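
A minimal sketch of what the extended "want list" calculation might look like, assuming one node per container. The size table, the function names, and the shape of the container dicts are illustrative assumptions, not node manager's actual code.

```python
from collections import Counter

# Hypothetical node sizes, ordered smallest to largest: (name, vcpus, ram_bytes).
NODE_SIZES = [
    ("small", 2, 8 * 2**30),
    ("medium", 4, 16 * 2**30),
    ("large", 8, 32 * 2**30),
]

# Container states that should each hold one node.
WANTED_STATES = {"Queued", "Locked", "Running"}


def smallest_fit(vcpus, ram):
    """Return the name of the smallest size that satisfies the request, or None."""
    for name, size_vcpus, size_ram in NODE_SIZES:
        if size_vcpus >= vcpus and size_ram >= ram:
            return name
    return None


def build_want_list(containers):
    """Count one node per queued/locked/running container, grouped by size."""
    wanted = Counter()
    for c in containers:
        if c["state"] not in WANTED_STATES:
            continue
        rc = c["runtime_constraints"]
        size = smallest_fit(rc["vcpus"], rc["ram"])
        if size is not None:  # containers that fit no size are skipped here
            wanted[size] += 1
    return dict(wanted)
```

Counting running containers as well as queued ones keeps busy nodes in the want list, so they are not treated as surplus while work is still on them.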

Tasks: update node manager; disable node sharing in slurm config.

SLURM (support node sharing)

https://slurm.schedmd.com/elastic_computing.html

For each node type, list a range of nodes in slurm.conf.
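
As an illustration, the per-type node ranges could be generated rather than written by hand. This is a sketch only: the node type names, counts, and hardware figures are made up, and the naming scheme (type name followed by an index) is an assumption.

```python
# Hypothetical node types: name -> (count, cpus, real_memory_mb).
NODE_TYPES = {
    "small": (64, 2, 7500),
    "large": (16, 8, 30000),
}


def node_lines(node_types):
    """Render one slurm.conf NodeName line per node type, covering a range of names."""
    lines = []
    for name, (count, cpus, mem_mb) in node_types.items():
        lines.append(
            f"NodeName={name}[0-{count - 1}] State=CLOUD "
            f"CPUs={cpus} RealMemory={mem_mb}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(node_lines(NODE_TYPES)))
```

Each range sets the maximum number of nodes of that type that Slurm can ever request, so the counts act as a per-type cap.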

Nodes are in the "CLOUD" state, which hides them from sinfo until they are brought up.

Slurm calls "ResumeProgram" and "SuspendProgram" with the names of the affected nodes (as a hostlist expression) when it wants nodes or is done with them. These programs are responsible for creating and destroying cloud nodes.

"ResumeProgram" maps each nodename to its node type and tells the cloud to create a new node (which must be assigned the provided hostname). This could involve communicating with node manager, or we could write new programs that do one-off node creation and deletion.

If we use node manager, it needs a mechanism for signaling that specific nodes should be up or down. The current "want list" only provides node sizes, so it would have to provide (hostname, nodesize) pairs instead.
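
A sketch of how the (hostname, nodesize) pairs might be derived from node names, assuming the node type is encoded as a prefix of the hostname (e.g. "small12"). The naming scheme, the type names, and the function names are assumptions. The input is a list of already-expanded hostnames; a hostlist expression from Slurm can be expanded with `scontrol show hostnames`.

```python
import re

# Assumed naming scheme: "<node type><index>", e.g. "small12".
NODENAME_RE = re.compile(r"^([a-z]+)(\d+)$")
KNOWN_TYPES = {"small", "large"}  # hypothetical node types


def node_type_for(nodename):
    """Return the node type encoded in a node name, or raise ValueError."""
    m = NODENAME_RE.match(nodename)
    if m is None or m.group(1) not in KNOWN_TYPES:
        raise ValueError(f"unrecognized node name: {nodename!r}")
    return m.group(1)


def hostname_size_pairs(nodenames):
    """Build (hostname, node type) pairs for the nodes Slurm asked for."""
    return [(name, node_type_for(name)) for name in nodenames]
```

Rejecting unknown names outright, rather than guessing a default size, avoids creating a node that does not match what slurm.conf advertises for that hostname.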

Tasks are either:

  • Determine how to communicate desired node state to node manager; update node manager; ResumeProgram/SuspendProgram are simple clients that just set the "desired up state" flag.
  • Write new ResumeProgram/SuspendProgram programs that create and destroy cloud nodes directly.
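
For the first option, the clients could be as small as the sketch below. How the flag reaches node manager is still undecided, so a JSON file stands in for that channel here. The file format and function names are purely illustrative.

```python
import json
import os
import tempfile


def set_desired_state(path, hostnames, up):
    """Record whether each hostname should be up (True) or down (False)."""
    try:
        with open(path) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {}
    for host in hostnames:
        state[host] = up
    # Write to a temporary file and rename it, so readers never see a partial file.
    # Note: this sketch does no locking between concurrent writers.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)
    return state


def hosts_wanted_up(path):
    """Return the sorted hostnames currently flagged as wanted up."""
    with open(path) as f:
        state = json.load(f)
    return sorted(host for host, up in state.items() if up)
```

Under this sketch ResumeProgram would call `set_desired_state(path, hosts, True)` and SuspendProgram the same with `False`; node manager would read the flags and do the actual creation and deletion.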

Home-grown

Something else

Mesos, Kubernetes, OpenLava, etc.

Unknown amount of effort.