Peter Amstutz, 12/07/2016 06:01 PM

h1. Crunch v2 cloud scheduling

Options:
h2. SLURM (no node sharing)

Don't try to share nodes; run one container per node.

Extend the existing "want list" logic in node manager to include queued, locked, and running containers.

Tasks: update node manager; disable node sharing in the SLURM configuration.
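As a rough illustration of the extended "want list", the sketch below emits one node size per active container, since nodes are not shared. The node sizes, the container record layout ("state", "runtime_constraints" with "vcpus" and "ram"), and the function name "want_list" are assumptions made for this example, not node manager's actual code.

```python
# Hypothetical node sizes, ordered smallest first: (name, cores, ram_bytes).
NODE_SIZES = [
    ("small", 2, 8 * 2**30),
    ("large", 16, 64 * 2**30),
]

# Container states that should count toward the want list.
ACTIVE_STATES = {"Queued", "Locked", "Running"}


def want_list(containers):
    """Return one node size name per active container (no node sharing)."""
    wanted = []
    for container in containers:
        if container["state"] not in ACTIVE_STATES:
            continue
        need = container["runtime_constraints"]
        # Pick the smallest size that satisfies the container's constraints.
        for name, cores, ram in NODE_SIZES:
            if cores >= need["vcpus"] and ram >= need["ram"]:
                wanted.append(name)
                break
        # Containers that fit no known size are skipped in this sketch.
    return wanted
```

Because each container gets its own node, the length of the list equals the number of nodes the cloud should be running.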
h2. SLURM (support node sharing)

https://slurm.schedmd.com/elastic_computing.html

For each node type, list a range of nodes in slurm.conf.

Nodes are in the "CLOUD" state, which hides them from sinfo.

Slurm calls "ResumeProgram" with the node name when it wants a node, and "SuspendProgram" when it is done with one. These programs are responsible for creating and destroying cloud nodes.
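A slurm.conf excerpt along these lines might look as follows. The parameter names are standard Slurm options, but the node names, counts, hardware figures, timeouts, and program paths are placeholders chosen for illustration only.

```
# slurm.conf excerpt (illustrative values; names and paths are assumptions)
ResumeProgram=/usr/local/bin/crunch-node-resume
SuspendProgram=/usr/local/bin/crunch-node-suspend
SuspendTime=600
ResumeTimeout=600

# One hostname range per node type
NodeName=small[0-63] CPUs=2 RealMemory=7500 Weight=1 State=CLOUD
NodeName=large[0-15] CPUs=16 RealMemory=60000 Weight=8 State=CLOUD
PartitionName=compute Nodes=small[0-63],large[0-15] Default=YES
```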
"ResumeProgram" maps the node name to a node type and tells the cloud to create a new node, which must be assigned the provided hostname. This could involve communicating with node manager, or we could write new programs that do one-off node creation and deletion.
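A minimal sketch of such a "ResumeProgram" is shown below, assuming hostnames of the form prefix plus index (for example small12) where the prefix identifies the node type. The prefix table, the instance size names, and "create_cloud_node" are hypothetical; the last is a stand-in for a real cloud API call or a request to node manager. Slurm may pass a hostlist expression rather than plain names, so a real program would expand it first (for example with scontrol show hostnames).

```python
import re
import sys

# Hypothetical mapping from hostname prefix to cloud instance size.
NODE_TYPES = {
    "small": "example-2cpu-7gb",
    "large": "example-16cpu-60gb",
}

HOSTNAME_RE = re.compile(r"^([a-z]+)(\d+)$")


def node_type_for(hostname):
    """Return the instance size for a hostname such as 'small12'."""
    match = HOSTNAME_RE.match(hostname)
    if match is None or match.group(1) not in NODE_TYPES:
        raise ValueError("unrecognized node name: %r" % hostname)
    return NODE_TYPES[match.group(1)]


def create_cloud_node(hostname, instance_size):
    """Placeholder for the cloud API call (or a request to node manager)."""
    print("would create %s as %s" % (hostname, instance_size))


def resume(hostnames):
    """Create one cloud node per requested hostname."""
    for hostname in hostnames:
        create_cloud_node(hostname, node_type_for(hostname))


if __name__ == "__main__":
    # Assumes the arguments are already expanded, plain hostnames.
    resume(sys.argv[1:])
```

A matching "SuspendProgram" would have the same shape, with the create call replaced by a destroy call.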
If we use node manager, it needs a mechanism for signaling that specific nodes should be up or down. The current "want list" only provides node sizes, so it would have to provide (hostname, nodesize) pairs instead.

The tasks are either:

* Determine how to communicate the desired node state to node manager; update node manager; make ResumeProgram/SuspendProgram simple clients that just set the "desired up state" flag.
* Write new ResumeProgram/SuspendProgram programs.
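To make the first option more concrete, here is one possible shape for a want list keyed by hostname. The names "set_desired_state" and "plan_actions" and the dictionary layout are assumptions for this sketch; how the flag would actually reach node manager is the open question noted in the task list.

```python
def set_desired_state(wanted, hostname, node_size, up):
    """Record that a node should be up, or that it is no longer needed.

    This is what a thin ResumeProgram/SuspendProgram client would do.
    """
    if up:
        wanted[hostname] = node_size
    else:
        wanted.pop(hostname, None)


def plan_actions(wanted, running):
    """Compare desired and actual nodes.

    wanted:  dict of hostname -> node size that should be up
    running: dict of hostname -> node size that is currently up
    Returns (to_create, to_destroy): a sorted list of (hostname, size)
    pairs to boot and a sorted list of hostnames to shut down.
    """
    to_create = sorted((h, s) for h, s in wanted.items() if h not in running)
    to_destroy = sorted(h for h in running if h not in wanted)
    return to_create, to_destroy
```

Keying on hostname rather than size lets node manager bring up exactly the node Slurm asked for, which the size-only want list cannot express.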
h2. Home-grown

h2. Something else

Mesos, Kubernetes, OpenLava, etc.

Unknown amount of effort.