Crunch v2 cloud scheduling » History » Version 1
Peter Amstutz, 12/07/2016 06:01 PM
h1. Crunch v2 cloud scheduling

Options:

h2. SLURM (no node sharing)

Don't try to share nodes; run one container per node.

Extend the existing "want list" logic in node manager to include queued/locked/running containers.

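A minimal sketch of what that extended computation could look like. All names and the container/size record formats here are hypothetical illustrations, not node manager's actual API:

```python
# Sketch: derive a "want list" of node sizes from all containers that
# are queued, locked, or running (hypothetical data shapes).
from collections import Counter

def want_list(containers, node_sizes):
    """Return a Counter of node sizes needed to satisfy all
    queued/locked/running containers."""
    wanted = Counter()
    for c in containers:
        if c["state"] in ("Queued", "Locked", "Running"):
            # Pick the smallest node size that satisfies the
            # container's runtime constraints (cores and RAM).
            fits = [s for s in node_sizes
                    if s["cores"] >= c["vcpus"] and s["ram"] >= c["ram"]]
            if fits:
                best = min(fits, key=lambda s: (s["cores"], s["ram"]))
                wanted[best["name"]] += 1
    return wanted

containers = [
    {"state": "Queued", "vcpus": 1, "ram": 2},
    {"state": "Running", "vcpus": 4, "ram": 8},
    {"state": "Complete", "vcpus": 1, "ram": 1},  # finished: ignored
]
node_sizes = [
    {"name": "small", "cores": 2, "ram": 4},
    {"name": "large", "cores": 8, "ram": 16},
]
print(want_list(containers, node_sizes))  # Counter({'small': 1, 'large': 1})
```
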
Tasks: update node manager; disable node sharing in the SLURM config.

h2. SLURM (support node sharing)

https://slurm.schedmd.com/elastic_computing.html

For each node type, list a range of nodes in slurm.conf.

Nodes are in the "CLOUD" state, which hides them from sinfo.

SLURM calls "ResumeProgram" and "SuspendProgram" with the node name when it wants a node or is done with one. These programs are responsible for creating and destroying cloud nodes.

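Putting the above together, a slurm.conf fragment for this approach might look like the following. The node names, counts, sizes, timeouts, and program paths are illustrative assumptions, not Arvados defaults:

```
# Illustrative slurm.conf fragment for elastic/cloud scheduling.
SuspendProgram=/usr/local/bin/suspend-node
ResumeProgram=/usr/local/bin/resume-node
SuspendTime=300      # seconds idle before SuspendProgram is called
ResumeTimeout=600    # seconds allowed for a node to boot and register

# One range of CLOUD-state nodes per cloud node size.
NodeName=small[0-63] CPUs=2 RealMemory=4096  State=CLOUD
NodeName=large[0-15] CPUs=8 RealMemory=16384 State=CLOUD
PartitionName=compute Nodes=small[0-63],large[0-15] Default=YES
```
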
"ResumeProgram" maps the node name to its node type and tells the cloud to create a new node (which must be assigned the provided hostname). This could involve communicating with node manager, or we could write new programs that do one-off node creation and deletion.

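As an illustration, a "ResumeProgram" along these lines could look like this sketch, assuming node names encode the node type as a prefix (e.g. "small12") and using a placeholder create_cloud_node() for the actual cloud (or node manager) call:

```python
#!/usr/bin/env python
# Sketch of a ResumeProgram: SLURM invokes it with a hostlist argument
# (e.g. "small[0-2],large5"); it maps each name to a node type and asks
# the cloud for a node assigned that hostname.
import re
import sys

def expand(hostlist):
    """Expand a SLURM hostlist like 'small[0-2],large5' into names."""
    names = []
    for m in re.finditer(r"(\w+?)\[(\d+)-(\d+)\]|(\w+)", hostlist):
        if m.group(4):
            names.append(m.group(4))
        else:
            prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
            names.extend("%s%d" % (prefix, i) for i in range(lo, hi + 1))
    return names

def node_type(name):
    """The node name's alphabetic prefix identifies the node size."""
    return re.match(r"[a-z]+", name).group(0)

def create_cloud_node(size, hostname):
    # Placeholder: in reality this would call the cloud provider's API
    # (or signal node manager) to boot a node with this hostname.
    print("creating %s node named %s" % (size, hostname))

if __name__ == "__main__":
    for name in expand(sys.argv[1]):
        create_cloud_node(node_type(name), name)
```

"SuspendProgram" would be the mirror image: expand the hostlist and destroy (or mark down) each named node.
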
If we use node manager, it needs a mechanism for signaling that specific nodes should be up or down. The current "want list" only provides node sizes, so the "want list" must provide (hostname, nodesize) pairs instead.

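For instance, the shape of that change might look like the following (hypothetical data; today's want list format differs in detail):

```python
# Today the want list is effectively a multiset of node sizes:
old_want_list = ["small", "small", "large"]

# To support SLURM elastic computing, each entry must also carry the
# specific hostname SLURM asked for, so node manager can boot a node
# with exactly that name:
new_want_list = [
    ("small0", "small"),
    ("small1", "small"),
    ("large5", "large"),
]

# Node manager would then reconcile cloud state against these pairs:
def nodes_to_create(want, existing_hostnames):
    return [(h, size) for (h, size) in want if h not in existing_hostnames]

print(nodes_to_create(new_want_list, {"small0"}))
# → [('small1', 'small'), ('large5', 'large')]
```
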
Tasks are either:

* Determine how to communicate desired node state to node manager; update node manager; ResumeProgram/SuspendProgram become simple clients that just set the "desired up state" flag.
* Write new standalone ResumeProgram/SuspendProgram programs.

h2. Home-grown

h2. Something else

Mesos, Kubernetes, OpenLava, etc.

Unknown amount of effort.