h1. Crunch v2 cloud scheduling

Options:

h2. SLURM (no node sharing)

Don't try to share nodes; run one container per node.

Extend the existing "want list" logic in node manager to include queued/locked/running containers.
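
A sketch of the underlying query, assuming the Arvados Python SDK; the container states are real, but folding the results into per-size node counts is left out:

<pre><code class="python">
import arvados

# Sketch: list every container that should factor into the want list.
api = arvados.api('v1')
containers = arvados.util.list_all(
    api.containers().list,
    filters=[['state', 'in', ['Queued', 'Locked', 'Running']]],
    select=['uuid', 'state', 'runtime_constraints'])

for c in containers:
    rc = c.get('runtime_constraints', {})
    print(c['uuid'], c['state'], rc.get('vcpus'), rc.get('ram'))
</code></pre>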

Tasks: update node manager; disable node sharing in slurm config.
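
One possible way to turn off node sharing on the slurm side (a sketch; partition and node names are placeholders):

<pre><code>
# slurm.conf (sketch)
# select/linear allocates whole nodes, and Shared=EXCLUSIVE
# (OverSubscribe=EXCLUSIVE in newer slurm) gives each job a node to itself.
SelectType=select/linear
PartitionName=compute Nodes=compute[0-255] Default=YES Shared=EXCLUSIVE
</code></pre>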

h2. SLURM (support node sharing)

https://slurm.schedmd.com/elastic_computing.html

For each node type, list a range of nodes in slurm.conf.

Nodes are in the "CLOUD" state, which hides them from sinfo.

Slurm calls "ResumeProgram" and "SuspendProgram" with the nodename when it wants a node or is done with one.  These programs are responsible for creating and destroying cloud nodes.
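
A sketch of the slurm.conf side (node names, sizes, and counts are illustrative; the parameter names come from the elastic computing documentation):

<pre><code>
# slurm.conf (sketch)
# Programs slurm runs to create/destroy cloud nodes (paths are placeholders).
ResumeProgram=/usr/local/bin/crunch-resume-node
SuspendProgram=/usr/local/bin/crunch-suspend-node
# Destroy nodes idle for 5 minutes; allow 10 minutes for a node to boot.
SuspendTime=300
ResumeTimeout=600

# One block of names per node size; State=CLOUD keeps them out of sinfo
# until they actually exist.
NodeName=small[0-63] CPUs=2 RealMemory=7680 State=CLOUD
NodeName=large[0-63] CPUs=16 RealMemory=61440 State=CLOUD
PartitionName=compute Nodes=small[0-63],large[0-63] Default=YES
</code></pre>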

"ResumeProgram" maps the nodename to the node type and tells the cloud to create a new node (which must be assigned the provided hostname).  This could involve communicating with node manager, or writing new programs that do one-off node creation and deletion.

If we use node manager, it needs a mechanism for signaling that specific nodes should be up or down.  The current "want list" only provides node sizes, so the "want list" must provide (hostname, nodesize) pairs instead.
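
Roughly, the shape of that change (illustrative values only):

<pre><code class="python">
# Today the want list is effectively a multiset of node sizes:
want_list = ['Standard_D2_v2', 'Standard_D2_v2', 'Standard_D16_v2']

# With slurm elastic computing it has to name the specific slurm node,
# so node manager knows exactly which host to bring up or shut down:
want_list = [('small3', 'Standard_D2_v2'),
             ('small4', 'Standard_D2_v2'),
             ('large7', 'Standard_D16_v2')]
</code></pre>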

Tasks are either:

* Determine how to communicate desired node state to node manager; update node manager; ResumeProgram/SuspendProgram are simple clients that just set the "desired up state" flag.
* Write new standalone ResumeProgram/SuspendProgram programs that do one-off node creation and deletion.

h2. Home-grown

h2. Something else

Mesos, Kubernetes, OpenLava, etc.

Unknown amount of effort.