[Node Manager] [Crunch2] Take queued containers into account when computing how many nodes should be up
Add one node to the wishlist for each queued container, just like we currently add one (or more) nodes to the wishlist for queued jobs. While Crunch v2 will support running multiple containers per node, that's less critical in the cloud: as long as we can boot approximately the right size node, there's not too much overhead in just having one node per container. And it's something we can do relatively quickly with the current Node Manager code.
This won't be perfect from a scheduling perspective, especially in the interaction between Crunch v1 and Crunch v2. We expect that Crunch v2 jobs will generally "take priority" over Crunch v1 jobs, because SLURM will dispatch them from its own queue before crunch-dispatch has a chance to look and allocate nodes. We're OK with that limitation for the time being.
Node Manager should get the list of queued containers from SLURM itself, because that's the most direct source of truth about what is waiting to run. Node Manager can get information about the runtime constraints of each container either from SLURM, or from the Containers API.
- Node Manager can generate a wishlist that is informed by containers in the SLURM queue. (Whether that's the existing wishlist or a new one is an implementation detail, not an acceptance criterion either way.)
- The node sizes in that wishlist are the smallest able to meet the runtime constraints of the respective containers.
- The Daemon actor considers these wishlist items when deciding whether or not to boot or shut down nodes, just as it does with the wishlist generated from the job queue today.
- Node Manager will use sinfo to determine node status (alloc/idle/drained/down) instead of using the information from the nodes table. A Crunch v2 installation won't store node state in the nodes table; other tools like Workbench will be modified accordingly.
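The wishlist criteria above can be sketched roughly as follows. This is a minimal illustration, not Node Manager's actual code: `NodeSize`, `Request`, `parse_squeue`, `smallest_fitting_size` and `build_wishlist` are hypothetical names, "smallest" is assumed to mean "cheapest size that satisfies every constraint", and the squeue format string (`%c` minimum CPUs, `%m` minimum memory, `%d` minimum temporary disk) is one possible way to read runtime constraints from SLURM.

```python
import subprocess
from collections import namedtuple

# Hypothetical stand-ins for cloud node sizes and container requests.
NodeSize = namedtuple("NodeSize", ["name", "cores", "ram_mb", "scratch_mb", "price"])
Request = namedtuple("Request", ["cores", "ram_mb", "scratch_mb"])

def pending_queue_text():
    """Ask SLURM for pending jobs: min CPUs, min memory, min tmp disk."""
    return subprocess.check_output(
        ["squeue", "--state=PENDING", "--noheader", "--format=%c %m %d"],
        universal_newlines=True)

def to_mb(text):
    """Convert a size such as '1024', '1024M' or '2G' to megabytes."""
    units = {"K": 1.0 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)

def parse_squeue(text):
    """Turn squeue output into a list of Request tuples."""
    requests = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        cpus, mem, disk = fields
        requests.append(Request(int(cpus), to_mb(mem), to_mb(disk)))
    return requests

def smallest_fitting_size(request, sizes):
    """Return the cheapest size meeting all constraints, or None."""
    fitting = [s for s in sizes
               if s.cores >= request.cores
               and s.ram_mb >= request.ram_mb
               and s.scratch_mb >= request.scratch_mb]
    return min(fitting, key=lambda s: s.price) if fitting else None

def build_wishlist(requests, sizes):
    """One wishlist entry per queued container that fits some size."""
    wishlist = []
    for req in requests:
        size = smallest_fitting_size(req, sizes)
        if size is not None:
            wishlist.append(size)
    return wishlist
```

In this sketch a container that fits no size is simply left out of the wishlist; how such containers should be reported is a separate question.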
#5 Updated by Peter Amstutz over 5 years ago
I've been studying the slurm elastic computing capability in more detail. I'm now leaning towards continuing to use slurm in the cloud. I think some of the problems we've had with slurm are due to not implementing dynamic nodes in the recommended way.
Here's the gist of the configuration:
NodeName=small[0-255] Weight=1 Feature=cloud State=CLOUD CPUs=1 RealMemory=1024 TmpDisk=10000
NodeName=medium[0-255] Weight=2 Feature=cloud State=CLOUD CPUs=2 RealMemory=2048 TmpDisk=20000
NodeName=big[0-255] Weight=4 Feature=cloud State=CLOUD CPUs=4 RealMemory=4096 TmpDisk=40000
When slurm needs some nodes, it will call ResumeProgram with a list of the nodes it wants, e.g.
/usr/bin/create-node medium0 medium1 medium2 medium3
The create-node script is responsible for updating the node record so that slurmctld can establish communication with the node:
scontrol update nodename=medium0 nodeaddr=18.104.22.168 nodehostname=medium0
scontrol update nodename=medium1 nodeaddr=22.214.171.124 nodehostname=medium1
scontrol update nodename=medium2 nodeaddr=126.96.36.199 nodehostname=medium2
scontrol update nodename=medium3 nodeaddr=188.8.131.52 nodehostname=medium3
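A minimal sketch of the registration half of such a create-node script, written in Python. The function names are hypothetical, the step that actually boots the cloud nodes and learns their addresses is left out, and the `assignments` list is assumed to come from whatever cloud driver does that work.

```python
import subprocess

def scontrol_update_commands(assignments):
    """Build one 'scontrol update' argv per (node name, IP address) pair."""
    return [
        ["scontrol", "update",
         "nodename={}".format(name),
         "nodeaddr={}".format(addr),
         "nodehostname={}".format(name)]
        for name, addr in assignments
    ]

def register_nodes(assignments, run=subprocess.check_call):
    """Run each update; 'run' is injectable so the logic can be dry-run."""
    for argv in scontrol_update_commands(assignments):
        run(argv)
```

For a dry run, `register_nodes([("medium0", "192.0.2.10")], run=print)` prints the command that would be executed instead of calling scontrol (the address here is a documentation-range placeholder).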
After nodes are idle for some period (SuspendTime) they will be shut down:
/usr/bin/destroy-node medium0 medium1 medium2 medium3
It's unclear what the role of node manager should be in this setup. I see two possibilities:
1) Eliminate node manager as a separate daemon, make two separate programs out of NodeSetupActor and NodeShutdownActor and discard the rest. This has a number of failure modes, but I mentioned it because it might be quicker/less disruptive to implement (but possibly less robust in the long run.)
2) Keep the node manager daemon, but change the architecture so that the "nodes" table on the API server has a static list of nodes and includes a flag whether each node should be up or down (so in the above example, there would be 768 entries). The create-node and destroy-node programs could just use "arv" to set the flag for the appropriate node on the nodes table. I prefer this option; the nodes table would become the single source of truth about what should be up or down and then node manager is responsible for converging on the desired state. It might also make sense for node manager to write slurm.conf to ensure that it is sync'd up with the actual contents of the nodes table.
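To illustrate option 2, here is a rough sketch of the convergence step, assuming the nodes table can be read as a mapping from node name to a desired up/down flag and that the cloud driver can list the nodes that are actually running. `plan_convergence` and both inputs are hypothetical, not existing Node Manager interfaces.

```python
def plan_convergence(desired_up, actually_up):
    """Return (to_boot, to_shut_down) as sorted lists of node names.

    desired_up:  dict of node name -> bool flag from the nodes table
    actually_up: iterable of node names currently running in the cloud
    """
    wanted = {name for name, up in desired_up.items() if up}
    running = set(actually_up)
    return sorted(wanted - running), sorted(running - wanted)
```

Under this design create-node and destroy-node only flip flags; the daemon repeatedly computes the difference and acts on it, so a failed boot or shutdown is retried on the next pass.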
#12 Updated by Peter Amstutz over 4 years ago
Problem: sbatch will fail with "Requested node configuration is not available" if you try to add a job to the queue which slurm considers unsatisfiable (e.g. --cpus-per-task=2 but every entry in slurm's compute list has cpus=1).
Workaround: submit the job with "--hold" and then use "scontrol release". This bypasses the configuration check.
#13 Updated by Peter Amstutz over 4 years ago
One drawback of using the --hold workaround is that it also means we can't detect situations where the container is genuinely unsatisfiable. The underlying issue is that slurm's belief about machine sizes is assumed to be correct on-premise, but wrong in the cloud (because nodes are reconfigured on the fly).
- Make the "hold" trick optional. Enable it in the cloud, don't use it on on-premise clusters.
- In the cloud environment, node manager has access to the node size list. It detects when a container is unsatisfiable and cancels it (or at least posts a message saying it is unsatisfiable).
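The two points above could be combined along these lines. This is only a sketch under assumptions: the function names and the `cloud` switch are hypothetical, sizes are reduced to (cores, RAM in MB) pairs, and which component performs the check (the dispatcher or node manager) is left open.

```python
def sbatch_args(cpus, ram_mb, hold):
    """Build an sbatch argv; --hold is added only when requested."""
    args = ["sbatch", "--parsable",
            "--cpus-per-task={}".format(cpus),
            "--mem={}".format(ram_mb)]
    if hold:
        args.append("--hold")
    return args

def release_args(job_id):
    """Build the argv that releases a held job."""
    return ["scontrol", "release", str(job_id)]

def is_satisfiable(cpus, ram_mb, sizes):
    """True if at least one (cores, ram_mb) size can hold the request."""
    return any(c >= cpus and m >= ram_mb for c, m in sizes)

def plan_submission(cpus, ram_mb, sizes, cloud):
    """Decide what to do with a container request.

    In the cloud, check against the known size list ourselves and use
    the hold trick; on-premise, submit normally and let slurm decide.
    """
    if cloud and not is_satisfiable(cpus, ram_mb, sizes):
        return ("cancel", None)
    return ("submit", sbatch_args(cpus, ram_mb, hold=cloud))
```

With `cloud=False` the request is passed to sbatch unchanged, so slurm's own "configuration is not available" error still catches unsatisfiable jobs on on-premise clusters.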
#14 Updated by Peter Amstutz over 4 years ago
Another idea from chat:
Set aside one or more node entries which are sized to the biggest machines available (might require two entries if most cores / biggest RAM are different node types). These nodes stay down permanently, but trick slurm into accepting jobs which would fit on those nodes. Need to ensure that these node entries don't actually get used.
#17 Updated by Peter Amstutz over 4 years ago
It turns out you can't reconfigure hardware from "scontrol". You can only set the hardware configuration in slurm.conf or (apparently) when slurmd registers with slurm controller.
Per conversation with Ward:
- Set all nodes to default to the largest node size in slurm.conf, e.g.
NodeName=DEFAULT State=UNKNOWN CPUs=20 RealMemory=80000
- Tweak API server so it never assigns slot 0, e.g.
   if self.slot_number.nil?
-    try_slot = 0
+    try_slot = 1
     begin
As a result, slurm will accept any job that does not exceed the maximum node size. However, when a node actually boots up, it should update itself in slurm with the correct node size, and will only accept correctly-sized jobs.
#21 Updated by Peter Amstutz over 4 years ago
2017-02-16_21:17:14.88880 2017-02-16 21:17:14 NodeManagerDaemonActor.23c4e7f6f7b8 ERROR: while calculating nodes wanted for size <arvnodeman.jobqueue.CloudSizeWrapper object at 0x7f5d9c757590>
2017-02-16_21:17:14.88882 Traceback (most recent call last):
2017-02-16_21:17:14.88882   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 332, in update_server_wishlist
2017-02-16_21:17:14.88883     nodes_wanted = self._nodes_wanted(size)
2017-02-16_21:17:14.88883   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 290, in _nodes_wanted
2017-02-16_21:17:14.88883     counts = self._state_counts(size)
2017-02-16_21:17:14.88884   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 256, in _state_counts
2017-02-16_21:17:14.88884     states = self._node_states(size)
2017-02-16_21:17:14.88885   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 247, in _node_states
2017-02-16_21:17:14.88885     for rec in self.cloud_nodes.nodes.itervalues()
2017-02-16_21:17:14.88885   File "/usr/lib/python2.7/dist-packages/pykka/future.py", line 330, in get_all
2017-02-16_21:17:14.88885     return [future.get(timeout=timeout) for future in futures]
2017-02-16_21:17:14.88886   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 249, in <genexpr>
2017-02-16_21:17:14.88886     rec.shutdown_actor is None))
2017-02-16_21:17:14.88886 AttributeError: 'NoneType' object has no attribute 'get_state'
#22 Updated by Peter Amstutz over 4 years ago
Tom Clegg wrote:
in ArvadosNodeListMonitorActor it looks like "alloc*" will be reported as "down" -- instead of "alloc", which is how crunch-dispatch currently propagates it. Is this OK?
You're right. Fixed. Also added "drng" and "mix" (some cores allocated) to list of "busy" states.
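For reference, the state handling described here might look roughly like the sketch below. It is not the actual ArvadosNodeListMonitorActor code: `classify` and `parse_sinfo` are hypothetical names, the "busy"/"idle"/"down" labels are illustrative, only the trailing "*" flag is stripped, and the input is assumed to be the output of `sinfo --noheader --format='%n %t'` (one hostname and one compact state per line).

```python
# States treated as busy, per the discussion above.
BUSY_STATES = {"alloc", "drng", "mix"}

def classify(sinfo_state):
    """Map a compact sinfo state to 'busy', 'idle' or 'down'."""
    state = sinfo_state.rstrip("*")  # so 'alloc*' is handled like 'alloc'
    if state in BUSY_STATES:
        return "busy"
    if state == "idle":
        return "idle"
    return "down"  # anything unrecognized is treated as down

def parse_sinfo(text):
    """Parse '<hostname> <state>' lines into a dict of classifications."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            hostname, state = fields
            result[hostname] = classify(state)
    return result
```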