Feature #6520

[Node Manager] [Crunch2] Take queued containers into account when computing how many nodes should be up

Added by Tom Clegg about 2 years ago. Updated 5 months ago.

Status: Resolved
Start date: 07/08/2015
Priority: Normal
Due date:
Assignee: Peter Amstutz
% Done: 100%
Category: Node Manager
Target version: 2017-03-01 sprint
Story points: 0.5
Remaining (hours): 0.00 hour
Velocity based estimate: -
Release: Crunch v2

Description

Add one node to the wishlist for each queued container, just like we currently add one (or more) nodes to the wishlist for queued jobs. While Crunch v2 will support running multiple containers per node, that's less critical in the cloud: as long as we can boot approximately the right size node, there's not too much overhead in just having one node per container. And it's something we can do relatively quickly with the current Node Manager code.

This won't be perfect from a scheduling perspective, especially in the interaction between Crunch v1 and Crunch v2. We expect that Crunch v2 jobs will generally "take priority" over Crunch v1 jobs, because SLURM will dispatch them from its own queue before crunch-dispatch has a chance to look and allocate nodes. We're OK with that limitation for the time being.

Node Manager should get the list of queued containers from SLURM itself, because that's the most direct source of truth about what is waiting to run. Node Manager can get information about the runtime constraints of each container either from SLURM, or from the Containers API.

Acceptance criteria:

  • Node Manager can generate a wishlist that is informed by containers in the SLURM queue. (Whether that's the existing wishlist or a new one is an implementation detail, not an acceptance criterion either way.)
  • The node sizes in that wishlist are the smallest able to meet the runtime constraints of the respective containers.
  • The Daemon actor considers these wishlist items when deciding whether or not to boot or shut down nodes, just as it does with the wishlist generated from the job queue today.
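
A minimal sketch of the first two criteria, not the Node Manager implementation. The NodeSize and Constraints records, their field names, and the use of price to rank "smallest" are all assumptions made for illustration; the CPU, memory and disk figures in the usage note are the small/medium/big values from note #5.

```python
from collections import namedtuple

# Hypothetical record types; field names are illustrative, not Node Manager's.
NodeSize = namedtuple("NodeSize", "name cpus ram_mb scratch_mb price")
Constraints = namedtuple("Constraints", "cpus ram_mb scratch_mb")


def smallest_fit(sizes, want):
    """Return the cheapest size that satisfies `want`, or None if none does.

    Assumption: "smallest" is interpreted as "lowest price".
    """
    fits = [
        s for s in sizes
        if s.cpus >= want.cpus
        and s.ram_mb >= want.ram_mb
        and s.scratch_mb >= want.scratch_mb
    ]
    return min(fits, key=lambda s: s.price) if fits else None


def build_wishlist(sizes, queued):
    """One wishlist entry per queued container that some size can satisfy."""
    wishlist = []
    for want in queued:
        size = smallest_fit(sizes, want)
        if size is not None:
            wishlist.append(size)
    return wishlist
```

For example, with sizes small (1 CPU, 1024 MB), medium (2 CPUs, 2048 MB) and big (4 CPUs, 4096 MB), a container asking for 2 CPUs and 1500 MB maps to medium, and a container asking for 8 CPUs gets no wishlist entry.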

Implementation notes:

  • Node Manager will use sinfo to determine node status (alloc/idle/drained/down) instead of using the information from the nodes table. A crunch v2 installation won't store node state in the nodes table; other tools like Workbench will be modified accordingly.

Subtasks

Task #11031: Review 6520-nodemanager-crunchv2 (Resolved, Peter Amstutz)

Task #11106: Review 6520-skip-compute0 (Resolved, Peter Amstutz)

Task #11061: crunch-dispatch-slurm running on cloud clusters (Resolved, Nico César)


Related issues

Related to Arvados - Story #6282: [Crunch] Write stories for implementation of Crunch v2 Resolved 06/23/2015
Blocked by Arvados - Story #6429: [API] [Crunch2] Implement "containers" and "container req... Resolved 12/03/2015

Associated revisions

Revision 24b1aecb
Added by Peter Amstutz 5 months ago

Merge branch '6520-skip-compute0' refs #6520

Revision 3fa4a2b6
Added by Peter Amstutz 5 months ago

Merge branch '6520-nodemanager-crunchv2' refs #6520

Revision 8b75947e
Added by Peter Amstutz 5 months ago

Merge branch '6520-pending-reason' refs #6520

Revision 91118e3a
Added by Peter Amstutz 4 months ago

Add missing documentation file. refs #6520

Revision 49510014
Added by Tom Clegg 4 months ago

Fix broken link from crunch2 to crunch1 docs.

refs #6520

History

#1 Updated by Tom Clegg about 2 years ago

  • Story points set to 1.0

#2 Updated by Brett Smith over 1 year ago

  • Target version set to Arvados Future Sprints

#4 Updated by Brett Smith over 1 year ago

  • Target version deleted (Arvados Future Sprints)
  • Release set to 11

#5 Updated by Peter Amstutz over 1 year ago

I've been studying the slurm elastic computing capability in more detail. I'm now leaning towards continuing to use slurm in the cloud. I think some of the problems we've had with slurm are due to not implementing dynamic nodes in the recommended way.

Here's the gist of the configuration:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
SuspendTime=600
ResumeProgram=/usr/bin/create-node
SuspendProgram=/usr/bin/destroy-node

NodeName=small[0-255] Weight=1 Feature=cloud State=CLOUD CPUs=1 RealMemory=1024 TmpDisk=10000
NodeName=medium[0-255] Weight=2 Feature=cloud State=CLOUD CPUs=2 RealMemory=2048 TmpDisk=20000
NodeName=big[0-255] Weight=4 Feature=cloud State=CLOUD CPUs=4 RealMemory=4096 TmpDisk=40000

When slurm needs some nodes, it will call ResumeProgram with a list of the nodes it wants, e.g.

/usr/bin/create-node medium0 medium1 medium2 medium3

The create-node script is responsible for updating the node record so that slurmctld can establish communication with the node:

scontrol update nodename=medium0 nodeaddr=123.45.67.80 nodehostname=medium0
scontrol update nodename=medium1 nodeaddr=123.45.67.81 nodehostname=medium1
scontrol update nodename=medium2 nodeaddr=123.45.67.82 nodehostname=medium2
scontrol update nodename=medium3 nodeaddr=123.45.67.83 nodehostname=medium3

After nodes are idle for some period (SuspendTime) they will be shut down:

/usr/bin/destroy-node medium0 medium1 medium2 medium3

It's unclear what the role of node manager should be in this set up. I see two possibilities:

1) Eliminate node manager as a separate daemon, make two separate programs out of NodeSetupActor and NodeShutdownActor and discard the rest. This has a number of failure modes, but I mention it because it might be quicker and less disruptive to implement (though possibly less robust in the long run).

2) Keep the node manager daemon, but change the architecture so that the "nodes" table on the API server has a static list of nodes and includes a flag whether each node should be up or down (so in the above example, there would be 768 entries). The create-node and destroy-node programs could just use "arv" to set the flag for the appropriate node on the nodes table. I prefer this option; the nodes table would become the single source of truth about what should be up or down and then node manager is responsible for converging on the desired state. It might also make sense for node manager to write slurm.conf to ensure that it is sync'd up with the actual contents of the nodes table.

#6 Updated by Brett Smith about 1 year ago

  • Description updated (diff)

#7 Updated by Brett Smith about 1 year ago

  • Story points changed from 1.0 to 2.0

#8 Updated by Peter Amstutz 6 months ago

  • Description updated (diff)

#9 Updated by Peter Amstutz 6 months ago

Disregard #note-5

#10 Updated by Tom Morris 6 months ago

  • Assignee set to Peter Amstutz
  • Target version set to 2017-02-15 sprint

#11 Updated by Peter Amstutz 6 months ago

job cpus, memory, temporary disk space, reason:

squeue --state=PENDING --noheader --format="%c %m %d %r" 

node hostname, state, cpu, memory, temp disk space

sinfo --noheader --format="%n %t %c %m %d" 
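
A rough sketch of how the output of these two commands could be split into fields, assuming the exact format strings above. The function names and dictionary keys are hypothetical. Values are kept as strings because the numeric formatting can differ between Slurm versions, so conversion is left to the caller.

```python
def parse_squeue(text):
    """Parse lines in '%c %m %d %r' order: cpus, memory, tmp disk, reason.

    The reason is taken as the rest of the line, since it may contain spaces.
    """
    jobs = []
    for line in text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        cpus, mem, disk, reason = fields
        jobs.append({"cpus": cpus, "mem": mem, "disk": disk,
                     "reason": reason.strip()})
    return jobs


def parse_sinfo(text):
    """Parse lines in '%n %t %c %m %d' order: hostname, state, cpus, memory, tmp disk."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        hostname, state, cpus, mem, disk = fields
        nodes[hostname] = {"state": state, "cpus": cpus,
                           "mem": mem, "disk": disk}
    return nodes
```

In practice the text would come from running the commands, for example with subprocess.check_output; here it is passed in so the parsing can be exercised on its own.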

#12 Updated by Peter Amstutz 6 months ago

Problem: sbatch will fail with "Requested node configuration is not available" if you try to add a job to the queue which slurm considers unsatisfiable (e.g. --cpus-per-task=2 but every entry in slurm's compute list has cpus=1).

Workaround:

Submit the job with "--hold" and then use "scontrol release". This bypasses the configuration check.
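
A sketch of the two-step workaround as command lines, built but not executed. The helper names are hypothetical, and the --parsable flag (which makes sbatch print just the job id, optionally followed by ";cluster") is my addition for capturing the id, not something mentioned in this note.

```python
def held_submit_argv(script_path, cpus):
    """Command line that submits a job in the held state."""
    return ["sbatch", "--hold", "--parsable",
            "--cpus-per-task=%d" % cpus, script_path]


def job_id_from_parsable(output):
    """Extract the job id from 'jobid' or 'jobid;cluster' output."""
    return output.strip().split(";")[0]


def release_argv(job_id):
    """Command line that releases a previously held job."""
    return ["scontrol", "release", str(job_id)]
```

A caller would run the first command, pass its standard output through job_id_from_parsable, and then run the release command with that id.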

#13 Updated by Peter Amstutz 6 months ago

One drawback of using the --hold workaround is that it also means we can't detect situations where the container is genuinely unsatisfiable. The underlying issue is that slurm's belief about machine sizes is assumed to be correct on-premise, but wrong in the cloud (because nodes are reconfigured on the fly).

Possible solution:

  • Make the "hold" trick optional. Enable it in the cloud, don't use it on on-premise clusters.
  • In the cloud environment, node manager has access to the node size list. It detects when a container is unsatisfiable and cancels it (or at least posts a message saying it is unsatisfiable).

#14 Updated by Peter Amstutz 6 months ago

Another idea from chat:

Set aside one or more node entries which are sized to the biggest machines available (this might require two entries if the node type with the most cores and the node type with the most RAM are different). These nodes stay down permanently, but trick slurm into accepting jobs which would fit on them. We need to ensure that these node entries don't actually get used.

#15 Updated by Tom Clegg 6 months ago

Ideally nodemanager will
  • ensure the slurm config for the fake node matches the characteristics of the largest node it knows how to bring up
  • ensure the API server's "node ping" action never assigns the fake node's hostname to a real node
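
To illustrate the "might require two entries" case from note #14: a sketch that picks which node shapes would need a placeholder entry, assuming each size is described only by a (cpus, ram_mb) pair. The function name and the pair representation are hypothetical. A shape needs an entry only if no other shape is at least as large in both dimensions.

```python
def placeholder_shapes(sizes):
    """Return the (cpus, ram_mb) shapes not dominated by any other shape."""
    shapes = set(sizes)
    return sorted(
        s for s in shapes
        if not any(
            o != s and o[0] >= s[0] and o[1] >= s[1]
            for o in shapes
        )
    )
```

With the small/medium/big sizes from note #5 this yields a single shape, (4, 4096). If one node type had the most cores and another the most RAM, both would be returned.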

#16 Updated by Peter Amstutz 5 months ago

  • Status changed from New to In Progress

#17 Updated by Peter Amstutz 5 months ago

It turns out you can't reconfigure hardware from "scontrol". You can only set the hardware configuration in slurm.conf or (apparently) when slurmd registers with the slurm controller.

Per conversation with Ward:

  1. Set all nodes to default to the largest node size in slurm.conf, e.g.
    NodeName=DEFAULT State=UNKNOWN CPUs=20 RealMemory=80000
    
  2. Tweak API server so it never assigns slot 0, e.g.
         if self.slot_number.nil?
    -      try_slot = 0
    +      try_slot = 1
           begin
    

As a result, slurm will accept any job that does not exceed the maximum node size. However, when a node actually boots up, it should update itself in slurm with the correct node size, and will only accept correctly-sized jobs.

#18 Updated by Peter Amstutz 5 months ago

Skipping compute0 ensures that even if every other node (compute1-compute255) in slurm is updated to a smaller actual node size, slurm still accepts jobs wanting the maximum node size (because compute0 remains the maximum size).

#19 Updated by Peter Amstutz 5 months ago

  • Target version changed from 2017-02-15 sprint to 2017-03-01 sprint
  • Story points changed from 2.0 to 0.5

#20 Updated by Tom Clegg 5 months ago

in ArvadosNodeListMonitorActor it looks like "alloc*" will be reported as "down" -- instead of "alloc", which is how crunch-dispatch currently propagates it. Is this OK?

LGTM

#21 Updated by Peter Amstutz 5 months ago

2017-02-16_21:17:14.88880 2017-02-16 21:17:14 NodeManagerDaemonActor.23c4e7f6f7b8[48231] ERROR: while calculating nodes wanted for size <arvnodeman.jobqueue.CloudSizeWrapper object at 0x7f5d9c757590>
2017-02-16_21:17:14.88882 Traceback (most recent call last):
2017-02-16_21:17:14.88882   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 332, in update_server_wishlist
2017-02-16_21:17:14.88883     nodes_wanted = self._nodes_wanted(size)
2017-02-16_21:17:14.88883   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 290, in _nodes_wanted
2017-02-16_21:17:14.88883     counts = self._state_counts(size)
2017-02-16_21:17:14.88884   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 256, in _state_counts
2017-02-16_21:17:14.88884     states = self._node_states(size)
2017-02-16_21:17:14.88885   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 247, in _node_states
2017-02-16_21:17:14.88885     for rec in self.cloud_nodes.nodes.itervalues()
2017-02-16_21:17:14.88885   File "/usr/lib/python2.7/dist-packages/pykka/future.py", line 330, in get_all
2017-02-16_21:17:14.88885     return [future.get(timeout=timeout) for future in futures]
2017-02-16_21:17:14.88886   File "/usr/lib/python2.7/dist-packages/arvnodeman/daemon.py", line 249, in <genexpr>
2017-02-16_21:17:14.88886     rec.shutdown_actor is None))
2017-02-16_21:17:14.88886 AttributeError: 'NoneType' object has no attribute 'get_state'

#22 Updated by Peter Amstutz 5 months ago

Tom Clegg wrote:

in ArvadosNodeListMonitorActor it looks like "alloc*" will be reported as "down" -- instead of "alloc", which is how crunch-dispatch currently propagates it. Is this OK?

You're right. Fixed. Also added "drng" and "mix" (some cores allocated) to list of "busy" states.
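
A sketch of the kind of mapping discussed here, not the actual ArvadosNodeListMonitorActor code. The function name is hypothetical, and treating every state other than the listed ones and "idle" as "down" is an assumption.

```python
# sinfo compact states treated as busy, per the discussion above:
# "alloc*" is counted along with "alloc", plus "drng" and "mix".
BUSY_STATES = frozenset(["alloc", "alloc*", "drng", "mix"])


def classify(sinfo_state):
    """Map a compact sinfo state to 'busy', 'idle' or 'down'."""
    if sinfo_state in BUSY_STATES:
        return "busy"
    if sinfo_state == "idle":
        return "idle"
    # Assumption: anything else is treated as unavailable.
    return "down"
```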

#23 Updated by Peter Amstutz 5 months ago

  • Status changed from In Progress to Resolved
