Bug #12057
Updated by Peter Amstutz over 7 years ago
While running the node manager stress test, I noticed the following behavior from slurm:
<pre>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14269 compute c97qk-dz root PD 0:00 1 (Resources)
14270 compute c97qk-dz root PD 0:00 1 (Priority)
14271 compute c97qk-dz root PD 0:00 1 (Priority)
14272 compute c97qk-dz root PD 0:00 1 (Priority)
14273 compute c97qk-dz root PD 0:00 1 (Priority)
14274 compute c97qk-dz root PD 0:00 1 (Priority)
14275 compute c97qk-dz root PD 0:00 1 (Priority)
14276 compute c97qk-dz root PD 0:00 1 (Priority)
14277 compute c97qk-dz root PD 0:00 1 (Priority)
14278 compute c97qk-dz root PD 0:00 1 (Priority)
14279 compute c97qk-dz root PD 0:00 1 (Priority)
14280 compute c97qk-dz root PD 0:00 1 (Priority)
14281 compute c97qk-dz root PD 0:00 1 (Priority)
14282 compute c97qk-dz root PD 0:00 1 (Priority)
14283 compute c97qk-dz root PD 0:00 1 (Priority)
14284 compute c97qk-dz root PD 0:00 1 (Priority)
14285 compute c97qk-dz root PD 0:00 1 (Priority)
14286 compute c97qk-dz root PD 0:00 1 (Priority)
14287 compute c97qk-dz root PD 0:00 1 (Priority)
14288 compute c97qk-dz root PD 0:00 1 (Priority)
14289 compute c97qk-dz root PD 0:00 1 (Priority)
14290 compute c97qk-dz root PD 0:00 1 (Priority)
14291 compute c97qk-dz root PD 0:00 1 (Priority)
14292 compute c97qk-dz root PD 0:00 1 (Priority)
14293 compute c97qk-dz root PD 0:00 1 (Priority)
14294 compute c97qk-dz root PD 0:00 1 (Priority)
14296 compute c97qk-dz root PD 0:00 1 (Priority)
14297 compute c97qk-dz root PD 0:00 1 (Priority)
14298 compute c97qk-dz root PD 0:00 1 (Priority)
14299 compute c97qk-dz root PD 0:00 1 (Priority)
14300 compute c97qk-dz root PD 0:00 1 (Priority)
14302 compute c97qk-dz root PD 0:00 1 (Priority)
14303 compute c97qk-dz root PD 0:00 1 (Priority)
14306 compute c97qk-dz root PD 0:00 1 (Priority)
14307 compute c97qk-dz root PD 0:00 1 (Priority)
14308 compute c97qk-dz root PD 0:00 1 (Priority)
14309 compute c97qk-dz root PD 0:00 1 (Priority)
14310 compute c97qk-dz root PD 0:00 1 (Priority)
14311 compute c97qk-dz root PD 0:00 1 (Priority)
14312 compute c97qk-dz root PD 0:00 1 (Priority)
14186 compute c97qk-dz root R 20:12 1 compute8
14259 compute c97qk-dz root R 0:59 1 compute3
14260 compute c97qk-dz root R 0:58 1 compute13
14261 compute c97qk-dz root R 0:51 1 compute4
14262 compute c97qk-dz root R 0:50 1 compute5
14263 compute c97qk-dz root R 0:42 1 compute6
14264 compute c97qk-dz root R 0:30 1 compute9
14265 compute c97qk-dz root R 0:30 1 compute12
14266 compute c97qk-dz root R 0:28 1 compute11
14267 compute c97qk-dz root R 0:27 1 compute10
14268 compute c97qk-dz root R 0:17 1 compute7
</pre>
Slurm marks only one pending job as limited by (Resources); the rest are marked (Priority). Currently node manager only boots new nodes for jobs marked (Resources) and does not recognize (Priority). The effect is that it dribbles out one new node at a time instead of booting many nodes at once, despite a deep queue.
Node manager should create nodes for slurm jobs marked (Priority).
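A minimal sketch of the broader queue check, assuming we shell out to squeue with the %t (state) and %r (reason) format fields; the function name and structure are illustrative, not the actual node manager code:
<pre>
import subprocess

# Pending-state reasons that should trigger booting a new node.
# Today only "Resources" is recognized; this adds "Priority".
BOOT_REASONS = {"Resources", "Priority"}

def count_pending_jobs():
    """Count pending slurm jobs whose reason indicates a node is needed."""
    out = subprocess.check_output(
        ["squeue", "--noheader", "--format=%t %r"]).decode()
    pending = 0
    for line in out.splitlines():
        state, _, reason = line.strip().partition(" ")
        if state == "PD" and reason in BOOT_REASONS:
            pending += 1
    return pending
</pre>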
Idea: to avoid over-shooting and booting excess nodes (which can happen if a running job completes and its node is assigned to a queued job, so no new node is needed), I suggest returning a desired additional node count that is 75% of the pending job count (rounded up). For example, if 4 nodes were needed, it would boot 3 nodes on the first round, then 1 node on the next round.
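The 75% ramp could look something like this (a sketch only; <code>nodes_wanted</code> is a made-up helper name, not an existing function):
<pre>
import math

def nodes_wanted(pending_jobs):
    """Request 75% of the pending job count, rounded up.

    Slightly under-shooting lets jobs that land on freed-up nodes be
    absorbed before the next round, avoiding excess boots.
    """
    return int(math.ceil(0.75 * pending_jobs))

# With 4 jobs pending: round 1 boots ceil(0.75 * 4) = 3 nodes; once
# those are busy, 1 job remains pending, so round 2 boots
# ceil(0.75 * 1) = 1 node.
</pre>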
Alternatively, we could cap the number of nodes created per round (perhaps at 10).
It may also be desirable to throttle node creation to avoid hitting API rate limits (rounds are about 60 seconds apart).
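Combining those two alternatives, a per-round cap plus a simple creation throttle might look like the sketch below; <code>MAX_BOOTS_PER_ROUND</code> and <code>BOOT_INTERVAL</code> are illustrative knobs, not existing config options:
<pre>
import time

MAX_BOOTS_PER_ROUND = 10   # cap on nodes created in one ~60 second round
BOOT_INTERVAL = 5          # seconds between create calls, to stay under API rate limits

def boot_nodes(count, create_node):
    """Create up to MAX_BOOTS_PER_ROUND nodes, pausing between cloud API calls."""
    for _ in range(min(count, MAX_BOOTS_PER_ROUND)):
        create_node()            # cloud node-create call supplied by the caller
        time.sleep(BOOT_INTERVAL)
</pre>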