Bug #12057
closed
[node manager] boot nodes for jobs marked (Priority)
Description
While running the node manager stress test, I noticed the following behavior from Slurm:
JOBID PARTITION NAME     USER ST TIME  NODES NODELIST(REASON)
14269 compute   c97qk-dz root PD 0:00  1     (Resources)
14270 compute   c97qk-dz root PD 0:00  1     (Priority)
14271 compute   c97qk-dz root PD 0:00  1     (Priority)
14272 compute   c97qk-dz root PD 0:00  1     (Priority)
14273 compute   c97qk-dz root PD 0:00  1     (Priority)
14274 compute   c97qk-dz root PD 0:00  1     (Priority)
14275 compute   c97qk-dz root PD 0:00  1     (Priority)
14276 compute   c97qk-dz root PD 0:00  1     (Priority)
14277 compute   c97qk-dz root PD 0:00  1     (Priority)
14278 compute   c97qk-dz root PD 0:00  1     (Priority)
14279 compute   c97qk-dz root PD 0:00  1     (Priority)
14280 compute   c97qk-dz root PD 0:00  1     (Priority)
14281 compute   c97qk-dz root PD 0:00  1     (Priority)
14282 compute   c97qk-dz root PD 0:00  1     (Priority)
14283 compute   c97qk-dz root PD 0:00  1     (Priority)
14284 compute   c97qk-dz root PD 0:00  1     (Priority)
14285 compute   c97qk-dz root PD 0:00  1     (Priority)
14286 compute   c97qk-dz root PD 0:00  1     (Priority)
14287 compute   c97qk-dz root PD 0:00  1     (Priority)
14288 compute   c97qk-dz root PD 0:00  1     (Priority)
14289 compute   c97qk-dz root PD 0:00  1     (Priority)
14290 compute   c97qk-dz root PD 0:00  1     (Priority)
14291 compute   c97qk-dz root PD 0:00  1     (Priority)
14292 compute   c97qk-dz root PD 0:00  1     (Priority)
14293 compute   c97qk-dz root PD 0:00  1     (Priority)
14294 compute   c97qk-dz root PD 0:00  1     (Priority)
14296 compute   c97qk-dz root PD 0:00  1     (Priority)
14297 compute   c97qk-dz root PD 0:00  1     (Priority)
14298 compute   c97qk-dz root PD 0:00  1     (Priority)
14299 compute   c97qk-dz root PD 0:00  1     (Priority)
14300 compute   c97qk-dz root PD 0:00  1     (Priority)
14302 compute   c97qk-dz root PD 0:00  1     (Priority)
14303 compute   c97qk-dz root PD 0:00  1     (Priority)
14306 compute   c97qk-dz root PD 0:00  1     (Priority)
14307 compute   c97qk-dz root PD 0:00  1     (Priority)
14308 compute   c97qk-dz root PD 0:00  1     (Priority)
14309 compute   c97qk-dz root PD 0:00  1     (Priority)
14310 compute   c97qk-dz root PD 0:00  1     (Priority)
14311 compute   c97qk-dz root PD 0:00  1     (Priority)
14312 compute   c97qk-dz root PD 0:00  1     (Priority)
14186 compute   c97qk-dz root R  20:12 1     compute8
14259 compute   c97qk-dz root R  0:59  1     compute3
14260 compute   c97qk-dz root R  0:58  1     compute13
14261 compute   c97qk-dz root R  0:51  1     compute4
14262 compute   c97qk-dz root R  0:50  1     compute5
14263 compute   c97qk-dz root R  0:42  1     compute6
14264 compute   c97qk-dz root R  0:30  1     compute9
14265 compute   c97qk-dz root R  0:30  1     compute12
14266 compute   c97qk-dz root R  0:28  1     compute11
14267 compute   c97qk-dz root R  0:27  1     compute10
14268 compute   c97qk-dz root R  0:17  1     compute7
Slurm marks only one pending job as limited by (Resources); the rest are marked (Priority). Currently node manager boots new nodes only for jobs marked (Resources) and does not recognize (Priority). As a result, it dribbles out one new node at a time instead of booting many nodes at once, even when the queue is deep.
Node manager should also create nodes for Slurm jobs marked (Priority).
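To illustrate the requested behavior, here is a minimal sketch of counting pending jobs from queue listing lines like the ones above. It is not the node manager's actual code: the function name `jobs_needing_nodes` and the constant `WANTED_REASONS` are hypothetical, and it assumes the eight whitespace-separated columns shown in the listing.

```python
# Hypothetical sketch, not the actual node manager implementation.
# Assumes the column layout shown in the listing above:
# JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

WANTED_REASONS = {"(Resources)", "(Priority)"}


def jobs_needing_nodes(squeue_lines, reasons=WANTED_REASONS):
    """Return the IDs of pending jobs whose reason is in `reasons`."""
    job_ids = []
    for line in squeue_lines:
        # Split into at most 8 fields so the last column stays whole
        # even if the reason text contains spaces.
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0] == "JOBID":
            continue  # skip the header and malformed lines
        job_id, state, reason = fields[0], fields[4], fields[7].strip()
        if state == "PD" and reason in reasons:
            job_ids.append(job_id)
    return job_ids


sample = [
    "JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)",
    "14269 compute c97qk-dz root PD 0:00 1 (Resources)",
    "14270 compute c97qk-dz root PD 0:00 1 (Priority)",
    "14271 compute c97qk-dz root PD 0:00 1 (Priority)",
    "14186 compute c97qk-dz root R 20:12 1 compute8",
]
```

Passing `{"(Resources)"}` as `reasons` reproduces the behavior described in this report, where only one job in a deep queue is counted.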
Note: it might be a good idea to throttle node creation to help avoid API rate limits, and also avoid over-shooting and booting excessive new nodes (which can happen if a job completes and a queued job is assigned to an existing node, so that no new node is needed).
Suggest capping the number of nodes per type created in a round (perhaps 10).
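The suggested cap could look roughly like the sketch below. Everything here is an assumption for illustration: the function `nodes_to_boot`, its parameters, and the idea of subtracting nodes that are already booting or idle are not taken from the node manager code. Only the default of 10 comes from the suggestion above.

```python
from collections import Counter


def nodes_to_boot(wanted_types, already_available, cap_per_type=10):
    """Plan how many nodes of each type to create in one round.

    wanted_types:      list of node type names, one entry per queued job
                       (hypothetical input format).
    already_available: mapping of node type -> nodes already booting or
                       idle, which can absorb queued jobs.
    cap_per_type:      maximum nodes of one type to create per round.
    """
    wanted = Counter(wanted_types)
    plan = {}
    for node_type, count in wanted.items():
        shortfall = count - already_available.get(node_type, 0)
        if shortfall > 0:
            # Throttle creation to limit API calls and over-shooting.
            plan[node_type] = min(shortfall, cap_per_type)
    return plan


queue = ["small"] * 40 + ["large"] * 3
plan = nodes_to_boot(queue, {"small": 2})
```

Because the cap applies per round, a deep queue is still served over successive rounds, and each round can re-check the queue so that jobs picked up by existing nodes no longer count toward the shortfall.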
Updated by Peter Amstutz over 7 years ago
12057-nodemanager-priority @ 43afe5cf4364c64b5022f912eaba2240c7cb0999
Updated by Tom Clegg over 7 years ago
Updated by Tom Morris over 7 years ago
- Status changed from New to In Progress
- Target version set to 2017-08-02 sprint
Updated by Tom Morris over 7 years ago
Does this mean that #11545 is in progress?
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-08-02 sprint to 2017-08-16 sprint
Updated by Peter Amstutz over 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|commit:962a6b5be87305054a87b665dcc85d144840bb98.