Bug #12057

[node manager] boot nodes for jobs marked (Priority)

Added by Peter Amstutz over 6 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: -
Target version: 2017-08-16 sprint
Story points: -
Description

While running the node manager stress test, I noticed the following squeue output from Slurm:

            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             14269   compute c97qk-dz     root PD       0:00      1 (Resources)
             14270   compute c97qk-dz     root PD       0:00      1 (Priority)
             14271   compute c97qk-dz     root PD       0:00      1 (Priority)
             14272   compute c97qk-dz     root PD       0:00      1 (Priority)
             14273   compute c97qk-dz     root PD       0:00      1 (Priority)
             14274   compute c97qk-dz     root PD       0:00      1 (Priority)
             14275   compute c97qk-dz     root PD       0:00      1 (Priority)
             14276   compute c97qk-dz     root PD       0:00      1 (Priority)
             14277   compute c97qk-dz     root PD       0:00      1 (Priority)
             14278   compute c97qk-dz     root PD       0:00      1 (Priority)
             14279   compute c97qk-dz     root PD       0:00      1 (Priority)
             14280   compute c97qk-dz     root PD       0:00      1 (Priority)
             14281   compute c97qk-dz     root PD       0:00      1 (Priority)
             14282   compute c97qk-dz     root PD       0:00      1 (Priority)
             14283   compute c97qk-dz     root PD       0:00      1 (Priority)
             14284   compute c97qk-dz     root PD       0:00      1 (Priority)
             14285   compute c97qk-dz     root PD       0:00      1 (Priority)
             14286   compute c97qk-dz     root PD       0:00      1 (Priority)
             14287   compute c97qk-dz     root PD       0:00      1 (Priority)
             14288   compute c97qk-dz     root PD       0:00      1 (Priority)
             14289   compute c97qk-dz     root PD       0:00      1 (Priority)
             14290   compute c97qk-dz     root PD       0:00      1 (Priority)
             14291   compute c97qk-dz     root PD       0:00      1 (Priority)
             14292   compute c97qk-dz     root PD       0:00      1 (Priority)
             14293   compute c97qk-dz     root PD       0:00      1 (Priority)
             14294   compute c97qk-dz     root PD       0:00      1 (Priority)
             14296   compute c97qk-dz     root PD       0:00      1 (Priority)
             14297   compute c97qk-dz     root PD       0:00      1 (Priority)
             14298   compute c97qk-dz     root PD       0:00      1 (Priority)
             14299   compute c97qk-dz     root PD       0:00      1 (Priority)
             14300   compute c97qk-dz     root PD       0:00      1 (Priority)
             14302   compute c97qk-dz     root PD       0:00      1 (Priority)
             14303   compute c97qk-dz     root PD       0:00      1 (Priority)
             14306   compute c97qk-dz     root PD       0:00      1 (Priority)
             14307   compute c97qk-dz     root PD       0:00      1 (Priority)
             14308   compute c97qk-dz     root PD       0:00      1 (Priority)
             14309   compute c97qk-dz     root PD       0:00      1 (Priority)
             14310   compute c97qk-dz     root PD       0:00      1 (Priority)
             14311   compute c97qk-dz     root PD       0:00      1 (Priority)
             14312   compute c97qk-dz     root PD       0:00      1 (Priority)
             14186   compute c97qk-dz     root  R      20:12      1 compute8
             14259   compute c97qk-dz     root  R       0:59      1 compute3
             14260   compute c97qk-dz     root  R       0:58      1 compute13
             14261   compute c97qk-dz     root  R       0:51      1 compute4
             14262   compute c97qk-dz     root  R       0:50      1 compute5
             14263   compute c97qk-dz     root  R       0:42      1 compute6
             14264   compute c97qk-dz     root  R       0:30      1 compute9
             14265   compute c97qk-dz     root  R       0:30      1 compute12
             14266   compute c97qk-dz     root  R       0:28      1 compute11
             14267   compute c97qk-dz     root  R       0:27      1 compute10
             14268   compute c97qk-dz     root  R       0:17      1 compute7

Slurm marks only one pending job as limited by (Resources); the rest are marked (Priority). Node manager currently boots new nodes only for jobs marked (Resources) and does not recognize (Priority). The effect is that new nodes dribble out one at a time instead of many nodes booting at once, despite a deep queue.

Node manager should create nodes for Slurm jobs marked (Priority) as well.
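
A minimal sketch of this change, assuming a hypothetical helper around squeue (the function name, output parsing, and wishlist integration are illustrative, not the actual node manager code):

    import subprocess

    # Pending-state reasons that should count toward the node wishlist.
    # Node manager currently recognizes only "Resources"; jobs marked
    # "Priority" are also waiting for capacity and should count too.
    WISHLIST_REASONS = {"Resources", "Priority"}

    def pending_jobs_needing_nodes():
        """Return IDs of pending Slurm jobs whose reason indicates they
        are waiting on cluster capacity (hypothetical helper)."""
        out = subprocess.check_output(
            ["squeue", "--state=PENDING", "--noheader",
             "--format=%i|%t|%r"]).decode()
        jobs = []
        for line in out.splitlines():
            jobid, state, reason = line.split("|", 2)
            if state.strip() == "PD" and reason.strip() in WISHLIST_REASONS:
                jobs.append(jobid.strip())
        return jobs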

Note: it might be a good idea to throttle node creation, both to avoid cloud API rate limits and to avoid overshooting (booting more new nodes than needed, which can happen when a running job completes and a queued job is assigned to the freed node, so no new node is required).

Suggest capping the number of nodes per type created in a round (perhaps 10).
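
A sketch of the suggested cap, assuming the wishlist is a list of cloud size objects with a name attribute (MAX_BOOTS_PER_TYPE and capped_wishlist are hypothetical names, not existing node manager code):

    from collections import Counter

    MAX_BOOTS_PER_TYPE = 10  # suggested per-round cap from this ticket

    def capped_wishlist(wishlist):
        """Limit how many nodes of each size are requested in one
        round, smoothing cloud API usage and reducing overshoot when
        queued jobs end up fitting on existing nodes."""
        booted = Counter()
        capped = []
        for size in wishlist:
            if booted[size.name] < MAX_BOOTS_PER_TYPE:
                booted[size.name] += 1
                capped.append(size)
        return capped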


Related issues

Related to Arvados - Idea #11545: Create a CWL stress test for node manager (Resolved, Peter Amstutz, 04/25/2017)
#1

Updated by Peter Amstutz over 6 years ago

  • Description updated (diff)
#2

Updated by Peter Amstutz over 6 years ago

  • Description updated (diff)
#3

Updated by Peter Amstutz over 6 years ago

  • Description updated (diff)
#4

Updated by Peter Amstutz over 6 years ago

  • Description updated (diff)
#5

Updated by Peter Amstutz over 6 years ago

12057-nodemanager-priority @ 43afe5cf4364c64b5022f912eaba2240c7cb0999

#6

Updated by Tom Clegg over 6 years ago

Peter Amstutz wrote:

    12057-nodemanager-priority @ 43afe5cf4364c64b5022f912eaba2240c7cb0999

LGTM

#7

Updated by Tom Morris over 6 years ago

  • Status changed from New to In Progress
  • Target version set to 2017-08-02 sprint
#8

Updated by Tom Morris over 6 years ago

Does this mean that #11545 is in progress?

#9

Updated by Peter Amstutz over 6 years ago

  • Target version changed from 2017-08-02 sprint to 2017-08-16 sprint
#10

Updated by Peter Amstutz over 6 years ago

  • Assigned To set to Peter Amstutz
#11

Updated by Peter Amstutz over 6 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:962a6b5be87305054a87b665dcc85d144840bb98.
