Bug #12057

Updated by Peter Amstutz over 7 years ago

While running the node manager stress test, I noticed the following behavior from Slurm: 

 <pre> 
             JOBID PARTITION       NAME       USER ST         TIME    NODES NODELIST(REASON) 
              14269     compute c97qk-dz       root PD         0:00        1 (Resources) 
              14270     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14271     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14272     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14273     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14274     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14275     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14276     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14277     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14278     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14279     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14280     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14281     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14282     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14283     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14284     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14285     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14286     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14287     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14288     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14289     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14290     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14291     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14292     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14293     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14294     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14296     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14297     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14298     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14299     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14300     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14302     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14303     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14306     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14307     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14308     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14309     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14310     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14311     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14312     compute c97qk-dz       root PD         0:00        1 (Priority) 
              14186     compute c97qk-dz       root    R        20:12        1 compute8 
              14259     compute c97qk-dz       root    R         0:59        1 compute3 
              14260     compute c97qk-dz       root    R         0:58        1 compute13 
              14261     compute c97qk-dz       root    R         0:51        1 compute4 
              14262     compute c97qk-dz       root    R         0:50        1 compute5 
              14263     compute c97qk-dz       root    R         0:42        1 compute6 
              14264     compute c97qk-dz       root    R         0:30        1 compute9 
              14265     compute c97qk-dz       root    R         0:30        1 compute12 
              14266     compute c97qk-dz       root    R         0:28        1 compute11 
              14267     compute c97qk-dz       root    R         0:27        1 compute10 
              14268     compute c97qk-dz       root    R         0:17        1 compute7 
 </pre> 

 Slurm marks only one pending job as limited by (Resources); the rest are marked (Priority). Currently node manager only boots new nodes for jobs marked (Resources) and does not recognize (Priority). The effect is that new nodes dribble out one at a time instead of being booted in bulk, despite a deep queue. 

 Node manager should also create nodes for Slurm jobs marked (Priority). 

 Idea: to avoid overshooting and booting excessive nodes (which can happen if a job completes and its node is assigned to a queued job, so no new node is needed), return a desired additional node count equal to 75% of the pending job count (rounded up). For example, if 4 nodes were needed, it would boot 3 nodes on the first round, then 1 node on the next round. 

 Alternately, we could cap the number of nodes created in a single round (perhaps at 10); see the sketch below. 
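
 A minimal sketch of the proposed heuristic, combining the 75%-of-pending rule with an optional per-round cap. The function and constant names here are hypothetical illustrations, not existing node manager code: 

 <pre> 
import math

# Hypothetical cap on nodes booted per round (the "perhaps 10" above).
MAX_NODES_PER_ROUND = 10

def nodes_to_boot(pending_jobs):
    """Desired additional node count: 75% of pending jobs, rounded up,
    capped per round."""
    desired = math.ceil(0.75 * pending_jobs)
    return min(desired, MAX_NODES_PER_ROUND)

# Worked example from above: 4 nodes needed.
# Round 1: ceil(0.75 * 4) = 3 nodes booted, 1 job still pending.
# Round 2: ceil(0.75 * 1) = 1 node booted.
print(nodes_to_boot(4))  # 3
print(nodes_to_boot(1))  # 1
 </pre> 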

 It may also be desirable to throttle node creation to avoid API rate limits. 

 (Rounds are about 60 seconds apart.) 
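
 One possible way to throttle creation, assuming rounds run roughly every 60 seconds as noted above (all names here are hypothetical, not an existing API): track recent create calls and defer any surplus to a later round. 

 <pre> 
import time

class CreateThrottle:
    """Allow at most `max_creates` cloud node-create calls per `period`
    seconds; surplus requests are deferred to a later round."""

    def __init__(self, max_creates=10, period=60.0):
        self.max_creates = max_creates
        self.period = period
        self._timestamps = []  # times of recent create calls

    def allowed(self, requested):
        now = time.time()
        # Drop timestamps that have aged out of the window.
        self._timestamps = [t for t in self._timestamps if now - t < self.period]
        budget = max(0, self.max_creates - len(self._timestamps))
        granted = min(requested, budget)
        self._timestamps.extend([now] * granted)
        return granted

# Example: only boot as many nodes this round as the throttle allows.
# throttle = CreateThrottle()
# to_boot = throttle.allowed(nodes_to_boot(pending_jobs))
 </pre> 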
