Bug #13961

[nodemanager] Be quicker applying fixup for node features

Added by Peter Amstutz 4 months ago. Updated 4 months ago.

Status: In Progress
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date: 08/03/2018
Due date: -
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

Nodes are spending too much time with the node feature "(null)" and cannot be scheduled while it is set.

Node manager is responsible for fixing this up.

Increasing the frequency of node manager's node list polling will make the fixup happen more often.

We also end up with a number of jobs stuck at priority 0.

crunch-dispatch-slurm needs to be better at unblocking these jobs.
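The fixup described above can be sketched roughly like this: list the slurm nodes whose feature column reads "(null)" and reapply their features with scontrol. This is an illustrative Python sketch, not node manager's actual code; the desired_features mapping is hypothetical, and the real fixup derives each node's feature list from its cloud instance type.

```python
import subprocess

def parse_null_feature_nodes(sinfo_output):
    """Return node names whose feature column is literally "(null)".

    Expects one "nodename features" pair per line, as printed by:
        sinfo --noheader -N -o '%N %f'
    """
    broken = []
    for line in sinfo_output.splitlines():
        name, _, features = line.partition(' ')
        if features.strip() == '(null)':
            broken.append(name)
    return broken

def fix_null_features(desired_features):
    # desired_features: hypothetical mapping of node name -> feature string.
    out = subprocess.check_output(
        ['sinfo', '--noheader', '-N', '-o', '%N %f'], text=True)
    for node in parse_null_feature_nodes(out):
        # "scontrol update" rewrites the node's feature list in place.
        subprocess.check_call(
            ['scontrol', 'update', 'NodeName=' + node,
             'Features=' + desired_features[node]])
```

The shorter the interval at which something like fix_null_features runs, the less time nodes spend unschedulable, which is the point of this ticket.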

squeue_stats.py (1.88 KB) Peter Amstutz, 08/03/2018 04:51 PM

Subtasks

Task #13962: Review 13961-separate-polling (Resolved, Peter Amstutz)

Associated revisions

Revision 4c393c35
Added by Peter Amstutz 4 months ago

Merge branch '13961-separate-polling' refs #13961

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Peter Amstutz 4 months ago

  • Status changed from New to In Progress

#2 Updated by Peter Amstutz 4 months ago

  • File squeue_stats.py added
  • Subject changed from [nodemanager] Excessive idle times to [nodemanager] Be quicker applying fixup for node features
  • Description updated (diff)

#3 Updated by Peter Amstutz 4 months ago

  • Description updated (diff)

#4 Updated by Lucas Di Pentima 4 months ago

Reviewing 13961-separate-polling - da14703fb4e1a249f47685b29310c4c69441ff08

On services/nodemanager/arvnodeman/launcher.py lines 86, 87 & 88 we could use the poll_time var from line 83 as a fallback.
Also, the new values could be documented in the config file examples.
The rest LGTM.
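The fallback being suggested can be sketched like this (a hedged illustration of the pattern only; the key names below are examples, not the actual launcher.py option names):

```python
def poll_intervals(daemon_conf, pollers=('node_poll_time', 'job_poll_time')):
    # Fall back to the shared poll_time wherever a more specific
    # per-poller interval isn't configured.
    poll_time = float(daemon_conf.get('poll_time', 60))
    return {name: float(daemon_conf.get(name, poll_time)) for name in pollers}
```

With this shape, setting only poll_time keeps all pollers in sync, while any individual interval can still be overridden.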

#6 Updated by Peter Amstutz 4 months ago

Major problems causing long wait times for jobs:

  1. Node features go "(null)" after a reconfigure; nodes can't be scheduled until this gets fixed (node manager does the fixup)
  2. Jobs get held and can't be scheduled (crunch-dispatch-slurm does "scontrol release")
  3. Jobs fall from their initial queue priority (in the 4000000000 range) to a small (4-digit) priority, which makes them wait longer than intended to run
  4. When an instance type isn't available, nothing else will be scheduled until the job at the head of the line gets scheduled

Fixes

  1. Shorten the node list poll time so nodes spend less time misconfigured
  2. Shorten the job queue poll time (?) to fix held jobs
  3. Upgrade crunch-dispatch-slurm for a possible bugfix
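Fix (2) amounts to periodically finding jobs stuck at priority 0 and releasing them, the way crunch-dispatch-slurm's "scontrol release" pass does. A hedged Python sketch of that loop (crunch-dispatch-slurm itself is written in Go; this is just an illustration of the mechanism):

```python
import subprocess

def parse_zero_priority_jobs(squeue_output):
    """Return job IDs whose priority is 0.

    Expects one "jobid priority" pair per line, as printed by:
        squeue --noheader -o '%A %Q'
    """
    held = []
    for line in squeue_output.splitlines():
        jobid, _, prio = line.partition(' ')
        if prio.strip() == '0':
            held.append(jobid)
    return held

def release_held_jobs():
    out = subprocess.check_output(
        ['squeue', '--noheader', '-o', '%A %Q'], text=True)
    for jobid in parse_zero_priority_jobs(out):
        # "scontrol release" clears the hold so the job can be scheduled.
        subprocess.check_call(['scontrol', 'release', jobid])
```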

#8 Updated by Tom Clegg 4 months ago

Jobs fall from their initial queue position (in the 4000000000 range) to a small (4-digit) queue position

See #13399#note-33 for a fix that should (at least) prevent this from preventing the priority adjustment of other jobs.

Meanwhile, 9a80d15b7cab21efe16ec2b543dfb566bea9def4 from #13399 changed the "scontrol release" priority threshold from 0 to 20K, so it's possible even the current version of c-d-s corrects the condition eventually.
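In other words, the change widens the release condition from "priority has hit 0" to "priority has fallen to the 20K range or below". A minimal sketch of that check (the 20K value is from the note above; the names are illustrative, not the actual c-d-s code):

```python
RELEASE_THRESHOLD = 20000  # widened from 0 by the commit referenced above

def should_release(priority):
    # Jobs whose priority has dropped to the threshold or below get
    # an "scontrol release" on the next dispatch pass.
    return priority <= RELEASE_THRESHOLD
```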

#10 Updated by Tom Morris 4 months ago

  • Assigned To set to Peter Amstutz

#11 Updated by Peter Amstutz 4 months ago

Experimentally, even while it is polling on a shorter cycle, it still seems to take a while to fix things. One possibility is that the "update actor" within node manager is getting backlogged. This could also explain the wrong instance types sometimes being applied to nodes.

#12 Updated by Tom Morris 4 months ago

  • Target version changed from 2018-08-15 Sprint to 2018-09-05 Sprint

#13 Updated by Tom Morris 4 months ago

  • Target version changed from 2018-09-05 Sprint to Arvados Future Sprints
