Bug #13961

[nodemanager] Be quicker applying fixup for node features

Added by Peter Amstutz over 5 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assigned To: Peter Amstutz
Category: -
Target version: -
Story points: -

Description

Nodes are spending too much time with node feature "(null)" and cannot be scheduled.

Node manager is responsible for fixing this up.

Increasing the frequency of node manager's node list polling will make the fixup happen more often.

We also end up with a bunch of jobs with priority 0.

Crunch-dispatch-slurm needs to be better at unblocking these jobs.
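For illustration, a rough sketch of the kind of fixup involved (not the actual arvnodeman code): list SLURM nodes whose features have gone "(null)" and reassert a feature string with scontrol so constrained jobs can schedule again. The feature value used here is a made-up example.

    # Hypothetical sketch only, not the actual arvnodeman fixup code.
    # It lists SLURM nodes whose features read "(null)" and reasserts a
    # feature string via scontrol so constrained jobs can schedule again.
    import subprocess

    def nodes_with_null_features():
        """Yield SLURM node names that currently report no features."""
        out = subprocess.run(
            ["sinfo", "--noheader", "-o", "%n %f"],
            capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            name, _, features = line.partition(" ")
            if features.strip() in ("", "(null)"):
                yield name

    def reapply_features(node_name, features):
        """Reassert the node's feature list with scontrol."""
        subprocess.run(
            ["scontrol", "update", "NodeName=" + node_name, "Features=" + features],
            check=True)

    if __name__ == "__main__":
        # "instancetype=m4.large" is an illustrative value only.
        for node in nodes_with_null_features():
            reapply_features(node, "instancetype=m4.large")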


Files

squeue_stats.py (1.88 KB) - Peter Amstutz, 08/03/2018 04:51 PM

Subtasks 1 (0 open, 1 closed)

Task #13962: Review 13961-separate-polling (Resolved, Peter Amstutz, 08/03/2018)
#1

Updated by Peter Amstutz over 5 years ago

  • Status changed from New to In Progress
#2

Updated by Peter Amstutz over 5 years ago

  • File squeue_stats.py added
  • Subject changed from [nodemanager] Excessive idle times to [nodemanager] Be quicker applying fixup for node features
  • Description updated (diff)
#3

Updated by Peter Amstutz over 5 years ago

  • Description updated (diff)
#4

Updated by Lucas Di Pentima over 5 years ago

Reviewing 13961-separate-polling - da14703fb4e1a249f47685b29310c4c69441ff08

On services/nodemanager/arvnodeman/launcher.py, lines 86, 87 & 88 could use the poll_time variable from line 83 as a fallback.
Also, the new values could be documented in the config file examples.
The rest LGTM.
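For illustration only, a minimal configparser sketch of the suggested fallback; the section and option names here are assumptions, not the actual launcher.py code.

    # Illustration only; section and option names are assumptions, not the
    # actual launcher.py code.
    import configparser

    config = configparser.ConfigParser()
    config.read("node-manager.ini")  # path is illustrative

    poll_time = config.getint("Daemon", "poll_time", fallback=60)
    # The review suggestion: default each specific interval to poll_time.
    node_poll_time = config.getint("Daemon", "node_poll_time", fallback=poll_time)
    wishlist_poll_time = config.getint("Daemon", "wishlist_poll_time", fallback=poll_time)
    cloud_poll_time = config.getint("Daemon", "cloud_poll_time", fallback=poll_time)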

#6

Updated by Peter Amstutz over 5 years ago

Major problems causing long wait times for jobs:

  1. Node features go "(null)" after a reconfigure, and nodes can't be scheduled until this is fixed (node manager does the fixup)
  2. Jobs get held and can't be scheduled (crunch-dispatch-slurm does "scontrol release")
  3. Jobs fall from their initial queue position (in the 4000000000 range) to a small (4-digit) queue position, which makes them wait longer than intended to run
  4. When an instance type isn't available, nothing else will be scheduled until the job at the head of the line gets scheduled

Fixes:

  1. Shorten the node poll time so nodes spend less time misconfigured
  2. Shorten the job poll time (?) to unblock held jobs sooner
  3. Upgrade crunch-dispatch-slurm for a possible bugfix
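A small monitoring sketch in the spirit of the attached squeue_stats.py (it does not reproduce the attachment): count pending jobs that look stuck, i.e. held/priority 0 or with an unusually low priority. The threshold and reason matching are assumptions for illustration.

    # Illustrative monitoring sketch; does not reproduce the attached file.
    import subprocess
    from collections import Counter

    LOW_PRIORITY_THRESHOLD = 20000  # assumption; compare the 20K threshold in note #8

    def queue_stats():
        """Count pending jobs that look stuck: held/priority 0, or unusually low priority."""
        out = subprocess.run(
            ["squeue", "--noheader", "-o", "%i %t %Q %r"],
            capture_output=True, text=True, check=True).stdout
        stats = Counter()
        for line in out.splitlines():
            jobid, state, priority, reason = line.split(None, 3)
            if state != "PD":
                continue
            stats["pending"] += 1
            if priority == "0" or reason.startswith("JobHeld"):
                stats["held"] += 1
            elif int(priority) < LOW_PRIORITY_THRESHOLD:
                stats["low_priority"] += 1
        return stats

    if __name__ == "__main__":
        print(queue_stats())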
#8

Updated by Tom Clegg over 5 years ago

"Jobs fall from their initial queue position (in the 4000000000 range) to a small (4-digit) queue position"

See #13399#note-33 for a fix that should (at least) keep this from blocking the priority adjustment of other jobs.

Meanwhile, 9a80d15b7cab21efe16ec2b543dfb566bea9def4 from #13399 changed the "scontrol release" threshold from 0 to 20K, so it's possible even the current version of c-d-s corrects the condition eventually.
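crunch-dispatch-slurm itself is written in Go; the Python sketch below only illustrates the threshold idea described here (release pending jobs whose priority has fallen below a cutoff, rather than only those at exactly 0) and is not the actual c-d-s code.

    # Python illustration of the threshold idea only; not the actual
    # crunch-dispatch-slurm (Go) code.
    import subprocess

    RELEASE_THRESHOLD = 20000  # previously 0, per the commit referenced above

    def release_low_priority_jobs():
        """Release pending jobs whose SLURM priority has fallen below the threshold."""
        out = subprocess.run(
            ["squeue", "--noheader", "-o", "%i %t %Q"],
            capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            jobid, state, priority = line.split()
            if state == "PD" and int(priority) < RELEASE_THRESHOLD:
                subprocess.run(["scontrol", "release", jobid], check=True)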

#10

Updated by Tom Morris over 5 years ago

  • Assigned To set to Peter Amstutz
#11

Updated by Peter Amstutz over 5 years ago

Experimentally, even while it is polling on a shorter cycle, it still seems to take a while to fix things. One possibility is that the "update actor" within node manager is getting backlogged. That could also explain why the wrong instance types are sometimes applied to nodes.
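One way to test the backlog hypothesis would be to timestamp updates as they are queued and log the lag when they are finally applied. The threaded sketch below is hypothetical and stands in for the pykka actor; it is not the actual arvnodeman code.

    # Hypothetical stand-in for the pykka update actor; not arvnodeman code.
    import queue
    import threading
    import time

    class UpdateWorker(threading.Thread):
        """Apply queued updates one at a time and warn when the queue lags."""

        def __init__(self, lag_warning=5.0):
            super().__init__(daemon=True)
            self.inbox = queue.Queue()
            self.lag_warning = lag_warning  # seconds; threshold is illustrative

        def submit(self, update):
            """Enqueue a callable along with the time it was requested."""
            self.inbox.put((time.monotonic(), update))

        def run(self):
            while True:
                enqueued_at, update = self.inbox.get()
                lag = time.monotonic() - enqueued_at
                if lag > self.lag_warning:
                    print("update backlog: %.1fs behind, %d still queued"
                          % (lag, self.inbox.qsize()))
                update()  # e.g. re-run the scontrol feature fixup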

#12

Updated by Tom Morris over 5 years ago

  • Target version changed from 2018-08-15 Sprint to 2018-09-05 Sprint
#13

Updated by Tom Morris over 5 years ago

  • Target version changed from 2018-09-05 Sprint to Arvados Future Sprints
#14

Updated by Peter Amstutz almost 3 years ago

  • Target version deleted (Arvados Future Sprints)
#15

Updated by Peter Amstutz almost 3 years ago

  • Status changed from In Progress to Closed