[nodemanager] Be quicker applying fixup for node features
Nodes are spending too much time with their node features set to "(null)", during which they cannot be scheduled.
Node manager is responsible for fixing this up.
Increasing the frequency of node manager node list polling will make the fixup happen more often.
We also end up with a bunch of jobs with priority 0.
Crunch-dispatch-slurm needs to be better at unblocking these jobs.
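The fixup described above can be sketched as a pure decision step. This is an illustrative sketch only; the function and field names are assumptions, not the actual node manager code. Given sinfo-style feature strings per node, it produces the `scontrol update` argument lists needed to restore features on nodes stuck at "(null)":

```python
def feature_fixup_commands(current_features, expected_features):
    """Return 'scontrol update' argument lists for nodes whose slurm
    features have reverted to "(null)".

    current_features:  dict of node name -> feature string as reported
                       by sinfo (e.g. "(null)" or "instancetype=...").
    expected_features: dict of node name -> list of features the node
                       should advertise.

    Hypothetical sketch; names are not taken from node manager's code.
    """
    commands = []
    for name, features in current_features.items():
        if features == "(null)" and name in expected_features:
            commands.append([
                "scontrol", "update",
                "NodeName=" + name,
                "Features=" + ",".join(expected_features[name]),
            ])
    return commands
```

Each returned list could then be passed to a subprocess call; more frequent polling simply means this decision runs more often.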
#4 Updated by Lucas Di Pentima over 1 year ago
13961-separate-polling - da14703fb4e1a249f47685b29310c4c69441ff08
In services/nodemanager/arvnodeman/launcher.py, lines 86, 87 & 88, we could use the
poll_time variable from line 83 as a fallback.
Also, the new values could be documented in the config file examples.
The rest LGTM.
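The suggested fallback might look like the following. The option names here are assumptions for illustration, not the actual launcher.py arguments: each per-list poll interval defaults to the general poll_time when not given explicitly.

```python
import argparse

def parse_cli(argv):
    # Hypothetical sketch of the review suggestion: separate poll
    # intervals fall back to --poll-time when not set on the command
    # line. Option names are assumptions, not launcher.py's real ones.
    parser = argparse.ArgumentParser()
    parser.add_argument('--poll-time', type=int, default=60)
    parser.add_argument('--cloudlist-poll-time', type=int, default=None)
    parser.add_argument('--nodelist-poll-time', type=int, default=None)
    parser.add_argument('--joblist-poll-time', type=int, default=None)
    args = parser.parse_args(argv)
    for attr in ('cloudlist_poll_time',
                 'nodelist_poll_time',
                 'joblist_poll_time'):
        if getattr(args, attr) is None:
            setattr(args, attr, args.poll_time)
    return args
```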
#6 Updated by Peter Amstutz over 1 year ago
Major problems causing long wait times for jobs:
- Node features go "(null)" after a reconfigure; nodes can't be scheduled until node manager fixes them up
- Jobs get held, can't be scheduled (crunch-dispatch-slurm does scontrol release)
- Jobs fall from their initial queue position (in the 4000000000 range) to a small (4-digit) queue position (makes them wait longer than intended to run)
- When an instance type isn't available, nothing else will be scheduled until the job at the head of the line gets scheduled
- Shorten poll time so nodes spend less time misconfigured
- Shorter poll time (?) to fix jobs
- Upgrade crunch-dispatch-slurm for possible bugfix
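The "shorten poll time" items above assume the loop really fires at the configured interval. A minimal sketch of an interval-compensating poll loop (generic illustration, not node manager's actual code) subtracts the time the poll itself took, so a shorter configured interval translates directly into more frequent fixups:

```python
import time

def poll_loop(poll_fn, interval,
              clock=time.monotonic, sleep=time.sleep,
              max_iterations=None):
    """Run poll_fn roughly every `interval` seconds, compensating for
    how long poll_fn itself takes. clock/sleep are injectable so the
    loop can be tested without real waiting. Generic sketch only."""
    n = 0
    while max_iterations is None or n < max_iterations:
        start = clock()
        poll_fn()
        elapsed = clock() - start
        sleep(max(0, interval - elapsed))
        n += 1
```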
#8 Updated by Tom Clegg over 1 year ago
Jobs fall from their initial queue position (in the 4000000000 range) to a small (4-digit) queue position
See #13399#note-33 for a fix that should (at least) prevent this from preventing the priority adjustment of other jobs.
Meanwhile, 9a80d15b7cab21efe16ec2b543dfb566bea9def4 from #13399 changed the "scontrol release" threshold from 0 to 20K, so it's possible even the current version of c-d-s corrects the condition eventually.
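The release decision with the raised threshold can be sketched as follows. The field names and function are illustrative assumptions, not the real c-d-s data model; only the 20K threshold comes from the commit mentioned above:

```python
RELEASE_THRESHOLD = 20000  # raised from 0 to 20K per 9a80d15b

def should_release(job):
    """Decide whether a dispatcher should 'scontrol release' a job:
    it is still pending but its slurm priority has fallen below the
    threshold (e.g. a job that dropped from the ~4000000000 range to
    a 4-digit value). Hypothetical sketch; field names are assumed."""
    return (job.get("state") == "PENDING"
            and job.get("priority", 0) < RELEASE_THRESHOLD)
```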
#11 Updated by Peter Amstutz over 1 year ago
So, experimentally, while it is polling on a shorter cycle, it still seems to take a while to fix things. One possibility is that the "update actor" within node manager is getting backlogged. That could also explain why the wrong instance types are sometimes applied to nodes.
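If the backlog hypothesis is right, one mitigation is to coalesce queued updates per node so a slow consumer only ever applies the latest one. This is an illustrative sketch of that idea in plain Python, not node manager's actual actor implementation:

```python
from collections import OrderedDict

class CoalescingQueue:
    """Keep at most one pending update per node: a newer update for the
    same node replaces the stale one instead of queueing behind it, so
    a backlogged consumer never applies an out-of-date instance type.
    Illustrative sketch only."""

    def __init__(self):
        self._pending = OrderedDict()

    def put(self, node_name, update):
        # Drop any stale entry, then append the fresh one at the end.
        self._pending.pop(node_name, None)
        self._pending[node_name] = update

    def get(self):
        # Pop the oldest surviving (node, update) pair.
        node_name, update = next(iter(self._pending.items()))
        del self._pending[node_name]
        return node_name, update

    def __len__(self):
        return len(self._pending)
```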