[nodemanager] Be quicker applying fixup for node features
Nodes spend too long with their node features set to "(null)", and cannot be scheduled while in that state.
Node manager is responsible for fixing this up.
Increasing the frequency of node manager node list polling will make the fixup happen more often.
We also end up with a bunch of jobs with priority 0.
Crunch-dispatch-slurm needs to be better at unblocking these jobs.
#4 Updated by Lucas Di Pentima 7 months ago
13961-separate-polling - da14703fb4e1a249f47685b29310c4c69441ff08
In services/nodemanager/arvnodeman/launcher.py, lines 86, 87 & 88 could use the
poll_time var from line 83 as a fallback.
Also, the new values could be documented on the config file examples.
The rest LGTM.
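A minimal sketch of the fallback suggested above, not the actual launcher.py code. It assumes the poll settings live in a [Daemon] section read with Python 3's configparser; the three per-poller option names are hypothetical stand-ins for whatever the branch introduces.

```python
import configparser

# Hypothetical per-poller option names; the real branch may use different ones.
POLLER_OPTIONS = ('cloudlist_poll_time', 'nodelist_poll_time', 'wishlist_poll_time')

def poller_intervals(config):
    """Return each poller's interval, defaulting to the shared poll_time."""
    poll_time = config.getfloat('Daemon', 'poll_time')
    return {
        name: config.getfloat('Daemon', name, fallback=poll_time)
        for name in POLLER_OPTIONS
    }
```

With this shape, an existing config file that only sets poll_time keeps working, and only the pollers that need a shorter cycle have to be listed explicitly.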
#6 Updated by Peter Amstutz 7 months ago
Major problems causing long wait times for jobs:
- Node features go to "(null)" after a reconfigure, and nodes can't be scheduled until they are fixed (node manager does the fixup)
- Jobs get held, can't be scheduled (crunch-dispatch-slurm does scontrol release)
- Jobs fall from their initial queue position (in the 4000000000 range) to a small (4-digit) queue position (makes them wait longer than intended to run)
- An instance type isn't available, nothing else will be scheduled until the job at the head of the line gets scheduled
Possible fixes:
- Shorten poll time so nodes spend less time misconfigured
- Shorter poll time (?) to fix jobs
- Upgrade crunch-dispatch-slurm for possible bugfix
Re: "Jobs fall from their initial queue position (in the 4000000000 range) to a small (4-digit) queue position":
See #13399#note-33 for a fix that should (at least) keep this from blocking the priority adjustment of other jobs.
Meanwhile, 9a80d15b7cab21efe16ec2b543dfb566bea9def4 from #13399 changed the "scontrol release" threshold from 0 to 20K, so it's possible that even the current version of c-d-s eventually corrects the condition.
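To illustrate the threshold change, here is a rough sketch of the selection logic in Python (crunch-dispatch-slurm itself is not written in Python, and this is not its code). The (job_id, priority) input format is hypothetical, and treating the 20K threshold as inclusive is an assumption; the real dispatcher may also check job state or pending reason before releasing.

```python
RELEASE_THRESHOLD = 20000  # the "20K" mentioned above; inclusive comparison is an assumption

def release_commands(jobs, threshold=RELEASE_THRESHOLD):
    """Build one `scontrol release` command per job at or below the threshold.

    jobs: iterable of (job_id, priority) pairs (hypothetical input format).
    Returns argv lists; nothing is executed here.
    """
    return [
        ["scontrol", "release", str(job_id)]
        for job_id, priority in jobs
        if priority <= threshold
    ]
```

With a threshold of 0, only jobs held at priority 0 would be released. At 20K, jobs that have dropped to a 4-digit priority are selected as well, while jobs still in the 4000000000 range are left alone.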
#11 Updated by Peter Amstutz 6 months ago
Experimentally, even though node manager is now polling on a shorter cycle, it still seems to take a while to fix things. One possibility is that the "update actor" within node manager is getting backlogged. This could also explain why the wrong instance types are sometimes applied to nodes.
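A back-of-the-envelope model of the backlog hypothesis, assuming a single worker that handles updates serially. All parameter names and numbers are illustrative and not measured from node manager.

```python
def backlog_after(polls, updates_per_poll, poll_time, seconds_per_update):
    """Estimate queued updates after a number of poll cycles.

    Assumes each poll enqueues `updates_per_poll` updates and one serial
    worker needs `seconds_per_update` seconds for each of them.
    """
    capacity = poll_time / seconds_per_update  # updates handled per poll interval
    backlog = 0.0
    for _ in range(polls):
        backlog = max(0.0, backlog + updates_per_poll - capacity)
    return backlog
```

Under these assumptions, 5 updates per poll at 6 seconds each keeps up with a 60-second cycle. With a 10-second cycle and 5 seconds per update, the worker handles only 2 of the 5 updates per cycle and the queue grows by 3 on every poll, so polling faster would not by itself make fixups land sooner.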