Bug #14115
c-d-s reniceAll takes a very long time and holds up SqueueChecker updates and initial job submission
Description
reniceAll is taking many, many times longer than the PollPeriod. See this partial c-d-s log output with added debug prints:
Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.start goroutine loop: check() took 2.149983887s
Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.start goroutine loop: reniceAll() took 1.16567ms
Aug 23 22:50:10 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:10 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:50:10 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:10 SqueueChecker.start goroutine loop: check() took 2.160180571s
Aug 23 22:50:28 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:28 SqueueChecker.start goroutine loop: reniceAll() took 17.809860522s
Aug 23 22:50:28 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:28 SqueueChecker.start goroutine loop: consuming ticker without update
Aug 23 22:50:29 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:29 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:50:29 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:29 SqueueChecker.start goroutine loop: check() took 1.109416296s
Aug 23 22:54:16 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:16 SqueueChecker.start goroutine loop: reniceAll() took 3m46.741228235s
Aug 23 22:54:16 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:16 SqueueChecker.start goroutine loop: consuming ticker without update
Aug 23 22:54:18 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:18 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:54:19 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:18 SqueueChecker.start goroutine loop: check() took 391.544338ms
The comment at the top of reniceAll() admits that it is slow:
// This is slow (it shells out to scontrol many times) and no
// other goroutines update sqc.queue or any of the job fields
// we use here, so we don't acquire a lock.
That comment was introduced recently, along with the removal of the explicit locking, in https://github.com/curoverse/arvados/commit/23574f4d4039641f9a9f624d3b16b64148de3826
The commit message there is "13399: Don't block job submissions during release/renice batches." Although I cannot view issue #13399, the message suggests this was an attempt to stop reniceAll from blocking submission of jobs to SLURM. However, reniceAll still blocks the SqueueChecker main loop goroutine from reaching the next iteration of check(), which in turn delays the next sqc.notify.Broadcast(), so any goroutine that enters sqc.notify.Wait() while reniceAll is running is blocked until reniceAll returns and the next check() completes. Note that this includes HasUUID(), which is called at the beginning of runContainer, so submission of the container is delayed until reniceAll returns.
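To make the blocking chain concrete, here is a minimal, self-contained toy version of the pattern (not the actual c-d-s source; checker, slowRenice and hasUUID are invented stand-ins for SqueueChecker, reniceAll and HasUUID):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy illustration of the blocking chain described above. A single polling
// goroutine runs check() and then slowRenice() sequentially, and consumers
// park on the same sync.Cond until the next Broadcast().
type checker struct {
	lock   sync.Mutex
	notify *sync.Cond
	queue  map[string]bool
}

func (c *checker) check() {
	c.lock.Lock()
	c.queue = map[string]bool{"zzzzz-dz642-xxxxxxxxxxxxxxx": true}
	c.lock.Unlock()
	c.notify.Broadcast() // wakes every goroutine currently in Wait()
}

func (c *checker) slowRenice() {
	// Stands in for reniceAll()'s many scontrol invocations.
	time.Sleep(3 * time.Second)
}

// hasUUID mirrors the shape of SqueueChecker.HasUUID: wait for the next
// broadcast, then read the queue.
func (c *checker) hasUUID(uuid string) bool {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.notify.Wait() // parked here until the polling loop broadcasts again
	return c.queue[uuid]
}

func main() {
	c := &checker{queue: map[string]bool{}}
	c.notify = sync.NewCond(&c.lock)

	// Polling loop: because slowRenice runs in the same goroutine, the next
	// check()/Broadcast() cannot happen until it returns.
	go func() {
		for {
			c.check()
			c.slowRenice()
		}
	}()

	time.Sleep(200 * time.Millisecond) // let the first check()/Broadcast() go by
	start := time.Now()
	c.hasUUID("zzzzz-dz642-xxxxxxxxxxxxxxx")
	fmt.Printf("hasUUID returned after %v -- it had to wait out slowRenice plus the next check\n",
		time.Since(start))
}
```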
Updated by Joshua Randall over 6 years ago
I do not see why reniceAll needs to be called synchronously during the squeue update. Why can't there be a separate priority-update goroutine that triggers on sqc.notify.Wait(), like all of the runContainer goroutines that rely on the current squeue state?
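Rough sketch of what I mean, assuming the existing sqc.lock, sqc.notify and sqc.reniceAll stay as they are (startRenicer is an invented name, not something in the current code):

```go
// Sketch only: run the renice pass in its own goroutine that wakes up on
// the same broadcast the runContainer goroutines use, instead of calling
// reniceAll() inline in the polling loop.
func (sqc *SqueueChecker) startRenicer() {
	go func() {
		for {
			sqc.lock.Lock()
			sqc.notify.Wait() // wake on each squeue update, like HasUUID does
			sqc.lock.Unlock()
			// The slow scontrol calls now happen off the polling loop's
			// critical path, so check()/Broadcast() keep running every
			// PollPeriod and job submission is no longer held up. If a
			// renice pass outlasts one poll period, this goroutine simply
			// misses intervening broadcasts and catches the next one.
			sqc.reniceAll()
		}
	}()
}
```

One caveat: once reniceAll runs concurrently with check(), it would need to snapshot the relevant queue/priority state under sqc.lock before doing the slow scontrol calls, since the current "we don't acquire a lock" comment relies on nothing else touching sqc.queue while it runs.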
Updated by Joshua Randall over 6 years ago
BTW, it is now over an hour later and the latest log line is still:
Aug 23 22:54:19 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:18 SqueueChecker.start goroutine loop: check() took 391.544338ms
So it seems that the SqueueChecker loop has been stuck in reniceAll for over an hour.
Updated by Joshua Randall over 6 years ago
I think the strategy that `reniceAll` implements to manage priority is just not workable with a large number of jobs, at least not on a v15 slurm cluster.
Other options:
- Only set niceness on the top N (configurable) highest-priority jobs (as priority really only matters for jobs with high priority).
- Drop the entire system of attempting to manage job priority via niceness and instead set a nice value calculated directly from the actual user-specified container priority (i.e. something like base niceness - container.priority), so that a job is only ever reniced when the actual container priority changes (rough sketch below). Let slurm handle the other aspects of prioritisation (e.g. by enabling the multi-factor priority plugin, which can adjust priority based on job age).
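To illustrate the second option, a minimal sketch of the direct mapping (the baseNice constant and the clamping bounds are placeholders, not values taken from slurm or from the existing c-d-s code):

```go
package main

import "fmt"

// Higher container priority => lower nice => scheduled sooner. baseNice
// and the clamping are illustrative placeholders.
const baseNice = 10000

func niceForPriority(containerPriority int64) int64 {
	nice := baseNice - containerPriority
	if nice < 0 {
		nice = 0
	}
	if nice > baseNice {
		nice = baseNice
	}
	return nice
}

func main() {
	// A job would only need a single renice (e.g. one scontrol update of
	// its nice value) when the container's priority changes, instead of a
	// full pass over the whole queue on every poll.
	for _, p := range []int64{1, 500, 1000} {
		fmt.Printf("container priority %4d -> nice %d\n", p, niceForPriority(p))
	}
}
```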