Bug #14115

c-d-s reniceAll takes a very long time and holds up SqueueChecker updates and initial job submission

Added by Joshua Randall over 2 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Start date: -
Due date: -
% Done: 0%
Estimated time: -
Story points: -

Description

reniceAll is taking many times longer than the PollPeriod. See this partial c-d-s log output with added debug prints:

Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.start goroutine loop: check() took 2.149983887s
Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.start goroutine loop: reniceAll() took 1.16567ms
Aug 23 22:50:10 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:10 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:50:10 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:10 SqueueChecker.start goroutine loop: check() took 2.160180571s
Aug 23 22:50:28 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:28 SqueueChecker.start goroutine loop: reniceAll() took 17.809860522s
Aug 23 22:50:28 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:28 SqueueChecker.start goroutine loop: consuming ticker without update
Aug 23 22:50:29 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:29 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:50:29 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:29 SqueueChecker.start goroutine loop: check() took 1.109416296s
Aug 23 22:54:16 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:16 SqueueChecker.start goroutine loop: reniceAll() took 3m46.741228235s
Aug 23 22:54:16 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:16 SqueueChecker.start goroutine loop: consuming ticker without update
Aug 23 22:54:18 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:18 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:54:19 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:18 SqueueChecker.start goroutine loop: check() took 391.544338ms

The comment at the top of reniceAll() acknowledges that it is slow:

        // This is slow (it shells out to scontrol many times) and no
        // other goroutines update sqc.queue or any of the job fields
        // we use here, so we don't acquire a lock.

The comment was introduced recently, along with the removal of the explicit locking, in https://github.com/curoverse/arvados/commit/23574f4d4039641f9a9f624d3b16b64148de3826

The commit message there is: "13399: Don't block job submissions during release/renice batches." Although I cannot view issue 13399, the message suggests this was an attempt to stop reniceAll from blocking submission of jobs to SLURM. However, reniceAll still blocks the SqueueChecker main loop goroutine from reaching the next iteration of check(), which in turn delays the sqc.notify.Broadcast(), so every goroutine that enters sqc.notify.Wait() while reniceAll is running is blocked until reniceAll completes (and the next check() completes). Note that this includes HasUUID(), which is called at the beginning of runContainer, so submission of the container is delayed until reniceAll returns.

History

#1 Updated by Joshua Randall over 2 years ago

I do not see why reniceAll needs to be called synchronously during the squeue update. Why can't there be a separate priority-update goroutine that triggers on sqc.notify.Wait(), like all of the runContainer goroutines that rely on the current squeue state?

#2 Updated by Joshua Randall over 2 years ago

By the way, it is over an hour later and the latest log entry is still:

Aug 23 22:54:19 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:18 SqueueChecker.start goroutine loop: check() took 391.544338ms

So it seems that the SqueueChecker loop has been stuck in reniceAll for over an hour.

#3 Updated by Joshua Randall over 2 years ago

I think the strategy that `reniceAll` implements to manage priority is simply not workable with a large number of jobs, at least not on a v15 SLURM cluster.

Other options:
- Only set niceness on the top N (configurable) highest-priority jobs, since priority really only matters for high-priority jobs.
- Drop the entire scheme of managing job priority via niceness and instead set a nice value calculated directly from the user-specified container priority (i.e. something like base niceness - container.priority), so that a job is only ever reniced when the actual container priority changes. Let SLURM handle the other aspects of prioritisation (e.g. by enabling the multifactor priority plugin, which can adjust priority based on job age).
