Project

General

Profile

Actions

Bug #14115

open

c-d-s reniceAll takes a very long time and holds up SqueueChecker updates and initial job submission

Added by Joshua Randall over 5 years ago. Updated about 2 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
Story points:
-
Release:
Release relationship:
Auto

Description

reniceAll is taking many many times longer than the PollPeriod. see this partial c-d-s log output with added debug prints:

Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.start goroutine loop: check() took 2.149983887s
Aug 23 22:50:00 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:00 SqueueChecker.start goroutine loop: reniceAll() took 1.16567ms
Aug 23 22:50:10 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:10 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:50:10 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:10 SqueueChecker.start goroutine loop: check() took 2.160180571s
Aug 23 22:50:28 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:28 SqueueChecker.start goroutine loop: reniceAll() took 17.809860522s
Aug 23 22:50:28 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:28 SqueueChecker.start goroutine loop: consuming ticker without update
Aug 23 22:50:29 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:29 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:50:29 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:50:29 SqueueChecker.start goroutine loop: check() took 1.109416296s
Aug 23 22:54:16 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:16 SqueueChecker.start goroutine loop: reniceAll() took 3m46.741228235s
Aug 23 22:54:16 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:16 SqueueChecker.start goroutine loop: consuming ticker without update
Aug 23 22:54:18 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:18 SqueueChecker.check(): calling sqc.notify.Broadcast()
Aug 23 22:54:19 arvados-master-eglyx crunch-dispatch-slurm[5633]: 2018/08/23 22:54:18 SqueueChecker.start goroutine loop: check() took 391.544338ms

The comment at the top of reniceAll() admits that it is slow:

        // This is slow (it shells out to scontrol many times) and no
        // other goroutines update sqc.queue or any of the job fields
        // we use here, so we don't acquire a lock.

The comment was introduced recently along with removing the explicit locking in https://github.com/curoverse/arvados/commit/23574f4d4039641f9a9f624d3b16b64148de3826

The commit message there is: "13399: Don't block job submissions during release/renice batches." and although I cannot view issue 13399 the message suggests this is an attempt to stop reniceAll from blocking submission of jobs to SLURM. However, reniceAll still blocks the SqueueChecker main loop goroutine from getting to the next iteration of check, which in turn blocks the sqc.notify.BroadCast, so all goroutines that enter sqc.notify.Wait() while reniceAll is running will be blocked until it completes (and the next check completes). I would note that includes HasUUID() which is called at the beginning of runContainer, so it will delay the submission of the container until reniceAll returns.

Actions

Also available in: Atom PDF