Project

General

Profile

Actions

Bug #13099

closed

crunch-dispatch-slurm occasionally logs a lot of "runner is handling updates slowly" debug messages

Added by Joshua Randall about 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
-
Story points:
-

Description

https://github.com/curoverse/arvados/blob/b51d376ed64efc68f7ee27fd061323da43faabd5/sdk/go/dispatch/dispatch.go#L304

Feb 16 16:14:49 arvados-master-ncucu crunch-dispatch-slurm[7060]: 2018/02/16 16:14:49 Start monitoring container ncucu-dz642-5125q7k7ulb68p7 in state "Locked" 
Feb 16 16:15:14 arvados-master-ncucu crunch-dispatch-slurm[7060]: 2018/02/16 16:15:14 debug: runner is handling updates slowly, discarded previous update for ncucu-dz642-5125q7k7ulb68p7
Feb 16 16:15:14 arvados-master-ncucu crunch-dispatch-slurm[7060]: 2018/02/16 16:15:14 debug: runner is handling updates slowly, discarded previous update for ncucu-dz642-ag8x3fj6kkcigy7
Feb 16 16:15:15 arvados-master-ncucu crunch-dispatch-slurm[7060]: 2018/02/16 16:15:15 debug: runner is handling updates slowly, discarded previous update for ncucu-dz642-5125q7k7ulb68p7
Feb 16 16:15:15 arvados-master-ncucu crunch-dispatch-slurm[7060]: 2018/02/16 16:15:15 debug: runner is handling updates slowly, discarded previous update for ncucu-dz642-ag8x3fj6kkcigy7
Feb 16 16:15:16 arvados-master-ncucu crunch-dispatch-slurm[7060]: 2018/02/16 16:15:16 debug: runner is handling updates slowly, discarded previous update for ncucu-dz642-5125q7k7ulb68p7
Feb 16 16:15:16 arvados-master-ncucu crunch-dispatch-slurm[7060]: 2018/02/16 16:15:16 debug: runner is handling updates slowly, discarded previous update for ncucu-dz642-ag8x3fj6kkcigy7
Actions #1

Updated by Joshua Randall about 6 years ago

We reviewed the code and the situation in which we are getting this message and it appears to occur when slurm is slow to respond to `squeue` invocations, such as when it has just updated node status or immediately after submitting a large number of jobs. Specifically, my hypothesis is that if `squeue` takes around the same amount of time as `PollPeriod`, it will result in this message (which I believe can safely be ignored).

It looks like the process that calls `squeue` first obtains a lock (at https://github.com/curoverse/arvados/blob/master/services/crunch-dispatch-slurm/squeue.go#L78) which seems to have two distinct purposes:

`GetNiceness` is called on every invocation of `renice` (https://github.com/curoverse/arvados/blob/master/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go#L348) which is itself called on every time through the update loop while the container is running with nonzero priority (https://github.com/curoverse/arvados/blob/master/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go#L326). So, if `squeue` takes about as long as `PollPeriod` (or longer) to return, the lock prevents `GetNiceness` from returning which holds the update loop at https://github.com/curoverse/arvados/blob/master/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go#L326 and prevents it from receiving on the channel at https://github.com/curoverse/arvados/blob/master/services/crunch-dispatch-slurm/crunch-dispatch-slurm.go#L318 during that `PollPeriod`, which results in the debug message being printed because there is still an update on the channel that has not been consumed.

Although this message (and discarding the update when `squeue` updates take took long) appears to be harmless, it could probably be prevented by using two separate mutexes rather than having a single one to protect both map accesses and `squeue`/`sbatch`/`scancel` logic.

Actions #2

Updated by Joshua Randall over 5 years ago

Update, it does not seem that this is completely harmless. On our system, when a container is Locked but crunch-dispatch-slurm dies or is restarted before actually calling sbatch to submit the job, when the new crunch-dispatch-slurm runs it can be stuck seemingly forever in a "runner is handling updates slowly" situation, all the time never having actually submitted the container to SLURM.

E.g.

$ arv get eglyx-dz642-f4kxh82m5r1vq2c | jq '.created_at, .modified_at, .state, .priority'
"2018-08-22T16:52:39.993287000Z" 
"2018-08-22T16:56:12.952979000Z" 
"Locked" 
1125898371885864000
$ grep crunch-dispatch-slurm /var/log/syslog | grep eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 16:58:59 arvados-master-eglyx crunch-dispatch-slurm[21774]: 2018/08/22 16:58:59 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:01:10 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:01:09 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:01:49 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:01:49 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:02:25 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:02:25 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:03:04 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:03:04 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:03:47 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:03:47 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:04:25 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:04:25 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:05:01 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:05:01 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:05:43 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:05:43 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:06:23 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:06:23 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:07:02 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:07:02 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:07:48 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:07:48 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:08:34 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:08:33 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:09:13 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:09:13 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:09:56 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:09:55 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:10:44 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:10:44 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:11:32 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:11:31 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:12:15 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:12:15 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:12:57 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:12:57 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:13:45 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:13:45 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:14:35 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:14:35 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
$ squeue -t all -o "%j" | grep eglyx-dz642-f4kxh82m5r1vq2c

Actions #3

Updated by Joshua Randall over 5 years ago

the full story for the previously reported container, including when it actually gets submitted to SLURM and completes:

$ grep crunch-dispatch-slurm /var/log/syslog.1 | grep eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 16:58:59 arvados-master-eglyx crunch-dispatch-slurm[21774]: 2018/08/22 16:58:59 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:01:10 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:01:09 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:01:49 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:01:49 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:02:25 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:02:25 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:03:04 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:03:04 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:03:47 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:03:47 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:04:25 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:04:25 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:05:01 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:05:01 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:05:43 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:05:43 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:06:23 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:06:23 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:07:02 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:07:02 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:07:48 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:07:48 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:08:34 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:08:33 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:09:13 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:09:13 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:09:56 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:09:55 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:10:44 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:10:44 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:11:32 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:11:31 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:12:15 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:12:15 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:12:57 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:12:57 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:13:45 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:13:45 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:14:35 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:14:35 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:15:20 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:15:20 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:16:04 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:16:04 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:16:48 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:16:48 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:17:19 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:17:19 Submitting container eglyx-dz642-f4kxh82m5r1vq2c to slurm
Aug 22 17:17:19 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:17:19 debug: submit getting Lock on sqCheck for container eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:17:40 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:17:40 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:18:34 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:18:34 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:19:26 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:19:26 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:20:17 arvados-master-eglyx crunch-dispatch-slurm[17318]: 2018/08/22 17:20:17 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:22:22 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:22:22 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:23:11 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:23:11 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:24:25 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:24:25 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:26:02 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:26:02 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:27:52 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:27:52 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:31:15 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:31:15 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:33:20 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:33:20 Submitting container eglyx-dz642-f4kxh82m5r1vq2c to slurm
Aug 22 17:33:20 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:33:20 debug: submit getting Lock on sqCheck for container eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:34:08 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:34:07 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:34:30 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:34:30 debug: submit calling sbatchArgs for container eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 17:34:30 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:34:30 running sbatch ["--partition=compute" "--job-name=eglyx-dz642-f4kxh82m5r1vq2c" "--nice=10000" "--mem=3840" "--cpus-per-task=2" "--tmp=0"]
Aug 22 17:34:30 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:34:30 "/usr/bin/sbatch" ["sbatch" "--partition=compute" "--job-name=eglyx-dz642-f4kxh82m5r1vq2c" "--nice=10000" "--mem=3840" "--cpus-per-task=2" "--tmp=0"]: "Submitted batch job 3584263" 
Aug 22 17:34:30 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:34:30 Start monitoring container eglyx-dz642-f4kxh82m5r1vq2c in state "Locked" 
Aug 22 17:51:09 arvados-master-eglyx crunch-dispatch-slurm[4409]: 2018/08/22 17:51:09 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 18:00:58 arvados-master-eglyx crunch-dispatch-slurm[13596]: 2018/08/22 18:00:58 Start monitoring container eglyx-dz642-f4kxh82m5r1vq2c in state "Running" 
Aug 22 18:11:27 arvados-master-eglyx crunch-dispatch-slurm[13596]: 2018/08/22 18:11:27 "/usr/bin/scontrol" ["scontrol" "update" "JobName=eglyx-dz642-f4kxh82m5r1vq2c" "Nice=538603"]: "scontrol: error: Invalid nice value, must be between -10000 and 10000" 
Aug 22 18:11:27 arvados-master-eglyx crunch-dispatch-slurm[13596]: 2018/08/22 18:11:27 container "eglyx-dz642-f4kxh82m5r1vq2c" clamping nice values at 10000, priority order will not be correct -- see https://dev.arvados.org/projects/arvados/wiki/SLURM_integration#Limited-nice-values-SLURM-15
Aug 22 18:59:34 arvados-master-eglyx crunch-dispatch-slurm[13596]: 2018/08/22 18:59:34 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 19:01:26 arvados-master-eglyx crunch-dispatch-slurm[25573]: 2018/08/22 19:01:26 Start monitoring container eglyx-dz642-f4kxh82m5r1vq2c in state "Running" 
Aug 22 19:08:34 arvados-master-eglyx crunch-dispatch-slurm[25573]: 2018/08/22 19:08:34 "/usr/bin/scontrol" ["scontrol" "update" "JobName=eglyx-dz642-f4kxh82m5r1vq2c" "Nice=544789"]: "scontrol: error: Invalid nice value, must be between -10000 and 10000" 
Aug 22 19:08:34 arvados-master-eglyx crunch-dispatch-slurm[25573]: 2018/08/22 19:08:34 container "eglyx-dz642-f4kxh82m5r1vq2c" clamping nice values at 10000, priority order will not be correct -- see https://dev.arvados.org/projects/arvados/wiki/SLURM_integration#Limited-nice-values-SLURM-15
Aug 22 19:18:44 arvados-master-eglyx crunch-dispatch-slurm[25573]: 2018/08/22 19:18:44 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-f4kxh82m5r1vq2c
Aug 22 19:41:13 arvados-master-eglyx crunch-dispatch-slurm[25573]: 2018/08/22 19:41:13 debug: checkForUpdates will fetch 20 containers in batches of 10000 with filters [[uuid in [eglyx-dz642-e6prehf6on142bs eglyx-dz642-nkekzhhw9l4hvbw eglyx-dz642-zmihjb3s49ignta eglyx-dz642-f4kxh82m5r1vq2c eglyx-dz642-0jvy9f3o1oci4ho eglyx-dz642-3kt9aa5v8lipqak eglyx-dz642-oh7rn5pzfeelvyb eglyx-dz642-5tt2mpctjtiwx66 eglyx-dz642-k55a9gkrkfb9abe eglyx-dz642-vlez3r0cey95308 eglyx-dz642-uxpm2k2wm88c7vg eglyx-dz642-jzpj8mz663fpu71 eglyx-dz642-nyrdchy5z8kqbbu eglyx-dz642-mnko3g1btwmwzvc eglyx-dz642-gocdu6x6zpa76lb eglyx-dz642-y369cvp4lb04x04 eglyx-dz642-75imxe8nbd7g8do eglyx-dz642-5y8qgagccgov95i eglyx-dz642-g219satpso8rfpx eglyx-dz642-0bz9zzvqw8jtfy5]]]
Aug 22 19:41:13 arvados-master-eglyx crunch-dispatch-slurm[25573]: 2018/08/22 19:41:13 container eglyx-dz642-f4kxh82m5r1vq2c is done: cancel slurm job
Aug 22 19:48:40 arvados-master-eglyx crunch-dispatch-slurm[25573]: 2018/08/22 19:48:40 debug: checkForUpdates will fetch 20 containers in batches of 10000 with filters [[uuid in [eglyx-dz642-4fafxyhr8imyyud eglyx-dz642-k0g6z120p0cyv6v eglyx-dz642-mvu7xpi5jo1ycqs eglyx-dz642-f4kxh82m5r1vq2c eglyx-dz642-gmt0wkq94dlutos eglyx-dz642-oz7eoxvlimyqiw2 eglyx-dz642-8v27gdbvx7enxog eglyx-dz642-28zjflq28msmisc eglyx-dz642-zc4ltovm4v95cqf eglyx-dz642-g8qp6qqz305uiq9 eglyx-dz642-sti0ee06uq87pn8 eglyx-dz642-0n5j8l95rg7mzer eglyx-dz642-ptv8nztz74w9i56 eglyx-dz642-xywtnl4wdznycnp eglyx-dz642-vpptmj4519ba8xq eglyx-dz642-dasj3k421topxha eglyx-dz642-ihhw81bkkxuqjdy eglyx-dz642-dvq1bw4e2ik0iwo eglyx-dz642-li8hkcjeuwfvyxe eglyx-dz642-mb18i96i719idf5]]]
Aug 22 19:53:15 arvados-master-eglyx crunch-dispatch-slurm[25573]: 2018/08/22 19:53:15 debug: checkForUpdates will fetch 20 containers in batches of 10000 with filters [[uuid in [eglyx-dz642-z2dg5ezfi96atu2 eglyx-dz642-exyepwnj9fz6gbn eglyx-dz642-765htqid2f6c3mi eglyx-dz642-0trdnw658s14uys eglyx-dz642-f4kxh82m5r1vq2c eglyx-dz642-dp970o87unv22da eglyx-dz642-3b54pocbbv25egj eglyx-dz642-80j2ei42dl0ud0d eglyx-dz642-hy5mfmdr1bvapw3 eglyx-dz642-gr3mh9ukannjmjz eglyx-dz642-0kauigvmabjq8o9 eglyx-dz642-8cycvqzz2h2h1e9 eglyx-dz642-ocuih6lpu4nuh8c eglyx-dz642-r57nbar1uptjuec eglyx-dz642-ipqo7fuptni88vv eglyx-dz642-1ni5jz4jow8bzg3 eglyx-dz642-1kthyh9t92dq5tv eglyx-dz642-1z53am5ov6c8346 eglyx-dz642-jvgu32wtjz9klvl eglyx-dz642-4l8lavrfnf0b21x]]]
$ $ arv get eglyx-dz642-f4kxh82m5r1vq2c | jq '.created_at, .modified_at, .started_at, .finished_at'
"2018-08-22T16:52:39.993287000Z" 
"2018-08-22T19:25:16.308186000Z" 
"2018-08-22T17:35:45.500236000Z" 
"2018-08-22T19:25:16.156717000Z" 

A few points to consider:
- why did the c-d-s process started at 17:00:00 not attempt to start the container until 17:17?
- what happened at 17:17:19? It says it is submitting to SLURM but it does not end up calling sbatch to do so at that time.

Actions #7

Updated by Joshua Randall over 5 years ago

have now updated c-d-s to master which includes 0fe48142de81e20967eaac3972e35d115aaaa497 and 23574f4d4039641f9a9f624d3b16b64148de3826

Behaviour seems improved, but I've just caught it again doing the "discarded previous update" as the very first thing that is mentioned, but it only does it once and then actually starts the container a few seconds later, so I guess now it may be safe to ignore?

$ grep eglyx-dz642-ygdqbn2sg4e0pi4 /var/log/syslog
Aug 23 16:10:29 arvados-master-eglyx crunch-dispatch-slurm[11247]: 2018/08/23 16:10:28 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-ygdqbn2sg4e0pi4
Aug 23 16:10:35 arvados-master-eglyx crunch-dispatch-slurm[11247]: 2018/08/23 16:10:35 Submitting container eglyx-dz642-ygdqbn2sg4e0pi4 to slurm
Aug 23 16:10:36 arvados-master-eglyx crunch-dispatch-slurm[11247]: 2018/08/23 16:10:35 running sbatch ["--partition=compute" "--job-name=eglyx-dz642-ygdqbn2sg4e0pi4" "--nice=10000" "--mem=3840" "--cpus-per-task=2" "--tmp=0"]
Aug 23 16:10:59 arvados-master-eglyx crunch-dispatch-slurm[11247]: 2018/08/23 16:10:59 "/usr/bin/sbatch" ["sbatch" "--partition=compute" "--job-name=eglyx-dz642-ygdqbn2sg4e0pi4" "--nice=10000" "--mem=3840" "--cpus-per-task=2" "--tmp=0"]: "Submitted batch job 3743219" 
Aug 23 16:10:59 arvados-master-eglyx crunch-dispatch-slurm[11247]: 2018/08/23 16:10:59 Start monitoring container eglyx-dz642-ygdqbn2sg4e0pi4 in state "Locked" 
Aug 23 16:13:56 arvados-master-eglyx crunch-dispatch-slurm[11247]: 2018/08/23 16:13:56 debug: runner is handling updates slowly, discarded previous update for eglyx-dz642-ygdqbn2sg4e0pi4

Actions #8

Updated by Joshua Randall over 5 years ago

  • Status changed from New to Resolved

The updated c-d-s does appear to have resolved the problem that the sum of Locked + Running containers according to the API server was perpetually greater than the sum of PENDING and RUNNING containers in SLURM. Within two minutes of starting the new c-d-s the numbers became equal! :-)

$ while true; do arv container list -f '[["state","=","Queued"],["priority",">",0]]' -l 0 -c exact | jq -r '.items_available'; arv c
ontainer list -f '[["state","=","Locked"],["priority",">",0]]' -l 0 -c exact | jq -r '.items_available'; arv container list -f '[["state","=","Running"],["priority",">",0]]' -l 0
-c exact | jq -r '.items_available'; ssh arvados-master-eglyx 'squeue -ho "%j" -t PD,R | wc -l'; echo; sleep 20; done
2
41427
2490
32268

0
41386
2473
37333

0
41284
2495
43780

0
41240
2493
43731

0
41171
2507
43676

0
41154
2516
43670

41154 + 2516 = 43670

Actions

Also available in: Atom PDF