Project

General

Profile

Bug #11156

Updated by Ward Vandewege about 7 years ago

 
 9tee4-xvhdp-0m3a4ys52od7dhs was stuck in the running state for almost 16 hours, with one job in the 'queued' state, until I restarted c-d-s. 

 Here's the c-d-s log that shows it doing what is (presumably?) the right thing after restart: 

 <pre> 
 ... 
 2017-02-23_00:06:35.15054 2017/02/23 00:06:35 sbatch succeeded: "Submitted batch job 2285" 
 2017-02-23_00:06:35.15055 2017/02/23 00:06:35 Start monitoring container 9tee4-dz642-cdtpbg58mkd5lcn 
 2017-02-23_00:06:43.93135 2017/02/23 00:06:43 Dispatcher says container 9tee4-dz642-cdtpbg58mkd5lcn is done: cancel slurm job 
 2017-02-23_00:06:44.13764 2017/02/23 00:06:44 container 9tee4-dz642-cdtpbg58mkd5lcn is still in squeue after scancel 
 2017-02-23_00:06:45.13795 2017/02/23 00:06:45 Dispatcher says container 9tee4-dz642-cdtpbg58mkd5lcn is done: cancel slurm job 
 2017-02-23_00:06:45.66892 2017/02/23 00:06:45 Done monitoring container 9tee4-dz642-cdtpbg58mkd5lcn 
 2017-02-23_00:06:53.14568 2017/02/23 00:06:53 Submitting container 9tee4-dz642-52ftkz1vjpvaz88 to slurm 
 2017-02-23_00:06:53.14586 2017/02/23 00:06:53 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-52ftkz1vjpvaz88" "--mem-per-cpu=1024" "--cpus-per-task=1"] 
 2017-02-23_00:06:53.16030 2017/02/23 00:06:53 sbatch succeeded: "Submitted batch job 2286" 
 2017-02-23_00:06:53.16033 2017/02/23 00:06:53 Start monitoring container 9tee4-dz642-52ftkz1vjpvaz88 
 2017-02-23_00:07:00.93306 2017/02/23 00:07:00 Dispatcher says container 9tee4-dz642-52ftkz1vjpvaz88 is done: cancel slurm job 
 2017-02-23_00:07:01.13661 2017/02/23 00:07:01 container 9tee4-dz642-52ftkz1vjpvaz88 is still in squeue after scancel 
 2017-02-23_00:07:02.13691 2017/02/23 00:07:02 Dispatcher says container 9tee4-dz642-52ftkz1vjpvaz88 is done: cancel slurm job 
 2017-02-23_00:07:02.63899 2017/02/23 00:07:02 Dispatcher says container 9tee4-dz642-52ftkz1vjpvaz88 is done: cancel slurm job 
 2017-02-23_00:07:03.17184 2017/02/23 00:07:03 Done monitoring container 9tee4-dz642-52ftkz1vjpvaz88 
 2017-02-23_00:07:11.13969 2017/02/23 00:07:11 Submitting container 9tee4-dz642-1nz5151wg8aslnc to slurm 
 2017-02-23_00:07:11.14256 2017/02/23 00:07:11 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-1nz5151wg8aslnc" "--mem-per-cpu=1024" "--cpus-per-task=1"] 
 2017-02-23_00:07:11.15167 2017/02/23 00:07:11 sbatch succeeded: "Submitted batch job 2287" 
 2017-02-23_00:07:11.15170 2017/02/23 00:07:11 Start monitoring container 9tee4-dz642-1nz5151wg8aslnc 
 2017-02-23_00:07:16.14010 2017/02/23 00:07:16 Submitting container 9tee4-dz642-bbv405og8s0h81k to slurm 
 2017-02-23_00:07:16.14031 2017/02/23 00:07:16 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-bbv405og8s0h81k" "--mem-per-cpu=1024" "--cpus-per-task=1"] 
 2017-02-23_00:07:16.16158 2017/02/23 00:07:16 sbatch succeeded: "Submitted batch job 2288" 
 2017-02-23_00:07:16.16160 2017/02/23 00:07:16 Start monitoring container 9tee4-dz642-bbv405og8s0h81k 
 2017-02-23_00:07:16.63706 2017/02/23 00:07:16 Submitting container 9tee4-dz642-mr9hwu01760pv47 to slurm 
 2017-02-23_00:07:16.63720 2017/02/23 00:07:16 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-mr9hwu01760pv47" "--mem-per-cpu=1024" "--cpus-per-task=1"] 
 2017-02-23_00:07:16.65033 2017/02/23 00:07:16 sbatch succeeded: "Submitted batch job 2289" 
 2017-02-23_00:07:16.65036 2017/02/23 00:07:16 Start monitoring container 9tee4-dz642-mr9hwu01760pv47 
 2017-02-23_00:07:22.16556 2017/02/23 00:07:22 Done monitoring container 9tee4-dz642-bbv405og8s0h81k 
 2017-02-23_00:07:25.17040 2017/02/23 00:07:25 Done monitoring container 9tee4-dz642-mr9hwu01760pv47 
 2017-02-23_16:05:09.12508 Stopping crunch-dispatch-slurm 
 2017-02-23_16:05:09.12872 Starting crunch-dispatch-slurm from /etc/sv/crunch-dispatch-slurm 
 2017-02-23_16:05:09.26425 2017/02/23 16:05:09 Start monitoring container 9tee4-dz642-1nz5151wg8aslnc 
 2017-02-23_16:05:09.75554 2017/02/23 16:05:09 debug: runner is handling updates slowly, discarded previous update for 9tee4-dz642-2a5iga7lyeg2rog 
 2017-02-23_16:05:09.77524 2017/02/23 16:05:09 Submitting container 9tee4-dz642-2a5iga7lyeg2rog to slurm 
 2017-02-23_16:05:09.77537 2017/02/23 16:05:09 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-2a5iga7lyeg2rog" "--mem-per-cpu=1024" "--cpus-per-task=1"] 
 2017-02-23_16:05:09.78927 2017/02/23 16:05:09 sbatch succeeded: "Submitted batch job 2290" 
 2017-02-23_16:05:09.78929 2017/02/23 16:05:09 Start monitoring container 9tee4-dz642-2a5iga7lyeg2rog 
 2017-02-23_16:05:15.75276 2017/02/23 16:05:15 Dispatcher says container 9tee4-dz642-2a5iga7lyeg2rog is done: cancel slurm job 
 2017-02-23_16:05:15.78077 2017/02/23 16:05:15 container 9tee4-dz642-2a5iga7lyeg2rog is still in squeue after scancel 
 2017-02-23_16:05:16.78107 2017/02/23 16:05:16 Dispatcher says container 9tee4-dz642-2a5iga7lyeg2rog is done: cancel slurm job 
 2017-02-23_16:05:17.27646 2017/02/23 16:05:17 Dispatcher says container 9tee4-dz642-2a5iga7lyeg2rog is done: cancel slurm job 
 2017-02-23_16:05:17.77485 2017/02/23 16:05:17 Dispatcher says container 9tee4-dz642-2a5iga7lyeg2rog is done: cancel slurm job 
 2017-02-23_16:05:18.30208 2017/02/23 16:05:18 Done monitoring container 9tee4-dz642-2a5iga7lyeg2rog 
 2017-02-23_16:05:34.24789 2017/02/23 16:05:34 Dispatcher says container 9tee4-dz642-1nz5151wg8aslnc is done: cancel slurm job 
 2017-02-23_16:05:34.27596 2017/02/23 16:05:34 container 9tee4-dz642-1nz5151wg8aslnc is still in squeue after scancel 
 2017-02-23_16:05:35.27622 2017/02/23 16:05:35 Dispatcher says container 9tee4-dz642-1nz5151wg8aslnc is done: cancel slurm job 
 2017-02-23_16:05:35.79241 2017/02/23 16:05:35 Dispatcher says container 9tee4-dz642-1nz5151wg8aslnc is done: cancel slurm job 
 2017-02-23_16:05:36.27512 2017/02/23 16:05:36 Dispatcher says container 9tee4-dz642-1nz5151wg8aslnc is done: cancel slurm job 
 2017-02-23_16:05:36.80345 2017/02/23 16:05:36 Done monitoring container 9tee4-dz642-1nz5151wg8aslnc 
 </pre>

Back