Bug #11156

[Crunch2] 9tee4-xvhdp-0m3a4ys52od7dhs was stuck until c-d-s was restarted

Added by Ward Vandewege 28 days ago. Updated 27 days ago.

Status: Resolved
Priority: Normal
Assignee: Tom Clegg
% Done: 100%
Category: Crunch
Target version: 2017-03-01 sprint
Start date:
Due date:
Story points: -
Velocity based estimate: -

Description

9tee4-xvhdp-0m3a4ys52od7dhs was stuck in the running state for almost 16 hours, with one container request (step1_4) stuck in the 'queued' state, until I restarted c-d-s.

Here's the c-d-s log that shows it doing what is (presumably?) the right thing after restart:

...
2017-02-23_00:06:35.15054 2017/02/23 00:06:35 sbatch succeeded: "Submitted batch job 2285" 
2017-02-23_00:06:35.15055 2017/02/23 00:06:35 Start monitoring container 9tee4-dz642-cdtpbg58mkd5lcn
2017-02-23_00:06:43.93135 2017/02/23 00:06:43 Dispatcher says container 9tee4-dz642-cdtpbg58mkd5lcn is done: cancel slurm job
2017-02-23_00:06:44.13764 2017/02/23 00:06:44 container 9tee4-dz642-cdtpbg58mkd5lcn is still in squeue after scancel
2017-02-23_00:06:45.13795 2017/02/23 00:06:45 Dispatcher says container 9tee4-dz642-cdtpbg58mkd5lcn is done: cancel slurm job
2017-02-23_00:06:45.66892 2017/02/23 00:06:45 Done monitoring container 9tee4-dz642-cdtpbg58mkd5lcn
2017-02-23_00:06:53.14568 2017/02/23 00:06:53 Submitting container 9tee4-dz642-52ftkz1vjpvaz88 to slurm
2017-02-23_00:06:53.14586 2017/02/23 00:06:53 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-52ftkz1vjpvaz88" "--mem-per-cpu=1024" "--cpus-per-task=1"]
2017-02-23_00:06:53.16030 2017/02/23 00:06:53 sbatch succeeded: "Submitted batch job 2286" 
2017-02-23_00:06:53.16033 2017/02/23 00:06:53 Start monitoring container 9tee4-dz642-52ftkz1vjpvaz88
2017-02-23_00:07:00.93306 2017/02/23 00:07:00 Dispatcher says container 9tee4-dz642-52ftkz1vjpvaz88 is done: cancel slurm job
2017-02-23_00:07:01.13661 2017/02/23 00:07:01 container 9tee4-dz642-52ftkz1vjpvaz88 is still in squeue after scancel
2017-02-23_00:07:02.13691 2017/02/23 00:07:02 Dispatcher says container 9tee4-dz642-52ftkz1vjpvaz88 is done: cancel slurm job
2017-02-23_00:07:02.63899 2017/02/23 00:07:02 Dispatcher says container 9tee4-dz642-52ftkz1vjpvaz88 is done: cancel slurm job
2017-02-23_00:07:03.17184 2017/02/23 00:07:03 Done monitoring container 9tee4-dz642-52ftkz1vjpvaz88
2017-02-23_00:07:11.13969 2017/02/23 00:07:11 Submitting container 9tee4-dz642-1nz5151wg8aslnc to slurm
2017-02-23_00:07:11.14256 2017/02/23 00:07:11 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-1nz5151wg8aslnc" "--mem-per-cpu=1024" "--cpus-per-task=1"]
2017-02-23_00:07:11.15167 2017/02/23 00:07:11 sbatch succeeded: "Submitted batch job 2287" 
2017-02-23_00:07:11.15170 2017/02/23 00:07:11 Start monitoring container 9tee4-dz642-1nz5151wg8aslnc
2017-02-23_00:07:16.14010 2017/02/23 00:07:16 Submitting container 9tee4-dz642-bbv405og8s0h81k to slurm
2017-02-23_00:07:16.14031 2017/02/23 00:07:16 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-bbv405og8s0h81k" "--mem-per-cpu=1024" "--cpus-per-task=1"]
2017-02-23_00:07:16.16158 2017/02/23 00:07:16 sbatch succeeded: "Submitted batch job 2288" 
2017-02-23_00:07:16.16160 2017/02/23 00:07:16 Start monitoring container 9tee4-dz642-bbv405og8s0h81k
2017-02-23_00:07:16.63706 2017/02/23 00:07:16 Submitting container 9tee4-dz642-mr9hwu01760pv47 to slurm
2017-02-23_00:07:16.63720 2017/02/23 00:07:16 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-mr9hwu01760pv47" "--mem-per-cpu=1024" "--cpus-per-task=1"]
2017-02-23_00:07:16.65033 2017/02/23 00:07:16 sbatch succeeded: "Submitted batch job 2289" 
2017-02-23_00:07:16.65036 2017/02/23 00:07:16 Start monitoring container 9tee4-dz642-mr9hwu01760pv47
2017-02-23_00:07:22.16556 2017/02/23 00:07:22 Done monitoring container 9tee4-dz642-bbv405og8s0h81k
2017-02-23_00:07:25.17040 2017/02/23 00:07:25 Done monitoring container 9tee4-dz642-mr9hwu01760pv47
2017-02-23_16:05:09.12508 Stopping crunch-dispatch-slurm
2017-02-23_16:05:09.12872 Starting crunch-dispatch-slurm from /etc/sv/crunch-dispatch-slurm
2017-02-23_16:05:09.26425 2017/02/23 16:05:09 Start monitoring container 9tee4-dz642-1nz5151wg8aslnc
2017-02-23_16:05:09.75554 2017/02/23 16:05:09 debug: runner is handling updates slowly, discarded previous update for 9tee4-dz642-2a5iga7lyeg2rog
2017-02-23_16:05:09.77524 2017/02/23 16:05:09 Submitting container 9tee4-dz642-2a5iga7lyeg2rog to slurm
2017-02-23_16:05:09.77537 2017/02/23 16:05:09 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-2a5iga7lyeg2rog" "--mem-per-cpu=1024" "--cpus-per-task=1"]
2017-02-23_16:05:09.78927 2017/02/23 16:05:09 sbatch succeeded: "Submitted batch job 2290" 
2017-02-23_16:05:09.78929 2017/02/23 16:05:09 Start monitoring container 9tee4-dz642-2a5iga7lyeg2rog
2017-02-23_16:05:15.75276 2017/02/23 16:05:15 Dispatcher says container 9tee4-dz642-2a5iga7lyeg2rog is done: cancel slurm job
2017-02-23_16:05:15.78077 2017/02/23 16:05:15 container 9tee4-dz642-2a5iga7lyeg2rog is still in squeue after scancel
2017-02-23_16:05:16.78107 2017/02/23 16:05:16 Dispatcher says container 9tee4-dz642-2a5iga7lyeg2rog is done: cancel slurm job
2017-02-23_16:05:17.27646 2017/02/23 16:05:17 Dispatcher says container 9tee4-dz642-2a5iga7lyeg2rog is done: cancel slurm job
2017-02-23_16:05:17.77485 2017/02/23 16:05:17 Dispatcher says container 9tee4-dz642-2a5iga7lyeg2rog is done: cancel slurm job
2017-02-23_16:05:18.30208 2017/02/23 16:05:18 Done monitoring container 9tee4-dz642-2a5iga7lyeg2rog
2017-02-23_16:05:34.24789 2017/02/23 16:05:34 Dispatcher says container 9tee4-dz642-1nz5151wg8aslnc is done: cancel slurm job
2017-02-23_16:05:34.27596 2017/02/23 16:05:34 container 9tee4-dz642-1nz5151wg8aslnc is still in squeue after scancel
2017-02-23_16:05:35.27622 2017/02/23 16:05:35 Dispatcher says container 9tee4-dz642-1nz5151wg8aslnc is done: cancel slurm job
2017-02-23_16:05:35.79241 2017/02/23 16:05:35 Dispatcher says container 9tee4-dz642-1nz5151wg8aslnc is done: cancel slurm job
2017-02-23_16:05:36.27512 2017/02/23 16:05:36 Dispatcher says container 9tee4-dz642-1nz5151wg8aslnc is done: cancel slurm job
2017-02-23_16:05:36.80345 2017/02/23 16:05:36 Done monitoring container 9tee4-dz642-1nz5151wg8aslnc

Associated revisions

Revision 19dcf1ed
Added by Tom Clegg 28 days ago

11156: Fix infinite loop condition.

closes #11156

History

#1 Updated by Ward Vandewege 28 days ago

  • Subject changed from [Crunch2] 9tee4-xvhdp-0m3a4ys52od7dhs was stuck in queued state until c-d-s was restarted to [Crunch2] 9tee4-xvhdp-0m3a4ys52od7dhs was stuck until c-d-s was restarted
  • Description updated (diff)

#2 Updated by Ward Vandewege 28 days ago

  • Description updated (diff)

#3 Updated by Tom Clegg 28 days ago

Looks like 9tee4-dz642-2a5iga7lyeg2rog was queued the whole time, but c-d-s didn't pick it up until after restart for some reason.

#4 Updated by Tom Clegg 28 days ago

c-d-s was stuck in an infinite loop due to a bad condition in its "get next page" loop. It just kept repeating the same containers#list API call with "offset":2 until restart.

{"method":"GET","path":"/arvados/v1/containers","format":"html","controller":"arvados/v1/containers","action":"index","status":200,"duration":12.49,"view":0.42,"db":6.68,"params":{"filters":"[[\"state\",\"=\",\"Queued\"],[\"priority\",\"\\u003e\",\"0\"],[\"uuid\",\"not in\",[\"9tee4-dz642-1nz5151wg8aslnc\"]]]","offset":"2","order":"[\"priority desc\"]"},"@timestamp":"2017-02-23T00:07:16Z","@version":"1","message":"[200] GET /arvados/v1/containers (arvados/v1/containers#index)"}
{"method":"GET","path":"/arvados/v1/containers","format":"html","controller":"arvados/v1/containers","action":"index","status":200,"duration":11.92,"view":0.74,"db":6.9,"params":{"filters":"[[\"state\",\"=\",\"Queued\"],[\"priority\",\"\\u003e\",\"0\"],[\"uuid\",\"not in\",[\"9tee4-dz642-1nz5151wg8aslnc\"]]]","offset":"2","order":"[\"priority desc\"]"},"@timestamp":"2017-02-23T00:07:16Z","@version":"1","message":"[200] GET /arvados/v1/containers (arvados/v1/containers#index)"}
{"method":"GET","path":"/arvados/v1/containers","format":"html","controller":"arvados/v1/containers","action":"index","status":200,"duration":10.84,"view":0.27,"db":6.49,"params":{"filters":"[[\"state\",\"=\",\"Queued\"],[\"priority\",\"\\u003e\",\"0\"],[\"uuid\",\"not in\",[\"9tee4-dz642-1nz5151wg8aslnc\"]]]","offset":"2","order":"[\"priority desc\"]"},"@timestamp":"2017-02-23T00:07:16Z","@version":"1","message":"[200] GET /arvados/v1/containers (arvados/v1/containers#index)"}
{"method":"GET","path":"/arvados/v1/containers","format":"html","controller":"arvados/v1/containers","action":"index","status":200,"duration":11.27,"view":0.34,"db":6.83,"params":{"filters":"[[\"state\",\"=\",\"Queued\"],[\"priority\",\"\\u003e\",\"0\"],[\"uuid\",\"not in\",[\"9tee4-dz642-1nz5151wg8aslnc\"]]]","offset":"2","order":"[\"priority desc\"]"},"@timestamp":"2017-02-23T00:07:16Z","@version":"1","message":"[200] GET /arvados/v1/containers (arvados/v1/containers#index)"}
...
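The failure mode above is a pagination loop that never advances its offset, so the same page is requested forever. As a rough illustration only (not the actual c-d-s code; `item`, `page`, and `listAll` are hypothetical stand-ins for the containers#list client), a correct "get next page" loop advances the offset by the number of items actually received and terminates on an empty page:

```go
package main

import "fmt"

// item stands in for one queued container record.
type item struct{ UUID string }

// page simulates one containers#list response: a window of the full
// queue selected by offset/limit, plus the total item count.
func page(all []item, offset, limit int) (items []item, total int) {
	total = len(all)
	if offset > total {
		offset = total
	}
	end := offset + limit
	if end > total {
		end = total
	}
	return all[offset:end], total
}

// listAll pages through the whole queue. The essential points that
// prevent the #11156-style infinite loop: the offset is advanced by
// len(items) after every response (it can never be re-requested
// unchanged), and the loop exits when a page comes back empty or the
// offset reaches the reported total.
func listAll(all []item, limit int) []item {
	var out []item
	offset := 0
	for {
		items, total := page(all, offset, limit)
		if len(items) == 0 {
			break // past the end of the queue: done
		}
		out = append(out, items...)
		offset += len(items) // always make progress
		if offset >= total {
			break
		}
	}
	return out
}

func main() {
	queue := []item{{"q1"}, {"q2"}, {"q3"}, {"q4"}, {"q5"}}
	got := listAll(queue, 2)
	fmt.Println(len(got)) // prints 5
}
```

With a broken condition (e.g. never incrementing `offset`, or comparing against the wrong bound), the same window is fetched on every iteration, which matches the repeated `"offset":"2"` requests in the API log above.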

#5 Updated by Tom Clegg 28 days ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:19dcf1ed20214b8349c76037ba8679db07bf44fc.

#6 Updated by Tom Clegg 28 days ago

  • Category set to Crunch
  • Assignee set to Tom Clegg
  • Target version set to 2017-03-01 sprint
