Bug #11190
closedContainers seem to run more than once, which isn't supposed to happen
Updated by Tom Clegg almost 8 years ago
2017-03-01_17:27:35.24495 2017/03/01 17:27:35 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm 2017-03-01_17:27:35.24517 2017/03/01 17:27:35 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"] 2017-03-01_17:27:35.35069 2017/03/01 17:27:35 sbatch succeeded: "Submitted batch job 2948" 2017-03-01_17:27:35.35071 2017/03/01 17:27:35 Start monitoring container tb05z-dz642-eie1eal1059y9bb 2017-03-01_17:29:37.15184 2017/03/01 17:29:37 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb 2017-03-01_17:29:42.32428 2017/03/01 17:29:42 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb 2017-03-01_17:29:46.97205 2017/03/01 17:29:46 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb 2017-03-01_17:29:51.83317 2017/03/01 17:29:51 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb 2017-03-01_17:29:56.42094 2017/03/01 17:29:56 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb 2017-03-01_17:29:57.89127 2017/03/01 17:29:57 Done monitoring container tb05z-dz642-eie1eal1059y9bb 2017-03-01_17:30:01.25862 2017/03/01 17:30:01 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm 2017-03-01_17:30:01.25865 2017/03/01 17:30:01 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"] 2017-03-01_17:30:01.32075 2017/03/01 17:30:01 sbatch succeeded: "Submitted batch job 2949" 2017-03-01_17:30:01.32077 2017/03/01 17:30:01 Start monitoring container tb05z-dz642-eie1eal1059y9bb 2017-03-01_17:30:06.85462 2017/03/01 17:30:06 Dispatcher says container tb05z-dz642-eie1eal1059y9bb is done: cancel slurm job 2017-03-01_17:30:07.23672 2017/03/01 17:30:07 container tb05z-dz642-eie1eal1059y9bb is still in squeue after scancel 2017-03-01_17:30:13.53918 2017/03/01 17:30:13 Done monitoring container tb05z-dz642-eie1eal1059y9bb 2017-03-01_17:31:02.73009 2017/03/01 17:31:02 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm 2017-03-01_17:31:02.73013 2017/03/01 17:31:02 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"] 2017-03-01_17:31:02.76251 2017/03/01 17:31:02 sbatch succeeded: "Submitted batch job 2950" 2017-03-01_17:31:02.76253 2017/03/01 17:31:02 Start monitoring container tb05z-dz642-eie1eal1059y9bb 2017-03-01_17:32:35.91008 2017/03/01 17:32:35 Done monitoring container tb05z-dz642-eie1eal1059y9bb
Updated by Tom Morris almost 8 years ago
- Target version set to 2017-03-29 sprint
Updated by Tom Clegg almost 8 years ago
- Category set to Crunch
- Assigned To set to Tom Clegg
Updated by Tom Clegg almost 8 years ago
- Target version changed from 2017-03-29 sprint to 2017-04-12 sprint
Updated by Peter Amstutz almost 8 years ago
I wonder if we should move the state transition to "Running" as soon as crunch-run has starting doing anything substantive. E.g. if it fails to load the Docker image, that shouldn't shouldn't put it back into Locked state, that should go Running->Cancelled.
Updated by Tom Clegg over 7 years ago
- Target version changed from 2017-04-12 sprint to 2017-04-26 sprint
Updated by Tom Clegg over 7 years ago
Allowing multiple dispatch attempts is a deliberate feature: when the dispatch/startup infrastructure fails early enough that it's absolutely certain the container has never been started, we don't count an "attempt" against a container request.
Currently there is no limit on the number of lock-attempt-unlock cycles, though. We should have a site-configurable limit. This counter doesn't have to be visible to anyone except the api server, although it would be useful to expose it to admin clients for troubleshooting purposes.
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-04-26 sprint to 2017-05-10 sprint