Bug #11190

Containers seem to run more than once, which isn't supposed to happen

Added by Tom Clegg over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Crunch
Target version:
Start date:
03/01/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description


Subtasks

Task #11263: ReviewResolvedPeter Amstutz


Related issues

Related to Arvados - Bug #11166: [Crunch2] crunchrun.go should avoid name collisions when creating log collectionsResolved2017-02-24

Related to Arvados - Bug #11220: [SDKs] Fix misleading arv-mount/pysdk error messages by removing obsolete "fetch manifest from Keep" codeResolved2017-10-20

Related to Arvados - Bug #11561: [API] Limit number of lock/unlock cycles for a given containerNew2017-04-26

History

#2 Updated by Tom Clegg over 1 year ago

2017-03-01_17:27:35.24495 2017/03/01 17:27:35 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:27:35.24517 2017/03/01 17:27:35 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:27:35.35069 2017/03/01 17:27:35 sbatch succeeded: "Submitted batch job 2948" 
2017-03-01_17:27:35.35071 2017/03/01 17:27:35 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:37.15184 2017/03/01 17:29:37 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:42.32428 2017/03/01 17:29:42 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:46.97205 2017/03/01 17:29:46 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:51.83317 2017/03/01 17:29:51 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:56.42094 2017/03/01 17:29:56 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:57.89127 2017/03/01 17:29:57 Done monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:30:01.25862 2017/03/01 17:30:01 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:30:01.25865 2017/03/01 17:30:01 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:30:01.32075 2017/03/01 17:30:01 sbatch succeeded: "Submitted batch job 2949" 
2017-03-01_17:30:01.32077 2017/03/01 17:30:01 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:30:06.85462 2017/03/01 17:30:06 Dispatcher says container tb05z-dz642-eie1eal1059y9bb is done: cancel slurm job
2017-03-01_17:30:07.23672 2017/03/01 17:30:07 container tb05z-dz642-eie1eal1059y9bb is still in squeue after scancel
2017-03-01_17:30:13.53918 2017/03/01 17:30:13 Done monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:31:02.73009 2017/03/01 17:31:02 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:31:02.73013 2017/03/01 17:31:02 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:31:02.76251 2017/03/01 17:31:02 sbatch succeeded: "Submitted batch job 2950" 
2017-03-01_17:31:02.76253 2017/03/01 17:31:02 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:32:35.91008 2017/03/01 17:32:35 Done monitoring container tb05z-dz642-eie1eal1059y9bb

#3 Updated by Tom Morris over 1 year ago

  • Target version set to 2017-03-29 sprint

#4 Updated by Tom Clegg over 1 year ago

  • Category set to Crunch
  • Assigned To set to Tom Clegg

#5 Updated by Tom Clegg about 1 year ago

  • Target version changed from 2017-03-29 sprint to 2017-04-12 sprint

#6 Updated by Peter Amstutz about 1 year ago

I wonder if we should move the state transition to "Running" as soon as crunch-run has starting doing anything substantive. E.g. if it fails to load the Docker image, that shouldn't shouldn't put it back into Locked state, that should go Running->Cancelled.

#7 Updated by Tom Clegg about 1 year ago

  • Target version changed from 2017-04-12 sprint to 2017-04-26 sprint

#8 Updated by Tom Clegg about 1 year ago

Allowing multiple dispatch attempts is a deliberate feature: when the dispatch/startup infrastructure fails early enough that it's absolutely certain the container has never been started, we don't count an "attempt" against a container request.

Currently there is no limit on the number of lock-attempt-unlock cycles, though. We should have a site-configurable limit. This counter doesn't have to be visible to anyone except the api server, although it would be useful to expose it to admin clients for troubleshooting purposes.

#9 Updated by Tom Morris about 1 year ago

  • Target version changed from 2017-04-26 sprint to 2017-05-10 sprint

#10 Updated by Tom Clegg about 1 year ago

  • Status changed from New to Resolved

Also available in: Atom PDF