Bug #11190

Containers seem to run more than once, which isn't supposed to happen

Added by Tom Clegg 5 months ago. Updated 3 months ago.

Status: Resolved
Start date: 03/01/2017
Priority: Normal
Due date:
Assignee: Tom Clegg
% Done: 100%
Category: Crunch
Target version: 2017-05-10 sprint
Story points: -
Remaining (hours): 0.00 hour
Velocity based estimate: -

Description

Example: tb05z-dz642-eie1eal1059y9bb


Subtasks

Task #11263: Review - Resolved - Peter Amstutz


Related issues

Related to Arvados - Bug #11166: [Crunch2] crunchrun.go should avoid name collisions when ... Resolved 02/24/2017
Related to Arvados - Bug #11220: [SDKs] Fix misleading arv-mount/pysdk error messages by r... New
Related to Arvados - Bug #11561: [API] Limit number of lock/unlock cycles for a given cont... New 04/26/2017

History

#2 Updated by Tom Clegg 5 months ago

2017-03-01_17:27:35.24495 2017/03/01 17:27:35 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:27:35.24517 2017/03/01 17:27:35 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:27:35.35069 2017/03/01 17:27:35 sbatch succeeded: "Submitted batch job 2948" 
2017-03-01_17:27:35.35071 2017/03/01 17:27:35 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:37.15184 2017/03/01 17:29:37 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:42.32428 2017/03/01 17:29:42 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:46.97205 2017/03/01 17:29:46 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:51.83317 2017/03/01 17:29:51 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:56.42094 2017/03/01 17:29:56 debug: runner is handling updates slowly, discarded previous update for tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:29:57.89127 2017/03/01 17:29:57 Done monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:30:01.25862 2017/03/01 17:30:01 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:30:01.25865 2017/03/01 17:30:01 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:30:01.32075 2017/03/01 17:30:01 sbatch succeeded: "Submitted batch job 2949" 
2017-03-01_17:30:01.32077 2017/03/01 17:30:01 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:30:06.85462 2017/03/01 17:30:06 Dispatcher says container tb05z-dz642-eie1eal1059y9bb is done: cancel slurm job
2017-03-01_17:30:07.23672 2017/03/01 17:30:07 container tb05z-dz642-eie1eal1059y9bb is still in squeue after scancel
2017-03-01_17:30:13.53918 2017/03/01 17:30:13 Done monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:31:02.73009 2017/03/01 17:31:02 Submitting container tb05z-dz642-eie1eal1059y9bb to slurm
2017-03-01_17:31:02.73013 2017/03/01 17:31:02 exec sbatch ["sbatch" "--share" "--workdir=/tmp" "--job-name=tb05z-dz642-eie1eal1059y9bb" "--mem-per-cpu=6250" "--cpus-per-task=8"]
2017-03-01_17:31:02.76251 2017/03/01 17:31:02 sbatch succeeded: "Submitted batch job 2950" 
2017-03-01_17:31:02.76253 2017/03/01 17:31:02 Start monitoring container tb05z-dz642-eie1eal1059y9bb
2017-03-01_17:32:35.91008 2017/03/01 17:32:35 Done monitoring container tb05z-dz642-eie1eal1059y9bb

#3 Updated by Tom Morris 5 months ago

  • Target version set to 2017-03-29 sprint

#4 Updated by Tom Clegg 4 months ago

  • Category set to Crunch
  • Assignee set to Tom Clegg

#5 Updated by Tom Clegg 4 months ago

  • Target version changed from 2017-03-29 sprint to 2017-04-12 sprint

#6 Updated by Peter Amstutz 4 months ago

I wonder if we should move the state transition to "Running" as soon as crunch-run has started doing anything substantive. E.g., if it fails to load the Docker image, that shouldn't put the container back into the Locked state; it should go Running->Cancelled.
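A minimal sketch of that ordering, written in Go. This is not the actual crunch-run code: the ContainerAPI interface, the helper names, and the use of plain state strings are assumptions for illustration only.

package sketch

import "fmt"

// ContainerAPI is an assumed, illustrative interface for updating container
// state on the API server; it is not the real Arvados SDK client.
type ContainerAPI interface {
	UpdateState(uuid, state string) error
}

// Runner captures just enough of crunch-run's job to show the proposed
// ordering of state transitions.
type Runner struct {
	API           ContainerAPI
	ContainerUUID string
}

// Run moves the container to Running before any substantive work (such as
// loading the Docker image), so failures after that point finalize as
// Cancelled instead of returning the container to Locked, where the
// dispatcher would pick it up and submit it to slurm again.
func (r *Runner) Run(loadImage, runContainer func() error) error {
	if err := r.API.UpdateState(r.ContainerUUID, "Running"); err != nil {
		// Nothing substantive has happened yet; leaving the container
		// Locked still permits another dispatch attempt.
		return err
	}
	if err := loadImage(); err != nil {
		// Under the proposed change this counts as a real attempt:
		// Running -> Cancelled, not back to Locked.
		r.API.UpdateState(r.ContainerUUID, "Cancelled")
		return fmt.Errorf("loading docker image: %v", err)
	}
	if err := runContainer(); err != nil {
		r.API.UpdateState(r.ContainerUUID, "Cancelled")
		return err
	}
	return r.API.UpdateState(r.ContainerUUID, "Complete")
}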

#7 Updated by Tom Clegg 3 months ago

  • Target version changed from 2017-04-12 sprint to 2017-04-26 sprint

#8 Updated by Tom Clegg 3 months ago

Allowing multiple dispatch attempts is a deliberate feature: when the dispatch/startup infrastructure fails early enough that it's absolutely certain the container has never been started, we don't count an "attempt" against a container request.

Currently there is no limit on the number of lock-attempt-unlock cycles, though. We should have a site-configurable limit. The counter behind that limit doesn't have to be visible to anyone except the API server, although it would be useful to expose it to admin clients for troubleshooting purposes.
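A rough sketch of what such a limit could look like, written in Go for brevity even though the check would live in the API server. The LockCount field and the MaxDispatchAttempts setting are hypothetical names, not the actual schema or configuration keys.

package sketch

import "fmt"

// Container holds only the fields relevant to this sketch; LockCount does
// not exist in the real schema and is an assumption here.
type Container struct {
	UUID      string
	State     string
	LockCount int // incremented on every Queued -> Locked transition
}

// Config stands in for a site-configurable setting; the name
// MaxDispatchAttempts is illustrative.
type Config struct {
	MaxDispatchAttempts int
}

// Lock records another dispatch attempt and cancels the container once the
// configured limit is exceeded, instead of letting it cycle indefinitely.
func Lock(c *Container, cfg Config) error {
	if c.State != "Queued" {
		return fmt.Errorf("cannot lock container in state %q", c.State)
	}
	c.LockCount++
	if cfg.MaxDispatchAttempts > 0 && c.LockCount > cfg.MaxDispatchAttempts {
		c.State = "Cancelled"
		return fmt.Errorf("container %s exceeded %d dispatch attempts",
			c.UUID, cfg.MaxDispatchAttempts)
	}
	c.State = "Locked"
	return nil
}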

#9 Updated by Tom Morris 3 months ago

  • Target version changed from 2017-04-26 sprint to 2017-05-10 sprint

#10 Updated by Tom Clegg 3 months ago

  • Status changed from New to Resolved
