Bug #14596
Updated by Tom Clegg almost 6 years ago
This seems to happen sometimes:

# Container is queued
# c-d-slurm submits a slurm job to run the container
# Something goes wrong -- perhaps the crunch-run process dies, or the slurm node goes down?
# c-d-slurm submits another slurm job, and the same container runs on a different node

If c-d-slurm finds a container with locked_by_uuid=self and state=Running that is not running according to slurm, it should cancel it instead of starting a new slurm job.

If c-d-slurm finds a container with locked_by_uuid=self and state=Locked that is not running according to slurm, it should unlock and re-lock the container (thereby invalidating/renewing auth_uuid) instead of submitting another slurm job using the same auth_uuid.
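A minimal Go sketch of the proposed reconciliation behavior. The types and method names here (API, Cancel, Unlock, Lock, Submit, runningInSlurm) are hypothetical stand-ins, not the actual crunch-dispatch-slurm or Arvados SDK interfaces; this only illustrates the decision logic described above.

```go
package main

import "log"

// ContainerState mirrors the states named in the report.
type ContainerState string

const (
	StateLocked  ContainerState = "Locked"
	StateRunning ContainerState = "Running"
)

// Container is a minimal stand-in for an Arvados container record.
type Container struct {
	UUID         string
	State        ContainerState
	LockedByUUID string
}

// API is a hypothetical client; the real dispatcher talks to the Arvados
// API server and to slurm (squeue/scancel/sbatch).
type API interface {
	Cancel(uuid string) error
	Unlock(uuid string) error
	Lock(uuid string) error // re-locking issues a fresh auth_uuid
	Submit(uuid string) error
}

// reconcile applies the proposed fix: for any container this dispatcher has
// locked but slurm is no longer running, cancel it (if Running) or
// unlock+re-lock it (if Locked) instead of resubmitting with the stale
// auth_uuid.
func reconcile(api API, selfUUID string, containers []Container, runningInSlurm map[string]bool) {
	for _, c := range containers {
		if c.LockedByUUID != selfUUID || runningInSlurm[c.UUID] {
			continue
		}
		switch c.State {
		case StateRunning:
			// The record says Running but slurm disagrees:
			// cancel rather than start a second slurm job.
			if err := api.Cancel(c.UUID); err != nil {
				log.Printf("cancel %s: %v", c.UUID, err)
			}
		case StateLocked:
			// Unlock and re-lock so the old auth_uuid is invalidated
			// before submitting a new slurm job.
			if err := api.Unlock(c.UUID); err != nil {
				log.Printf("unlock %s: %v", c.UUID, err)
				continue
			}
			if err := api.Lock(c.UUID); err != nil {
				log.Printf("relock %s: %v", c.UUID, err)
				continue
			}
			if err := api.Submit(c.UUID); err != nil {
				log.Printf("submit %s: %v", c.UUID, err)
			}
		}
	}
}

func main() {} // placeholder so the sketch compiles standalone
```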