
Bug #14596

Updated by Tom Clegg over 5 years ago

This seems to happen sometimes:

# Container is queued
# c-d-slurm submits a slurm job to run the container
# Something goes wrong -- perhaps the crunch-run process dies, or the slurm node goes down?
# c-d-slurm submits another slurm job, and the same container runs on a different node

If c-d-slurm finds a container with locked_by_uuid=self and state=Running that is not already being monitored by c-d-slurm, and is not running according to slurm, it should cancel the container instead of submitting a new slurm job (which could run the same container twice). Checking slurm could look like the sketch below.
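
A minimal sketch of the "is it running according to slurm" check, assuming slurm job names match container UUIDs; the helper name inSqueue and the surrounding package are hypothetical illustrations, not the real dispatcher code.

<pre><code class="go">
// Hypothetical helper, not actual crunch-dispatch-slurm code: ask slurm
// whether any queued or running job is named after the container UUID.
package dispatch

import (
	"os/exec"
	"strings"
)

// inSqueue reports whether squeue lists a job whose name equals ctrUUID.
func inSqueue(ctrUUID string) (bool, error) {
	// squeue --noheader --format=%j prints one job name per line.
	out, err := exec.Command("squeue", "--noheader", "--format=%j").Output()
	if err != nil {
		return false, err
	}
	for _, name := range strings.Split(string(out), "\n") {
		if strings.TrimSpace(name) == ctrUUID {
			return true, nil
		}
	}
	return false, nil
}
</code></pre>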

If c-d-slurm finds a container with locked_by_uuid=self and state=Locked that is not already being monitored by c-d-slurm, and is not running according to slurm, it should unlock and re-lock the container (thereby invalidating the old auth_uuid and issuing a new one) instead of submitting another slurm job that reuses the same auth_uuid.
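
Putting both rules together, the reconciliation decision might look like the following sketch. All names here (Container, decide, the monitored set) are illustrative assumptions, not the real dispatcher's types or API.

<pre><code class="go">
// Hypothetical sketch of the proposed reconciliation rules; none of these
// names come from the real crunch-dispatch-slurm source.
package main

import "fmt"

type ContainerState string

const (
	Locked  ContainerState = "Locked"
	Running ContainerState = "Running"
)

type Container struct {
	UUID         string
	State        ContainerState
	LockedByUUID string
}

// decide returns the action for a container locked by this dispatcher.
func decide(ctr Container, selfUUID string, monitored map[string]bool, inSqueue func(string) bool) string {
	if ctr.LockedByUUID != selfUUID || monitored[ctr.UUID] {
		return "ignore" // not ours, or already being monitored
	}
	if inSqueue(ctr.UUID) {
		return "monitor" // slurm job still exists; just start tracking it
	}
	switch ctr.State {
	case Running:
		// Running according to the API but unknown to slurm: the
		// original crunch-run is gone, so cancel instead of submitting
		// a new slurm job (which could run the container twice).
		return "cancel"
	case Locked:
		// Unlock and re-lock to invalidate the stale auth_uuid, then
		// submit a fresh slurm job using the renewed token.
		return "unlock, re-lock, resubmit"
	}
	return "ignore"
}

func main() {
	orphaned := Container{UUID: "ctr-1", State: Running, LockedByUUID: "self"}
	fmt.Println(decide(orphaned, "self", map[string]bool{}, func(string) bool { return false })) // cancel
}
</code></pre>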
