Bug #14596

Updated by Tom Clegg over 5 years ago

This seems to happen sometimes: 
 1. Container is queued.
 2. c-d-slurm submits a slurm job to run the container.
 3. Something goes wrong -- perhaps the crunch-run process dies, or the slurm node goes down?
 4. c-d-slurm submits another slurm job, and the same container runs on a different node.

If a crunch-dispatch-slurm (c-d-slurm) process finds a container with locked_by_uuid=self and state=Running that it is not already monitoring, and that is not running according to slurm, the container has already been started somewhere and has since exited. In that case c-d-slurm should cancel the container instead of submitting a new slurm job for it, as it does now.
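
A minimal sketch of that check follows. The type and helper names here (Container, slurmHasJob, cancelContainer, and so on) are hypothetical stand-ins for illustration, not the actual crunch-dispatch-slurm API:

```go
// Illustrative sketch only; names are assumptions, not the real API.
package dispatchsketch

import "log"

// Container mirrors the fields relevant to this check.
type Container struct {
	UUID         string
	State        string // "Locked", "Running", ...
	LockedByUUID string
}

// handleRunning: the container is Running and locked by this dispatcher, but
// it is neither monitored locally nor present in the slurm queue, so it must
// have already run and exited; cancel it instead of submitting a new slurm job.
func handleRunning(ctr Container, selfUUID string, monitored map[string]bool,
	slurmHasJob func(uuid string) bool, cancelContainer func(uuid string)) {
	if ctr.LockedByUUID != selfUUID || ctr.State != "Running" {
		return
	}
	if monitored[ctr.UUID] || slurmHasJob(ctr.UUID) {
		return // still tracked, or still queued/running in slurm: nothing to do
	}
	log.Printf("%s: Running, locked by self, but no slurm job found; cancelling", ctr.UUID)
	cancelContainer(ctr.UUID)
}
```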

If c-d-slurm finds a container with locked_by_uuid=self and state=Locked that it is not already monitoring, and that is not running or queued according to slurm, the container was locked by a previous c-d-slurm process. In that case c-d-slurm should unlock and re-lock the container (thereby invalidating/renewing its auth_uuid) instead of submitting another slurm job that uses the same auth_uuid; typically the container will then be picked up again, with state=Queued, on the next queue poll. (Note: unlocking and invalidating auth_uuid doesn't eliminate all possible side effects of a stray container still running somewhere, but it should be done anyway.)
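
Continuing the sketch above, the stale-lock case could look like the following; unlock and lock are again hypothetical stand-ins for "unlock the container" and "lock it again" API calls, which is what renews the auth_uuid:

```go
// handleStaleLock: the container is Locked by this dispatcher but not
// monitored locally and not in the slurm queue, so the lock came from a
// previous dispatcher process. Unlock and re-lock to invalidate the old
// auth_uuid before any new slurm job is submitted.
func handleStaleLock(ctr Container, selfUUID string, monitored map[string]bool,
	slurmHasJob func(uuid string) bool, unlock, lock func(uuid string)) {
	if ctr.LockedByUUID != selfUUID || ctr.State != "Locked" {
		return
	}
	if monitored[ctr.UUID] || slurmHasJob(ctr.UUID) {
		return
	}
	unlock(ctr.UUID) // container returns to Queued, old auth_uuid invalidated
	lock(ctr.UUID)   // re-lock with a fresh auth_uuid
}
```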
