Bug #14596

Updated by Tom Clegg over 5 years ago

This seems to happen sometimes: 
 # Container is queued 
 # c-d-slurm submits a slurm job to run the container 
 # Something goes wrong -- perhaps the crunch-run process dies, or the slurm node goes down? 
 # c-d-slurm submits another slurm job, and the same container runs on a different node (we aren't certain c-d-slurm resubmits)

If a crunch-dispatch-slurm process finds a container with locked_by_uuid=self and state=Running that it is not already monitoring, and that is not running according to slurm, this means the container has already been started somewhere and has since exited. Therefore, crunch-dispatch-slurm should cancel the container instead of submitting a new slurm job as it does now.
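
A minimal sketch of that check in Go (crunch-dispatch-slurm's language). The Container struct, the monitoring map, and the inSqueue/cancel callbacks are hypothetical stand-ins for the dispatcher's own state and the Arvados API calls, not the actual crunch-dispatch-slurm code:

<pre><code class="go">
package dispatchsketch

import "log"

// Container is a pared-down stand-in for the Arvados container record;
// the field names here are illustrative, not the SDK's exact types.
type Container struct {
	UUID         string
	State        string
	LockedByUUID string
}

// handleOrphanedRunning sketches the proposed rule: if this dispatcher
// holds the lock on a Running container, is not already monitoring it,
// and slurm reports no job for it, the container must have started (and
// exited) elsewhere, so cancel it rather than submit a new slurm job.
func handleOrphanedRunning(ctr Container, selfUUID string, monitoring map[string]bool, inSqueue func(string) bool, cancel func(string) error) error {
	if ctr.State != "Running" || ctr.LockedByUUID != selfUUID {
		return nil
	}
	if monitoring[ctr.UUID] || inSqueue(ctr.UUID) {
		return nil // still being handled normally
	}
	log.Printf("container %s is Running with no matching slurm job; cancelling", ctr.UUID)
	return cancel(ctr.UUID)
}
</code></pre>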

If a crunch-dispatch-slurm process finds a container in the Arvados queue with locked_by_uuid=self and state=Locked, but _that crunch-dispatch-slurm process did not lock the container itself_ and it is not running/queued according to slurm, this means the container was locked by a previous c-d-s process. Crunch-dispatch-slurm should unlock the container. Typically the container will then be picked up again with state=Queued on the next queue poll. (Note: Unlocking and invalidating auth_uuid doesn't eliminate all possible side effects of having a stray container still running somewhere, but it should be done anyway.)
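
A corresponding sketch for the Locked case, again using hypothetical lockedHere/inSqueue/unlock stand-ins rather than the real dispatcher internals:

<pre><code class="go">
package dispatchsketch

// handleStaleLock sketches the proposed rule for Locked containers: if
// this dispatcher holds the lock but did not take it in this process
// (the lock is left over from a previous crunch-dispatch-slurm run) and
// slurm has no queued/running job for the container, unlock it so the
// next queue poll can pick it up again as Queued. Unlocking should also
// invalidate the container's auth_uuid, although that does not rule out
// side effects from a stray crunch-run still running somewhere.
func handleStaleLock(uuid string, lockedHere func(string) bool, inSqueue func(string) bool, unlock func(string) error) error {
	if lockedHere(uuid) || inSqueue(uuid) {
		return nil // locked by this process, or slurm is still handling it
	}
	return unlock(uuid)
}
</code></pre>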
