Bug #14596 (Closed)
[crunch-dispatch-slurm] Abandoned container starts again instead of being cancelled
Added by Tom Clegg almost 6 years ago. Updated over 5 years ago.
Description
- Container is queued
- c-d-slurm submits a slurm job to run the container
- Something goes wrong -- perhaps the crunch-run process dies, or the slurm node goes down?
- c-d-slurm submits another slurm job, and the same container runs on a different node (we aren't certain c-d-slurm resubmits)
Updated by Tom Morris almost 6 years ago
- Target version changed from To Be Groomed to 2018-12-21 Sprint
Updated by Peter Amstutz almost 6 years ago
If c-d-slurm finds a container with locked_by_uuid=self and state=Running that is not running according to slurm, it should cancel it instead of starting a new slurm job.
It does this already.
If c-d-slurm finds a container with locked_by_uuid=self and state=Locked that is not running according to slurm, it should unlock and re-lock the container (thereby invalidating/renewing auth_uuid) instead of submitting another slurm job using the same auth_uuid.
It does this already, too.
However, auth_uuid isn't used to update the container, the dispatcher token is. So invalidating the auth_uuid token doesn't guarantee that a lingering crunch-run will be prevented from updating the container.
Updated by Tom Clegg almost 6 years ago
Can you note where in the code this happens? Problems are being reported in 1.2.0. Perhaps the bugs have been fixed since then. Otherwise we should go back and find a different explanation.
Updated by Peter Amstutz almost 6 years ago
Tom Clegg wrote:
Can you note where in the code this happens? Problems are being reported in 1.2.0. Perhaps the bugs have been fixed since then. Otherwise we should go back and find a different explanation.
crunch-dispatch-slurm.go L339:
    case <-ctx.Done():
        // Disappeared from squeue
        if err := disp.Arv.Get("containers", ctr.UUID, nil, &ctr); err != nil {
            log.Printf("error getting final container state for %s: %s", ctr.UUID, err)
        }
        switch ctr.State {
        case dispatch.Running:
            disp.UpdateState(ctr.UUID, dispatch.Cancelled)
        case dispatch.Locked:
            disp.Unlock(ctr.UUID)
        }
        return
Going back in time with git blame, this logic has been there since at least early 2017.
Updated by Tom Clegg almost 6 years ago
I see runContainer makes one attempt to cancel the container if it disappears from the slurm queue. But if that API call fails for any reason, it looks like c-d-s will completely forget about the container, and next time sdk/go/dispatch sees it in the Arvados queue, it will start it again.
Updated description to clarify.
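For illustration only, a more robust approach would be to retry that final update briefly instead of giving up after one attempt. This is a hypothetical sketch, not the shipped fix; it assumes UpdateState returns an error, and the helper name is made up:

    // Hypothetical sketch: retry a state update a few times so one transient
    // API error doesn't leave a state=Running container behind with no slurm
    // job backing it.
    func withRetry(update func() error) error {
        var err error
        for attempt := 1; attempt <= 5; attempt++ {
            if err = update(); err == nil {
                return nil
            }
            time.Sleep(time.Duration(attempt) * time.Second) // simple linear backoff
        }
        return err // caller decides whether to log and move on
    }

    // e.g. withRetry(func() error { return disp.UpdateState(ctr.UUID, dispatch.Cancelled) })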
Updated by Tom Clegg almost 6 years ago
Hm. Looking at runContainer, I don't see how it would re-submit a slurm job for a container that already has state=Running. Still a missing piece to this puzzle?
Updated by Tom Clegg almost 6 years ago
The sequence of events for an affected container looks like this:
- update container record to state=Locked
- update container record to state=Running
- (container runs, fails)
- (crunch-run doesn't finalize the container record)
- (time passes)
- update container record to state=Running (was already state=Running)
- (container runs again)
Could slurm be deciding to retry these jobs on its own? If so, maybe we can disable that behavior, but it might be impossible to avoid completely. As a backup plan we could also have crunch-run confirm state=Locked before doing anything else.
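A rough illustration of that backup plan (not the actual crunch-run change): before doing anything else, crunch-run re-fetches its container record and bails out unless the state is still Locked. The Get call mirrors the dispatcher excerpt above; the exact constant and error wording here are assumptions.

    // Hypothetical sketch: refuse to start a container unless it is still Locked.
    // A container that has ever reached Running (or a final state) is never
    // re-run, even if slurm requeues the batch job.
    var ctr arvados.Container
    if err := arv.Get("containers", containerUUID, nil, &ctr); err != nil {
        return fmt.Errorf("error checking container state before starting: %s", err)
    }
    if ctr.State != arvados.ContainerStateLocked {
        return fmt.Errorf("container %s is in state %q, not Locked; refusing to start it", ctr.UUID, ctr.State)
    }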
Updated by Tom Clegg almost 6 years ago
"Jobs may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. If JobRequeue is set to a value of 1, then batch job may be requeued unless explicitly disabled by the user. [...] The default value is 1." https://slurm.schedmd.com/slurm.conf.html
"--no-requeue
Specifies that the batch job should never be requeued under any circumstances. Setting this option will prevent system administrators from being able to restart the job (for example, after a scheduled downtime), recover from a node failure, or be requeued upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning." https://slurm.schedmd.com/sbatch.html
(On the cluster where this problem is seen, JobRequeue is not specified, so defaulting to 1.)
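For context, opting out per job only requires adding the flag when the dispatcher builds its sbatch command line. A minimal sketch of that idea; the helper name and surrounding arguments are hypothetical, not the actual crunch-dispatch-slurm code:

    // Hypothetical sketch: always pass --no-requeue so slurm never restarts a
    // batch job on its own after a node failure or preemption.
    func sbatchLine(containerUUID string, configuredArgs []string) []string {
        args := []string{
            "sbatch",
            "--job-name=" + containerUUID,
            "--no-requeue", // overrides the JobRequeue=1 default in slurm.conf
        }
        return append(args, configuredArgs...)
    }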
Updated by Tom Clegg almost 6 years ago
- Status changed from New to In Progress
- Assigned To changed from Peter Amstutz to Tom Clegg
14596-no-requeue @ 0731e1f9019fa841ef496e5e6e308e41deb7585a
Updated by Peter Amstutz almost 6 years ago
Tom Clegg wrote:
Could slurm be deciding to retry these jobs on its own? If so, maybe we can disable that behavior, but it might be impossible to avoid completely. As a backup plan we could also have crunch-run confirm state=Locked before doing anything else.
Checking "Locked" sounds like a good idea.
(An "Initializing" state between "Locked" and "Running" would also detect this, by preventing a backwards state change from "Running" to "Initializing")
Updated by Tom Clegg almost 6 years ago
If you want to try the --no-requeue fix without updating crunch-dispatch-slurm, there are three other ways of telling slurm not to do this:
- Set SBATCH_NO_REQUEUE=1 in crunch-dispatch-slurm's environment (via systemd override, runit script, etc) and restart crunch-dispatch-slurm
- Add JobRequeue=0 to slurm.conf and restart slurm
- Add "--no-requeue" to SbatchArguments in /etc/arvados/crunch-dispatch-slurm/crunch-dispatch-slurm.yml and restart crunch-dispatch-slurm
After updating one of these, wait for crunch-dispatch-slurm to submit a job, run "squeue" to get a job ID, and run "scontrol show jobid=123456 | grep Requeue". You should see Requeue=0.
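For the third option, the YAML would look something like this (assuming the usual list syntax for SbatchArguments; merge with any entries you already have rather than replacing them):

    SbatchArguments:
    - "--no-requeue"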
Updated by Tom Clegg almost 6 years ago
Peter Amstutz wrote:
(An "Initializing" state between "Locked" and "Running" would also detect this, by preventing a backwards state change from "Running" to "Initializing")
We already prevent backwards state changes from Running to Locked, so testing state==Locked tells us definitively whether the container has ever started. Setting state=Initializing seems like a hard way to do an easy thing.
Updated by Tom Clegg almost 6 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|d1571f495b0e0e05c833d4666924bcb6a288b33d.
Updated by Nico César almost 6 years ago
Tom Clegg wrote:
If you want to try the --no-requeue fix without updating crunch-dispatch-slurm, there are two other ways of telling slurm not to do this:
- Set SBATCH_NO_REQUEUE=1 in crunch-dispatch-slurm's environment (via systemd override, runit script, etc)
- Add JobRequeue=0 to slurm.conf
After updating one of these, wait for crunch-dispatch-slurm to submit a job, run "squeue" to get a job ID, and run "scontrol show jobid=123456 | grep Requeue". You should see Requeue=0.
adding "--no-requeue " to sbatch_arguments in puppet, since ENV variables are not as easy and changing slurm.conf
will need to recreate a compute image and stopping the cluster
Updated by Nico César almost 6 years ago
Nico César wrote:
Tom Clegg wrote:
If you want to try the --no-requeue fix without updating crunch-dispatch-slurm, there are two other ways of telling slurm not to do this:
- Set SBATCH_NO_REQUEUE=1 in crunch-dispatch-slurm's environment (via systemd override, runit script, etc)
- Add JobRequeue=0 to slurm.conf
After updating one of these, wait for crunch-dispatch-slurm to submit a job, run "squeue" to get a job ID, and run "scontrol show jobid=123456 | grep Requeue". You should see Requeue=0.
adding "--no-requeue " to sbatch_arguments in puppet, since ENV variables are not as easy and changing
slurm.conf
will need to recreate a compute image and stopping the cluster
done, it's in effect:
    e51c5:~# scontrol show jobId=543917 | grep Requ
       Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0
Updated by Tom Clegg almost 6 years ago
- Status changed from Resolved to In Progress
14596-check-container-locked @ 81dd4a91b279b3229fb359df6c5dbf07571083ac
Updated by Lucas Di Pentima almost 6 years ago
Tom Clegg wrote:
14596-check-container-locked @ 81dd4a91b279b3229fb359df6c5dbf07571083ac
This also LGTM. Thanks!
Updated by Tom Clegg almost 6 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|cec011b7718536de42ebd683aa96bee92cbca06c.