Bug #14596

[crunch-dispatch-slurm] Abandoned container starts again instead of being cancelled

Added by Tom Clegg 7 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Start date: 12/13/2018
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Description

This seems to happen sometimes:
  1. Container is queued
  2. c-d-slurm submits a slurm job to run the container
  3. Something goes wrong -- perhaps the crunch-run process dies, or the slurm node goes down?
  4. c-d-slurm submits another slurm job, and the same container runs on a different node (we aren't certain c-d-slurm resubmits)

Subtasks

Task #14607: Review 14596-check-container-locked (Resolved, Tom Clegg)

Associated revisions

Revision d1571f49
Added by Tom Clegg 7 months ago

Merge branch '14596-no-requeue'

fixes #14596

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision cec011b7
Added by Tom Clegg 7 months ago

Merge branch '14596-check-container-locked'

closes #14596

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg 7 months ago

  • Description updated (diff)

#2 Updated by Tom Morris 7 months ago

  • Target version changed from To Be Groomed to 2018-12-21 Sprint

#3 Updated by Peter Amstutz 7 months ago

  • Assigned To set to Peter Amstutz

#4 Updated by Peter Amstutz 7 months ago

If c-d-slurm finds a container with locked_by_uuid=self and state=Running that is not running according to slurm, it should cancel it instead of starting a new slurm job.

It does this already.

If c-d-slurm finds a container with locked_by_uuid=self and state=Locked that is not running according to slurm, it should unlock and re-lock the container (thereby invalidating/renewing auth_uuid) instead of submitting another slurm job using the same auth_uuid.

It does this already, too.

However, the container is updated using the dispatcher token, not auth_uuid. So invalidating the auth_uuid token doesn't guarantee that a lingering crunch-run will be prevented from updating the container.

#5 Updated by Tom Clegg 7 months ago

Can you note where in the code this happens? Problems are being reported in 1.2.0. Perhaps the bugs have been fixed since then. Otherwise we should go back and find a different explanation.

#6 Updated by Peter Amstutz 7 months ago

Tom Clegg wrote:

Can you note where in the code this happens? Problems are being reported in 1.2.0. Perhaps the bugs have been fixed since then. Otherwise we should go back and find a different explanation.

crunch-dispatch-slurm.go L339:

        case <-ctx.Done():
            // Disappeared from squeue
            if err := disp.Arv.Get("containers", ctr.UUID, nil, &ctr); err != nil {
                log.Printf("error getting final container state for %s: %s", ctr.UUID, err)
            }
            switch ctr.State {
            case dispatch.Running:
                disp.UpdateState(ctr.UUID, dispatch.Cancelled)
            case dispatch.Locked:
                disp.Unlock(ctr.UUID)
            }
            return

Going back in time with git blame, this logic has been there since at least early 2017.

#7 Updated by Tom Clegg 7 months ago

I see runContainer makes one attempt to cancel the container if it disappears from the slurm queue. But if that API call fails for any reason, it looks like c-d-slurm will completely forget about the container, and the next time sdk/go/dispatch sees it in the Arvados queue, it will start it again.

Updated description to clarify.

#8 Updated by Tom Clegg 7 months ago

  • Description updated (diff)

#9 Updated by Tom Clegg 7 months ago

  • Description updated (diff)

#10 Updated by Tom Clegg 7 months ago

Hm. Looking at runContainer I don't see how runContainer would re-submit a slurm job for a container that already has state=Running. Still a missing piece to this puzzle?

#11 Updated by Tom Clegg 7 months ago

AFAIK we don't have access to the dispatcher logs for one of these problem cases, but we do have API logs that indicate
  1. update container record to state=Locked
  2. update container record to state=Running
  3. (container runs, fails)
  4. (crunch-run doesn't finalize the container record)
  5. (time passes)
  6. update container record to state=Running (was already state=Running)
  7. (container runs again)

Could slurm be deciding to retry these jobs on its own? If so, maybe we can disable that behavior, but it might be impossible to avoid completely. As a backup plan we could also have crunch-run confirm state=Locked before doing anything else.

#12 Updated by Tom Clegg 7 months ago

  • Description updated (diff)

#13 Updated by Tom Clegg 7 months ago

"Jobs may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. If JobRequeue is set to a value of 1, then batch job may be requeued unless explicitly disabled by the user. [...] The default value is 1." https://slurm.schedmd.com/slurm.conf.html

"--no-requeue
Specifies that the batch job should never be requeued under any circumstances. Setting this option will prevent system administrators from being able to restart the job (for example, after a scheduled downtime), recover from a node failure, or be requeued upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning." https://slurm.schedmd.com/sbatch.html

(On the cluster where this problem is seen, JobRequeue is not specified, so it defaults to 1.)
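
To illustrate the sbatch option quoted above, here is a minimal Go sketch of a dispatcher building its sbatch argument list so that --no-requeue is always present. The function name sbatchArgs, the placeholder container UUID, and the extra argument are hypothetical and chosen only for illustration; this is not the code from any branch mentioned in this ticket.

```go
package main

import "fmt"

// sbatchArgs is a hypothetical helper: it returns an sbatch argument list
// for one container. "--no-requeue" is always included so that slurm does
// not restart the batch script on its own after a node failure or
// preemption. Site-specific extra arguments are appended at the end.
func sbatchArgs(containerUUID string, extra []string) []string {
	args := []string{
		"--job-name=" + containerUUID, // lets the job be matched to its container
		"--no-requeue",
	}
	return append(args, extra...)
}

func main() {
	// The UUID and the extra argument are placeholders.
	fmt.Println(sbatchArgs("zzzzz-dz642-xxxxxxxxxxxxxxx", []string{"--nice=10"}))
}
```

Putting the flag in the per-job arguments keeps the behavior independent of the cluster-wide JobRequeue setting in slurm.conf.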

#14 Updated by Tom Clegg 7 months ago

  • Status changed from New to In Progress
  • Assigned To changed from Peter Amstutz to Tom Clegg

#15 Updated by Lucas Di Pentima 7 months ago

This LGTM, thanks.

#16 Updated by Peter Amstutz 7 months ago

Tom Clegg wrote:

Could slurm be deciding to retry these jobs on its own? If so, maybe we can disable that behavior, but it might be impossible to avoid completely. As a backup plan we could also have crunch-run confirm state=Locked before doing anything else.

Checking "Locked" sounds like a good idea.

(An "Initializing" state between "Locked" and "Running" would also detect this, by preventing a backwards state change from "Running" to "Initializing")

#17 Updated by Tom Clegg 7 months ago

If you want to try the --no-requeue fix without updating crunch-dispatch-slurm, there are three other ways of telling slurm not to do this:
  1. Set SBATCH_NO_REQUEUE=1 in crunch-dispatch-slurm's environment (via systemd override, runit script, etc) and restart crunch-dispatch-slurm
  2. Add JobRequeue=0 to slurm.conf and restart slurm
  3. Add "--no-requeue" to SbatchArguments in /etc/arvados/crunch-dispatch-slurm/crunch-dispatch-slurm.yml and restart crunch-dispatch-slurm

After updating one of these, wait for crunch-dispatch-slurm to submit a job, run "squeue" to get a job ID, and run "scontrol show jobid=123456 | grep Requeue". You should see Requeue=0.
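
If you would rather script that check than read the output by eye, the following Go sketch extracts the Requeue field from text in the format printed by "scontrol show jobid=...". The function name requeueSetting is hypothetical, and the sample line is only an illustration of the key=value layout; running scontrol itself is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// requeueSetting is a hypothetical helper: it scans scontrol output, which
// consists of whitespace-separated key=value fields, and returns the value
// of the Requeue field. The second result is false if no such field exists.
func requeueSetting(scontrolOutput string) (string, bool) {
	for _, field := range strings.Fields(scontrolOutput) {
		if strings.HasPrefix(field, "Requeue=") {
			return strings.TrimPrefix(field, "Requeue="), true
		}
	}
	return "", false
}

func main() {
	// Illustrative sample line in scontrol's key=value format.
	sample := "   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0"
	value, found := requeueSetting(sample)
	fmt.Println(value, found)
}
```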

#18 Updated by Tom Clegg 7 months ago

Peter Amstutz wrote:

(An "Initializing" state between "Locked" and "Running" would also detect this, by preventing a backwards state change from "Running" to "Initializing")

We already prevent backwards state changes from Running to Locked, so testing state==Locked tells us definitively whether the container has ever started. Setting state=Initializing seems like a hard way to do an easy thing.
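
A minimal sketch of that check in Go, assuming crunch-run has already fetched the container record before doing any work. The container struct and the checkNotStarted function are hypothetical stand-ins for illustration, not the actual crunch-run code; only the state names "Locked" and "Running" come from this thread.

```go
package main

import "fmt"

// container is a hypothetical, stripped-down container record holding
// only the fields this check needs.
type container struct {
	UUID  string
	State string
}

// checkNotStarted returns an error unless the container is still in the
// Locked state. Because Running can never go back to Locked, any other
// state means the container has already been started, for example by an
// earlier slurm job that was later requeued.
func checkNotStarted(ctr container) error {
	if ctr.State != "Locked" {
		return fmt.Errorf("container %s has state %q, expected \"Locked\": refusing to run", ctr.UUID, ctr.State)
	}
	return nil
}

func main() {
	// Placeholder UUID; a requeued job would see state=Running here.
	ctr := container{UUID: "zzzzz-dz642-xxxxxxxxxxxxxxx", State: "Running"}
	if err := checkNotStarted(ctr); err != nil {
		fmt.Println(err)
	}
}
```

Running the check first means a job that slurm has restarted exits before it touches the container record or starts any work.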

#19 Updated by Tom Clegg 7 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

#20 Updated by Nico César 7 months ago

Tom Clegg wrote:

If you want to try the --no-requeue fix without updating crunch-dispatch-slurm, there are two other ways of telling slurm not to do this:
  1. Set SBATCH_NO_REQUEUE=1 in crunch-dispatch-slurm's environment (via systemd override, runit script, etc)
  2. Add JobRequeue=0 to slurm.conf

After updating one of these, wait for crunch-dispatch-slurm to submit a job, run "squeue" to get a job ID, and run "scontrol show jobid=123456 | grep Requeue". You should see Requeue=0.

I'm adding "--no-requeue" to sbatch_arguments in puppet, since environment variables are not as easy to set and changing slurm.conf would require recreating a compute image and stopping the cluster.

#21 Updated by Nico César 7 months ago

done, it's in effect:

e51c5:~# scontrol show jobId=543917 | grep Requ
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0

#22 Updated by Tom Clegg 7 months ago

  • Status changed from Resolved to In Progress

14596-check-container-locked @ 81dd4a91b279b3229fb359df6c5dbf07571083ac

#23 Updated by Lucas Di Pentima 7 months ago

Tom Clegg wrote:

14596-check-container-locked @ 81dd4a91b279b3229fb359df6c5dbf07571083ac

This also LGTM. Thanks!

#24 Updated by Tom Clegg 7 months ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

#25 Updated by Tom Morris 5 months ago

  • Release set to 15
