Bug #14596: [crunch-dispatch-slurm] Abandoned container starts again instead of being cancelled - Arvados

Actions

Copy link

Bug #14596

closed

[crunch-dispatch-slurm] Abandoned container starts again instead of being cancelled

Added by Tom Clegg over 6 years ago. Updated about 6 years ago.

Status:

Resolved

Priority:

Normal

Assigned To:

Tom Clegg

Category:

Crunch

Target version:

2018-12-21 Sprint

Story points:

Release:

Arvados v1.4 - Q1/Q2 2019

Release relationship:

Auto

Description

This seems to happen sometimes:

Container is queued
c-d-slurm submits a slurm job to run the container
Something goes wrong -- perhaps the crunch-run process dies, or the slurm node goes down?
~~c-d-slurm submits another slurm job, and~~ the same container runs on a different node (we aren't certain c-d-slurm resubmits)

Subtasks 1 (0 open — 1 closed)

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

Updated by Tom Morris over 6 years ago

Target version changed from To Be Groomed to 2018-12-21 Sprint

Actions

Copy link

Updated by Peter Amstutz over 6 years ago

Assigned To set to Peter Amstutz

Actions

Copy link

Updated by Peter Amstutz over 6 years ago

If c-d-slurm finds a container with locked_by_uuid=self and state=Running that is not running according to slurm, it should cancel it instead of starting a new slurm job.

It does this already.

If c-d-slurm finds a container with locked_by_uuid=self and state=Locked that is not running according to slurm, it should unlock and re-lock the container (thereby invalidating/renewing auth_uuid) instead of submitting another slurm job using the same auth_uuid.

It does this already, too.

However, auth_uuid isn't used to update the container, the dispatcher token is. So invalidating the auth_uuid token doesn't guarantee that a lingering crunch-run will be prevented from updating the container.

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Can you note where in the code this happens? Problems are being reported in 1.2.0. Perhaps the bugs have been fixed since then. Otherwise we should go back and find a different explanation.

Actions

Copy link

Updated by Peter Amstutz over 6 years ago

Tom Clegg wrote:

Can you note where in the code this happens? Problems are being reported in 1.2.0. Perhaps the bugs have been fixed since then. Otherwise we should go back and find a different explanation.

crunch-dispatch-slurm.go L339:

        case <-ctx.Done():
            // Disappeared from squeue
            if err := disp.Arv.Get("containers", ctr.UUID, nil, &ctr); err != nil {
                log.Printf("error getting final container state for %s: %s", ctr.UUID, err)
            }
            switch ctr.State {
            case dispatch.Running:
                disp.UpdateState(ctr.UUID, dispatch.Cancelled)
            case dispatch.Locked:
                disp.Unlock(ctr.UUID)
            }
            return

Going back in time with git blame, this logic has been there since at least early 2017.

Actions

Copy link

Updated by Tom Clegg over 6 years ago

I see runContainer makes one attempt to cancel the container if it disappears from the slurm queue. But if that API call fails for any reason, it looks like c-d-s will completely forget about the container, and next time sdk/go/dispatch sees it in the Arvados queue, it will start it again.

Updated description to clarify.

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#10

Updated by Tom Clegg over 6 years ago

Hm. Looking at runContainer I don't see how runContainer would re-submit a slurm job for a container that already has state=Running. Still a missing piece to this puzzle?

Actions

Copy link

#11

Updated by Tom Clegg over 6 years ago

AFAIK we don't have access to the dispatcher logs for one of these problem cases, but we do have API logs that indicate

update container record to state=Locked
update container record to state=Running
(container runs, fails)
(crunch-run doesn't finalize the container record)
(time passes)
update container record to state=Running (was already state=Running)
(container runs again)

Could slurm be deciding to retry these jobs on its own? If so, maybe we can disable that behavior, but it might be impossible to avoid completely. As a backup plan we could also have crunch-run confirm state=Locked before doing anything else.

Actions

Copy link

#12

Updated by Tom Clegg over 6 years ago

Description updated (diff)

Actions

Copy link

#13

Updated by Tom Clegg over 6 years ago

"Jobs may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. If JobRequeue is set to a value of 1, then batch job may be requeued unless explicitly disabled by the user. [...] The default value is 1." https://slurm.schedmd.com/slurm.conf.html

"--no-requeue
Specifies that the batch job should never be requeued under any circumstances. Setting this option will prevent system administrators from being able to restart the job (for example, after a scheduled downtime), recover from a node failure, or be requeued upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning." https://slurm.schedmd.com/sbatch.html

(On the cluster where this problem is seen, JobRequeue is not specified, so defaulting to 1.)

Actions

Copy link

#14

Updated by Tom Clegg over 6 years ago

Status changed from New to In Progress
Assigned To changed from Peter Amstutz to Tom Clegg

14596-no-requeue @ 0731e1f9019fa841ef496e5e6e308e41deb7585a

Actions

Copy link

#15

Updated by Lucas Di Pentima over 6 years ago

This LGTM, thanks.

Actions

Copy link

#16

Updated by Peter Amstutz over 6 years ago

Tom Clegg wrote:

Could slurm be deciding to retry these jobs on its own? If so, maybe we can disable that behavior, but it might be impossible to avoid completely. As a backup plan we could also have crunch-run confirm state=Locked before doing anything else.

Checking "Locked" sounds like a good idea.

(An "Initializing" state between "Locked" and "Running" would also detect this, by preventing a backwards state change from "Running" to "Initializing")

Actions

Copy link

#17

Updated by Tom Clegg over 6 years ago

If you want to try the --no-requeue fix without updating crunch-dispatch-slurm, there are three other ways of telling slurm not to do this:

Set SBATCH_NO_REQUEUE=1 in crunch-dispatch-slurm's environment (via systemd override, runit script, etc) and restart crunch-dispatch-slurm
Add JobRequeue=0 to slurm.conf and restart slurm
Add "--no-requeue" to SbatchArguments in /etc/arvados/crunch-dispatch-slurm/crunch-dispatch-slurm.yml and restart crunch-dispatch-slurm

After updating one of these, wait for crunch-dispatch-slurm to submit a job, run "squeue" to get a job ID, and run "scontrol show jobid=123456 | grep Requeue". You should see Requeue=0.

Actions

Copy link

#18

Updated by Tom Clegg over 6 years ago

Peter Amstutz wrote:

(An "Initializing" state between "Locked" and "Running" would also detect this, by preventing a backwards state change from "Running" to "Initializing")

We already prevent backwards state changes from Running to Locked, so testing state==Locked tells us definitively whether the container has ever started. Setting state=Initializing seems like a hard way to do an easy thing.

Actions

Copy link

#19

Updated by Tom Clegg over 6 years ago

Status changed from In Progress to Resolved
% Done changed from 0 to 100

Applied in changeset arvados|d1571f495b0e0e05c833d4666924bcb6a288b33d.

Actions

Copy link

#20

Updated by Nico César over 6 years ago

Tom Clegg wrote:

If you want to try the --no-requeue fix without updating crunch-dispatch-slurm, there are two other ways of telling slurm not to do this:

Set SBATCH_NO_REQUEUE=1 in crunch-dispatch-slurm's environment (via systemd override, runit script, etc)

Add JobRequeue=0 to slurm.conf

After updating one of these, wait for crunch-dispatch-slurm to submit a job, run "squeue" to get a job ID, and run "scontrol show jobid=123456 | grep Requeue". You should see Requeue=0.

adding "--no-requeue " to sbatch_arguments in puppet, since ENV variables are not as easy and changing slurm.conf will need to recreate a compute image and stopping the cluster

Actions

Copy link

#21

Updated by Nico César over 6 years ago

Nico César wrote:

Tom Clegg wrote:

If you want to try the --no-requeue fix without updating crunch-dispatch-slurm, there are two other ways of telling slurm not to do this:

Set SBATCH_NO_REQUEUE=1 in crunch-dispatch-slurm's environment (via systemd override, runit script, etc)

Add JobRequeue=0 to slurm.conf

After updating one of these, wait for crunch-dispatch-slurm to submit a job, run "squeue" to get a job ID, and run "scontrol show jobid=123456 | grep Requeue". You should see Requeue=0.

adding "--no-requeue " to sbatch_arguments in puppet, since ENV variables are not as easy and changing slurm.conf will need to recreate a compute image and stopping the cluster

done, it's in effect:

e51c5:~# scontrol show jobId=543917 | grep Requ
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=1:0

Actions

Copy link

#22

Updated by Tom Clegg over 6 years ago

Status changed from Resolved to In Progress

14596-check-container-locked @ 81dd4a91b279b3229fb359df6c5dbf07571083ac

Actions

Copy link

#23

Updated by Lucas Di Pentima over 6 years ago

Tom Clegg wrote:

14596-check-container-locked @ 81dd4a91b279b3229fb359df6c5dbf07571083ac

This also LGTM. Thanks!

Actions

Copy link

#24

Updated by Tom Clegg over 6 years ago

Status changed from In Progress to Resolved
% Done changed from 0 to 100

Applied in changeset arvados|cec011b7718536de42ebd683aa96bee92cbca06c.

Actions

Copy link

#25

Updated by Tom Morris about 6 years ago

Release set to 15

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Arvados

Custom queries

Bug #14596

[crunch-dispatch-slurm] Abandoned container starts again instead of being cancelled

Updated by Tom Clegg over 6 years ago

Updated by Tom Morris over 6 years ago

Updated by Peter Amstutz over 6 years ago

Updated by Peter Amstutz over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Peter Amstutz over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Lucas Di Pentima over 6 years ago

Updated by Peter Amstutz over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Nico César over 6 years ago

Updated by Nico César over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Lucas Di Pentima over 6 years ago

Updated by Tom Clegg over 6 years ago

Updated by Tom Morris about 6 years ago