Bug #11626 (closed): [Crunch2] Still not propagating slurm errors to user

Added by Peter Amstutz almost 7 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Normal
Assigned To: Peter Amstutz
Category: -
Target version: 2017-05-24 sprint
Story points: -

Description

If a user submits a job and slurm rejects it, the slurm error should be propagated to the container log so the user can see what is wrong.
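
A rough sketch of what "propagate to the container log" could mean on the dispatcher side, assuming the arvadosclient SDK and the disp.Arv client mentioned later in this thread (the helper name, payload shape, and import path below are illustrative, not the committed fix):

    package sketch // illustrative only, not the committed fix

    import (
        "log"

        "git.curoverse.com/arvados.git/sdk/go/arvadosclient" // SDK import path assumed
    )

    // Dispatcher stands in for the real crunch-dispatch-slurm dispatcher type.
    type Dispatcher struct {
        Arv arvadosclient.ArvadosClient
    }

    // logToContainer writes text to the local log and to an Arvados "logs"
    // record attached to the container, so the submitting user can see it.
    func (disp *Dispatcher) logToContainer(containerUUID, text string) {
        log.Print(text)
        lr := arvadosclient.Dict{"log": arvadosclient.Dict{
            "object_uuid": containerUUID,
            "event_type":  "dispatch",
            "properties":  map[string]string{"text": text},
        }}
        if err := disp.Arv.Create("logs", lr, nil); err != nil {
            log.Printf("error creating log record for container %s: %s", containerUUID, err)
        }
    }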


Subtasks 1 (0 open, 1 closed)

Task #11633: Review 11626-crunch2-log-slurm-errors (Resolved, Peter Amstutz, 05/08/2017)

Related issues

Related to Arvados - Bug #11669: [Crunch2] crunch-dispatch-slurm hits scancel too much (In Progress)

#1

Updated by Peter Amstutz almost 7 years ago

  • Description updated (diff)

#2

Updated by Peter Amstutz almost 7 years ago

  • Target version set to 2017-05-10 sprint

#3

Updated by Peter Amstutz almost 7 years ago

  • Status changed from New to In Progress

#4

Updated by Peter Amstutz almost 7 years ago

  • Assigned To set to Peter Amstutz

#5

Updated by Tom Clegg almost 7 years ago

I'm not sure why the HasUUID() check is added in run() in crunch-dispatch-slurm.go. ctx.Done() fires as soon as HasUUID() returns false, so it seems like it could only make a difference during a race between <-status and <-ctx.Done() -- the rest of the time it just introduces an unnecessary polling delay between dispatch telling crunch-dispatch-slurm to cancel and c-d-s actually running scancel.
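
For readers without the branch handy, the loop in question has roughly the following shape (the names, signatures, and scancel helper are assumptions, not the actual code). Since the dispatch library cancels ctx as soon as it stops tracking the UUID, an extra HasUUID() check can only matter during the race described above:

    package sketch // illustrative only

    import "context"

    // Container and scancel stand in for the real arvados.Container type and
    // the dispatcher's cancel helper.
    type Container struct{ UUID string }

    func scancel(uuid string) {
        // run "scancel" for the container's slurm job (stubbed here)
    }

    func run(ctx context.Context, uuid string, status <-chan Container) {
        for {
            select {
            case <-ctx.Done():
                // ctx is canceled as soon as the dispatch library stops
                // tracking the UUID, so this case already covers the
                // "HasUUID() is now false" condition without extra polling.
                scancel(uuid)
                return
            case c, ok := <-status:
                if !ok {
                    return
                }
                _ = c // react to state/priority updates here
            }
        }
    }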

Also not sure why we're adding the extra call to tracker.update(c) before tracker.close(). The dispatch library is supposed to reduce the amount of state/priority logic each dispatcher has to implement -- if the idea here is just to log every state change, perhaps it should go in dispatch.go instead?

event_type="dispatch" might be better than "crunch-dispatch-slurm" for filtering.

In TestIntegrationNormal, why is the *bool better than a bool?
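
As an aside on the last question: the usual reason to reach for a *bool in a test is to distinguish "never set" from "explicitly false"; whether that was the motivation here isn't stated in the thread. A minimal illustration:

    package sketch // illustrative only

    // tristate shows the extra state a *bool carries compared to a bool:
    // nil means the value was never set at all.
    func tristate(sawCancel *bool) string {
        switch {
        case sawCancel == nil:
            return "never set"
        case *sawCancel:
            return "set to true"
        default:
            return "set to false"
        }
    }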

#6

Updated by Peter Amstutz almost 7 years ago

However I've rebased out the HasUUID changes so we can discuss it separately.

Now at @ f8675ad473b45387b1286c6b7a41edf36148ebac

(note-5 referring to 26d7e1809376534b9060534e25399cbd462f38da)

The problem I was trying to solve was that the logs are full of this:

2017-05-08_14:09:51.64626 2017/05/08 14:09:51 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job
2017-05-08_14:09:52.35643 2017/05/08 14:09:52 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel
2017-05-08_14:09:53.35654 2017/05/08 14:09:53 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job
2017-05-08_14:09:54.35664 2017/05/08 14:09:54 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel
2017-05-08_14:09:55.35680 2017/05/08 14:09:55 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job
2017-05-08_14:09:56.35509 2017/05/08 14:09:56 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel
2017-05-08_14:09:57.35538 2017/05/08 14:09:57 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job
2017-05-08_14:09:58.35661 2017/05/08 14:09:58 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel
2017-05-08_14:09:59.35681 2017/05/08 14:09:59 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job
2017-05-08_14:10:00.35669 2017/05/08 14:10:00 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel
2017-05-08_14:10:01.35804 2017/05/08 14:10:01 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job
2017-05-08_14:10:02.35522 2017/05/08 14:10:02 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel
2017-05-08_14:10:03.35664 2017/05/08 14:10:03 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job
2017-05-08_14:10:04.35882 2017/05/08 14:10:04 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel
2017-05-08_14:10:05.35906 2017/05/08 14:10:05 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job
2017-05-08_14:10:06.36192 2017/05/08 14:10:06 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel
2017-05-08_14:10:07.36220 2017/05/08 14:10:07 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job
2017-05-08_14:10:08.36188 2017/05/08 14:10:08 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel
2017-05-08_14:10:09.36211 2017/05/08 14:10:09 Dispatcher says container qr1hi-dz642-tjezupwylh67k2b is done: cancel slurm job
2017-05-08_14:10:10.35595 2017/05/08 14:10:10 container qr1hi-dz642-tjezupwylh67k2b is still in squeue after scancel

This is because once the channel is closed, it just keeps hitting scancel() with no delay in between. So I thought it could at least wait for the next HasUUID() broadcast.
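
Purely as an illustration of the intent described above (the helper names are made up, not the real dispatcher code): retrying scancel on a fixed one-second tick, rather than as fast as the loop allows, would look something like this:

    package sketch // illustrative only

    import "time"

    // Stubs for the real squeue check and scancel call.
    func stillInSqueue(uuid string) bool { return false }
    func scancel(uuid string)            {}

    // cancelUntilGone keeps retrying scancel, but waits a second between
    // attempts instead of spinning as fast as the loop allows.
    func cancelUntilGone(uuid string) {
        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()
        for stillInSqueue(uuid) {
            scancel(uuid)
            <-ticker.C
        }
    }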

#7

Updated by Peter Amstutz almost 7 years ago

  • Target version changed from 2017-05-10 sprint to 2017-05-24 sprint

#8

Updated by Tom Clegg almost 7 years ago

Peter Amstutz wrote:

The problem I was trying to solve was that the logs are full of this:

[...]

This is because once the channel is closed, it just keeps hitting scancel() with no delay in between. So I thought it could at least wait for the next HasUUID() broadcast.

It looks to me like there's a 1-second delay. This delay is inserted by scancel(). The reason it keeps hitting scancel() is that the mocked squeue keeps reporting that the slurm job is still running.

I think the correct way to get rid of those messages would be to fix the squeue mock, so it mocks the behavior we're hoping to test against, instead of a slurm failure mode that makes us (correctly) produce a bunch of debug logs.
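
The mock fix being suggested might look roughly like this (hypothetical names; the real test helpers differ): once the fake scancel removes a job, the fake squeue stops listing it, so the dispatcher no longer logs "still in squeue after scancel":

    package sketch // illustrative only

    import "sync"

    // fakeSlurm is a hypothetical test double for squeue/scancel.
    type fakeSlurm struct {
        mu    sync.Mutex
        queue map[string]bool // container UUID -> still listed by squeue
    }

    func (s *fakeSlurm) Submit(uuid string) {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.queue == nil {
            s.queue = map[string]bool{}
        }
        s.queue[uuid] = true
    }

    // Scancel removes the job, so subsequent Squeue calls stop reporting it.
    func (s *fakeSlurm) Scancel(uuid string) {
        s.mu.Lock()
        defer s.mu.Unlock()
        delete(s.queue, uuid)
    }

    func (s *fakeSlurm) Squeue() []string {
        s.mu.Lock()
        defer s.mu.Unlock()
        uuids := make([]string, 0, len(s.queue))
        for uuid := range s.queue {
            uuids = append(uuids, uuid)
        }
        return uuids
    }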

#9

Updated by Tom Clegg almost 7 years ago

oops, just noticed

log.Printf(text)

should be

log.Print(text)

Other than that, LGTM @ f8675ad473b45387b1286c6b7a41edf36148ebac

(as tempting as it is to ask for an error check on disp.Arv.Create() so we can log "error logging error: ...")
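
The distinction matters because log.Printf treats its first argument as a format string, so a slurm error containing a literal % gets mangled, whereas log.Print writes it verbatim. A minimal example (the error text here is made up):

    package main // illustrative only

    import "log"

    func main() {
        text := "sbatch: error: node memory at 99%"
        log.Printf(text) // the trailing % is misparsed as a format directive (output ends in %!(NOVERB))
        log.Print(text)  // prints the message verbatim
    }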

#10

Updated by Tom Clegg almost 7 years ago

I missed that the example logs above are from a real site, not the test suite. So the mystery is why scancel wasn't effective.

AFAICT crunch-dispatch-slurm was doing the right thing in the circumstances: keep hitting scancel, with a ≥1 second delay between attempts, until the job goes away.

In any case it seems like a separate topic from "propagate slurm errors to user", because at this point the container is already done and the requesting user probably has little interest in (and ability to fix) slurm problems that happen later. So if this scancel behavior needs attention, let's discuss it further on #11669.

#11

Updated by Peter Amstutz almost 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:a309f8c5b4842075d6c83f99a8f2a1e1016976f5.

