Bug #10701


[Crunch2] crunch-dispatch-slurm leaks file descriptors

Added by Peter Amstutz almost 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release:
Release relationship: Auto

Subtasks 2 (0 open, 2 closed)

Task #10741: Review (Resolved, Peter Amstutz, 01/31/2017)
Task #11086: Review 10701-refactor-dispatch (Resolved, Tom Clegg, 01/31/2017)

Related issues

Related to Arvados - Bug #10700: [Crunch2] crunch-dispatch-slurm pileup (Resolved, Tom Clegg, 01/27/2017)
Actions #1

Updated by Peter Amstutz almost 8 years ago

  • Subject changed from [Crunch2] crunch-dispatch-slurm possibly leaks file descriptors to [Crunch2] crunch-dispatch-slurm leaks file descriptors
Actions #2

Updated by Peter Amstutz almost 8 years ago

  • Target version set to 2017-01-04 sprint
Actions #3

Updated by Peter Amstutz almost 8 years ago

  • Assigned To set to Peter Amstutz
Actions #4

Updated by Peter Amstutz almost 8 years ago

  • Target version changed from 2017-01-04 sprint to 2017-01-18 sprint
Actions #5

Updated by Peter Amstutz almost 8 years ago

  • Target version changed from 2017-01-18 sprint to 2017-02-01 sprint
Actions #6

Updated by Tom Clegg almost 8 years ago

  • Assigned To changed from Peter Amstutz to Tom Clegg
Actions #7

Updated by Tom Clegg almost 8 years ago

Just before reaching max fds, server logs show many attempts to run sbatch. Some failed ("Error submitting container [...]: Container submission failed: [...]") but some seem to leave (or get stuck in?) submit() without saying anything after "sbatch starting":

2016-12-08_22:40:45.07549 2016/12/08 22:40:45 sbatch starting: ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-5s5vdaeeu98qhbr" "--mem-per-cpu=100000" "--cpus-per-task=1"]
2016-12-08_22:40:45.15559 2016/12/08 22:40:45 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr finished

There is a bunch of unnecessary pipe-wrangling in submit(), which I replaced with this:

    cmd.Stdin = strings.NewReader(...)

    var stdout, stderr bytes.Buffer
    cmd.Stdout = &stdout
    cmd.Stderr = &stderr
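
For reference, here is a minimal, self-contained sketch of the same buffer-based pattern (the command arguments and script contents are placeholders, not the real submit() code): letting exec.Cmd write into bytes.Buffer values means the package opens and closes the underlying pipes itself, so there are no StdinPipe/StdoutPipe/StderrPipe descriptors for the caller to forget to close.

    package main

    import (
        "bytes"
        "log"
        "os/exec"
        "strings"
    )

    func main() {
        // Placeholder sbatch arguments, for illustration only.
        cmd := exec.Command("sbatch", "--share", "--workdir=/tmp",
            "--job-name=example", "--mem-per-cpu=100000", "--cpus-per-task=1")

        // Feed the batch script on stdin and capture output in buffers.
        // exec.Cmd creates and closes the pipes internally during Run(),
        // so no file descriptors are left dangling on error paths.
        cmd.Stdin = strings.NewReader("#!/bin/sh\n# batch script body (placeholder)\n")
        var stdout, stderr bytes.Buffer
        cmd.Stdout = &stdout
        cmd.Stderr = &stderr

        if err := cmd.Run(); err != nil {
            log.Printf("sbatch failed: %v (stderr: %q)", err, stderr.String())
            return
        }
        log.Printf("sbatch succeeded: %q", stdout.String())
    }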

It only takes ~100ms for c-d-s to make the transition shown above, which seems fishy given that pollPeriod is 500ms -- squeue should run only 2x per second, and checkSqueue() is always supposed to wait for the next squeue result.

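For context, "wait for the next squeue result" means a caller should block until a poll that completes after the call returns, so any change it just made is reflected before it acts again. A minimal sketch of that idea (hypothetical names, not the actual checkSqueue() code):

    package main

    import (
        "sync"
        "time"
    )

    // pollNotifier runs poll() every period and lets callers block until the
    // next poll has completed. This is a sketch of the idea only.
    type pollNotifier struct {
        mtx  sync.Mutex
        cond *sync.Cond
        gen  int // incremented after each completed poll
    }

    func newPollNotifier(period time.Duration, poll func()) *pollNotifier {
        pn := &pollNotifier{}
        pn.cond = sync.NewCond(&pn.mtx)
        go func() {
            for range time.Tick(period) {
                poll() // e.g. run squeue and cache the result
                pn.mtx.Lock()
                pn.gen++
                pn.cond.Broadcast()
                pn.mtx.Unlock()
            }
        }()
        return pn
    }

    // waitNext blocks until at least one more poll has completed, which can
    // take up to roughly one full period.
    func (pn *pollNotifier) waitNext() {
        pn.mtx.Lock()
        defer pn.mtx.Unlock()
        start := pn.gen
        for pn.gen == start {
            pn.cond.Wait()
        }
    }

    func main() {
        pn := newPollNotifier(500*time.Millisecond, func() { /* squeue here */ })
        pn.waitNext() // with a 500ms period, returning in ~100ms would be surprising
    }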

There might be an opportunity for a race between run() and monitorSubmitOrCancel(), where run() sees the status channel close (and sets monitorDone=true) before monitorSubmitOrCancel() even has a chance to ask for the next squeue after running sbatch. dispatch.go might be shooting itself in the foot here:

    if container.State == Queued && dispatcher.checkMine(container, false) {
        // If we previously started the job, something failed, and it
        // was re-queued, this dispatcher might still be monitoring it.
        // Stop the existing monitor, then try to lock and run it
        // again.
        dispatcher.notMine(container.UUID)
    }

pollContainers() calls getContainers() three times in quick succession: first with paramsQ ("queued and ready"), then with paramsP ("locked by me"), then a third time with the union of matching container UUIDs from the first two. If the third query runs before handleUpdate() succeeds in locking one of the paramsQ results, that generates another handleUpdate() with state=Queued, which would hit the above "notMine" case.

I've eliminated this race by calling handleUpdate() synchronously from getContainers() instead of using a channel.
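
As a rough illustration of the shape of that fix (the names below are stand-ins, not the actual dispatch.go code): the handler runs inline while iterating one query's results, so every result is fully processed before the next query is issued.

    package main

    import "log"

    // container is a stand-in for the dispatcher's container record.
    type container struct {
        UUID  string
        State string
    }

    // handleUpdate stands in for the real handler that locks, runs, or
    // stops monitoring a container.
    func handleUpdate(c container) {
        log.Printf("handling %s (%s)", c.UUID, c.State)
    }

    // getContainers sketches the change: instead of sending each result to
    // a channel drained by another goroutine (which lets a later query see
    // a container before the earlier update has been handled), the handler
    // is called synchronously.
    func getContainers(results []container) {
        for _, c := range results {
            handleUpdate(c) // synchronous: finished before the next query runs
        }
    }

    func main() {
        getContainers([]container{{UUID: "zzzzz-dz642-xxxxxxxxxxxxxxx", State: "Queued"}})
    }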

Actions #8

Updated by Tom Clegg almost 8 years ago

Tried running the old version of submit() many times with a command that fails or can't exec, but this doesn't reproduce the file descriptor leak.

Actions #9

Updated by Tom Clegg almost 8 years ago

  • Status changed from New to In Progress
Actions #10

Updated by Tom Clegg almost 8 years ago

  • Status changed from In Progress to Closed

Closing because there's a good chance it got fixed in #10700 by getting rid of a bunch of stdin/stdout/stderr pipes.

Actions #11

Updated by Tom Clegg almost 8 years ago

  • Status changed from Closed to In Progress
  • Target version changed from 2017-02-01 sprint to 2017-02-15 sprint
Actions #12

Updated by Tom Clegg almost 8 years ago

Problems:
  • sdk/go/dispatch assumes the dispatcher's RunContainer method is done as soon as the channel closes, while in fact multiple goroutines in crunch-dispatch-slurm might still be doing work.
  • after trying dispatcher.Unlock(), monitorSubmitOrCancel waits in sqCheck.HasUUID() for ~PollPeriod. During that interval, the main dispatcher loop is likely to see the container reappear in the queue, lock it, and start another goroutine to monitor it.
Proposed fixes:
  • sdk/go/dispatch should wait for RunContainer to return before deleting the container from mineMap.
  • crunch-dispatch-slurm should not return from run (RunContainer) until monitorSubmitOrCancel has also finished (see the sketch after this list).
  • simplify crunch-dispatch-slurm by combining the two per-container event loops into one goroutine.
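
A minimal sketch of the first two fixes combined (hypothetical names; the real API lives in sdk/go/dispatch): RunContainer does not return until its monitoring goroutine is done, and the tracking entry is removed only after RunContainer has returned, so a re-queued container cannot be picked up by a second monitor while the first is still winding down.

    package main

    import "sync"

    // tracker holds the state-update channel for one tracked container.
    type tracker struct {
        updates chan string
    }

    type dispatcher struct {
        mtx      sync.Mutex
        trackers map[string]*tracker
    }

    // runContainer stands in for the RunContainer hook. It waits for its
    // monitoring goroutine (the monitorSubmitOrCancel role) before returning.
    func runContainer(t *tracker) {
        var wg sync.WaitGroup
        wg.Add(1)
        go func() {
            defer wg.Done()
            for range t.updates {
                // react to container state changes here
            }
        }()
        wg.Wait() // don't return while the monitor is still working
    }

    // track starts RunContainer for a container and removes the tracking
    // entry only after RunContainer has returned.
    func (d *dispatcher) track(uuid string) *tracker {
        t := &tracker{updates: make(chan string, 1)}
        d.mtx.Lock()
        d.trackers[uuid] = t
        d.mtx.Unlock()

        go func() {
            runContainer(t)
            d.mtx.Lock()
            delete(d.trackers, uuid) // only once RunContainer is really done
            d.mtx.Unlock()
        }()
        return t
    }

    func main() {
        d := &dispatcher{trackers: map[string]*tracker{}}
        t := d.track("zzzzz-dz642-xxxxxxxxxxxxxxx")
        close(t.updates) // simulate the container leaving the queue
    }
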
Actions #13

Updated by Tom Clegg almost 8 years ago

Actions #14

Updated by Radhika Chippada almost 8 years ago

In dispatch.go:

  • can we rename Dispatcher.running to "trackers"?
  • In runningUUIDs, this comment is confusing: // API bug: ["uuid", "not in", []] does not match everything ==> The issue here is that no trackers are found? Maybe you can instead say that since ["uuid", "not in", []] does not work, use a dummy or something. Also, instead of "X" can you use "dummy" or something? Also, do we want to log the fact that there were no trackers?
  • In Run method for loop, can we get runningUUIDs() once and reuse?
  • tracker, running := d.running[c.UUID] ==> can we please rename "running" to "ok" here? With all the other "running" things, this was quite confusing.
Actions #15

Updated by Tom Clegg almost 8 years ago

Radhika Chippada wrote:

  • can we rename Dispatcher.running to "trackers"?

Good call -- done.

  • In runningUUIDs, this comment is confusing: // API bug: ["uuid", "not in", []] does not match everything ==> The issue here is that no trackers are found? Maybe you can instead say that since ["uuid", "not in", []] does not work, use a dummy or something. Also, instead of "X" can you use "dummy" or something? Also, do we want to log the fact that there were no trackers?

Made the comment a bit less enigmatic, and changed "X" to "this-uuid-does-not-exist". len(trackers)==0 just means nothing is running right now, which is normal, so I don't think we need to log it.

  • In Run method for loop, can we get runningUUIDs() once and reuse?

Sure thing. I thought it would be easier to follow if we used the most up-to-date list each time, but I don't suppose it matters much.

  • tracker, running := d.running[c.UUID] ==> can we please rename "running" to "ok" here? With all the other "running" things, this was quite confusing.

Renamed to "alreadyTracking".

Actions #16

Updated by Radhika Chippada almost 8 years ago

LGTM. Thanks.

Actions #17

Updated by Tom Clegg almost 8 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 50 to 100

Applied in changeset arvados|commit:c86cbaa6f286e50900dae3203a42044449e042f7.
