Bug #10701 (closed)
[Crunch2] crunch-dispatch-slurm leaks file descriptors
Updated by Peter Amstutz almost 8 years ago
- Subject changed from [Crunch2] crunch-dispatch-slurm possibly leaks file descriptors to [Crunch2] crunch-dispatch-slurm leaks file descriptors
Updated by Peter Amstutz almost 8 years ago
- Target version set to 2017-01-04 sprint
Updated by Peter Amstutz almost 8 years ago
- Target version changed from 2017-01-04 sprint to 2017-01-18 sprint
Updated by Peter Amstutz almost 8 years ago
- Target version changed from 2017-01-18 sprint to 2017-02-01 sprint
Updated by Tom Clegg almost 8 years ago
- Assigned To changed from Peter Amstutz to Tom Clegg
Updated by Tom Clegg almost 8 years ago
Just before reaching max fds, server logs show many attempts to run sbatch. Some failed ("Error submitting container [...]: Container submission failed: [...]") but some seem to leave (or get stuck in?) submit() without saying anything after "sbatch starting":
2016-12-08_22:40:45.07549 2016/12/08 22:40:45 sbatch starting: ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-5s5vdaeeu98qhbr" "--mem-per-cpu=100000" "--cpus-per-task=1"]
2016-12-08_22:40:45.15559 2016/12/08 22:40:45 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr finished
There is a bunch of unnecessary pipe-wrangling in submit(), which I replaced with this:
cmd.Stdin = strings.NewReader(...)
var stdout, stderr bytes.Buffer
cmd.Stdout = &stdout
cmd.Stderr = &stderr
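For context, here is a minimal, hedged sketch of that pattern as a standalone program (my own reconstruction, not the actual crunch-dispatch-slurm code; the function name submitBatchScript is invented). With in-memory buffers instead of StdinPipe/StdoutPipe/StderrPipe, exec.Cmd creates and closes its own pipes inside Run(), so there are no extra pipe fds for the caller to forget to close:

package main

import (
    "bytes"
    "fmt"
    "os/exec"
    "strings"
)

// submitBatchScript runs sbatch with the given arguments, feeding it the
// batch script on stdin and capturing stdout/stderr in memory. Nothing is
// left open even if sbatch fails to start.
func submitBatchScript(args []string, script string) (string, error) {
    cmd := exec.Command("sbatch", args...)
    cmd.Stdin = strings.NewReader(script)
    var stdout, stderr bytes.Buffer
    cmd.Stdout = &stdout
    cmd.Stderr = &stderr
    if err := cmd.Run(); err != nil {
        return "", fmt.Errorf("sbatch failed: %v (stderr %q)", err, stderr.String())
    }
    return stdout.String(), nil
}

func main() {
    out, err := submitBatchScript([]string{"--share", "--workdir=/tmp"}, "#!/bin/sh\necho hello\n")
    fmt.Println(out, err)
}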
It only takes 100ms for c-d-s to make the transition shown in the log above (from "sbatch starting" to "Monitoring container ... finished"), which seems fishy given that pollPeriod is 500ms -- squeue should run only 2x per second, and checkSqueue() is always supposed to wait for the next squeue result.
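To illustrate what "wait for the next squeue result" means here, a hedged sketch (the type squeuePoller and its methods are invented stand-ins, not the real checkSqueue/SqueueChecker code) of a poller where callers block until the next poll completes, so they never act on a stale snapshot:

package main

import (
    "fmt"
    "os/exec"
    "strings"
    "sync"
    "time"
)

// squeuePoller runs squeue once per PollPeriod; HasUUID blocks until the
// *next* poll finishes.
type squeuePoller struct {
    PollPeriod time.Duration
    mtx        sync.Mutex
    cond       *sync.Cond
    uuids      map[string]bool
}

func newSqueuePoller(period time.Duration) *squeuePoller {
    sp := &squeuePoller{PollPeriod: period, uuids: map[string]bool{}}
    sp.cond = sync.NewCond(&sp.mtx)
    go sp.run()
    return sp
}

// HasUUID waits for the next squeue poll, then reports whether a job with
// the given name appeared in it.
func (sp *squeuePoller) HasUUID(uuid string) bool {
    sp.mtx.Lock()
    defer sp.mtx.Unlock()
    sp.cond.Wait() // returns after run() broadcasts the next snapshot
    return sp.uuids[uuid]
}

// runSqueue lists the job names currently known to slurm.
func runSqueue() (map[string]bool, error) {
    out, err := exec.Command("squeue", "--noheader", "--format=%j").Output()
    if err != nil {
        return nil, err
    }
    uuids := map[string]bool{}
    for _, name := range strings.Fields(string(out)) {
        uuids[name] = true
    }
    return uuids, nil
}

func (sp *squeuePoller) run() {
    for range time.Tick(sp.PollPeriod) {
        latest, err := runSqueue()
        sp.mtx.Lock()
        if err == nil {
            sp.uuids = latest
        }
        // Wake waiters even on a transient squeue error so HasUUID never
        // blocks for more than one period.
        sp.cond.Broadcast()
        sp.mtx.Unlock()
    }
}

func main() {
    sp := newSqueuePoller(500 * time.Millisecond)
    fmt.Println(sp.HasUUID("9tee4-dz642-5s5vdaeeu98qhbr"))
}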
There might be a race between run() and monitorSubmitOrCancel(), where run() sees the status channel close (and sets monitorDone=true) before monitorSubmitOrCancel() even has a chance to ask for the next squeue after running sbatch. dispatch.go might be shooting itself in the foot here:
if container.State == Queued && dispatcher.checkMine(container, false) {
// If we previously started the job, something failed, and it
// was re-queued, this dispatcher might still be monitoring it.
// Stop the existing monitor, then try to lock and run it
// again.
dispatcher.notMine(container.UUID)
}
pollContainers() calls getContainers() three times in quick succession: first with paramsQ ("queued and ready"), then with paramsP ("locked by me"), then a third time with the union of matching container UUIDs from the first two. If the third query runs before handleUpdate() succeeds in locking one of the paramsQ results, that generates another handleUpdate() with state=Queued, which would hit the above "notMine" case.
I've eliminated this race by calling handleUpdate() synchronously from getContainers() instead of using a channel.
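To make the shape of that change concrete, a hedged sketch (invented type and function names, not the real sdk/go/dispatch code) of the difference between queuing updates on a channel and handling them synchronously:

package dispatch

// Container and Dispatcher are minimal stand-ins for the sketch.
type Container struct {
    UUID  string
    State string
}

type Dispatcher struct {
    updates chan Container // old approach: drained by a separate goroutine
}

// Old shape: getContainers returns (and the next pollContainers query can
// run) before handleUpdate has even tried to lock the container.
func (d *Dispatcher) getContainersAsync(found []Container) {
    for _, c := range found {
        d.updates <- c
    }
}

// New shape: every update is fully handled before getContainers returns,
// so the third query in pollContainers sees the result of the lock attempt.
func (d *Dispatcher) getContainersSync(found []Container) {
    for _, c := range found {
        d.handleUpdate(c)
    }
}

func (d *Dispatcher) handleUpdate(c Container) {
    // lock/unlock/monitor decisions happen here
}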
Updated by Tom Clegg almost 8 years ago
Tried running the old version of submit() many times with a command that fails or can't exec, but that did not reproduce the file descriptor leak.
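For reference, one quick way to look for this kind of leak (my own throwaway harness, Linux-specific, not part of the branch) is to count entries in /proc/self/fd before and after hammering the failing path:

package main

import (
    "fmt"
    "os"
    "os/exec"
)

// openFDs counts this process's open file descriptors via /proc/self/fd.
func openFDs() int {
    entries, err := os.ReadDir("/proc/self/fd")
    if err != nil {
        panic(err)
    }
    return len(entries)
}

func main() {
    before := openFDs()
    for i := 0; i < 1000; i++ {
        // A command that cannot exec, similar to the failure mode tested above.
        _ = exec.Command("/nonexistent-sbatch").Run()
    }
    fmt.Printf("open fds before: %d, after: %d\n", before, openFDs())
}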
Updated by Tom Clegg almost 8 years ago
- Status changed from In Progress to Closed
Closing because there's a good chance it got fixed in #10700 by getting rid of a bunch of stdin/stdout/stderr pipes.
Updated by Tom Clegg almost 8 years ago
- Status changed from Closed to In Progress
- Target version changed from 2017-02-01 sprint to 2017-02-15 sprint
Updated by Tom Clegg almost 8 years ago
- sdk/go/dispatch assumes the dispatcher's RunContainer method is done as soon as the channel closes, while in fact multiple goroutines in crunch-dispatch-slurm might still be doing work.
- after trying dispatcher.Unlock(), monitorSubmitOrCancel waits in sqCheck.HasUUID() for ~PollPeriod. During that interval, the main dispatcher loop is likely to see the container reappear in the queue, lock it, and start another goroutine to monitor it.
- sdk/go/dispatch should wait for RunContainer to return before deleting the container from mineMap.
- crunch-dispatch-slurm should not return from run (RunContainer) until monitorSubmitOrCancel has also finished.
- simplify crunch-dispatch-slurm by combining the two per-container event loops into one goroutine (see the sketch below).
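A hedged sketch of that last point, with invented names and channel shapes (not the actual refactor): a single per-container goroutine selects over dispatcher status updates and a periodic slurm check, so there is no separate monitor goroutine left to race with.

package dispatcher

import "time"

type Container struct {
    UUID  string
    State string
}

// runContainer is the single per-container loop: it reacts to dispatcher
// status updates and periodically reconciles against slurm, and nothing
// else runs for this container after it returns.
func runContainer(uuid string, status <-chan Container,
    inSqueue func(string) bool, submit func(string) error, cancel func(string)) {
    ticker := time.NewTicker(500 * time.Millisecond) // stand-in for PollPeriod
    defer ticker.Stop()
    for {
        select {
        case c, ok := <-status:
            if !ok {
                // Dispatcher stopped tracking this container.
                return
            }
            if c.State == "Cancelled" {
                cancel(uuid)
            }
        case <-ticker.C:
            // One place to notice the job is missing from squeue and decide
            // whether to (re)submit or give up.
            if !inSqueue(uuid) {
                _ = submit(uuid)
            }
        }
    }
}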
Updated by Tom Clegg almost 8 years ago
10701-refactor-dispatch @ a2fe6f9367de3ee93064fbee3f2df78ce84aa318
Updated by Radhika Chippada almost 8 years ago
In dispatch.go:
- can we rename Dispatcher.running as “trackers” ?
- In runningUUIDs, this comment is confusing: // API bug: ["uuid", "not in", []] does not match everything ==> The issue here is that no trackers are found? Maybe you can instead say that since ["uuid", "not in", []] does not work, use a dummy, or something. Also, instead of "X" can you use "dummy" or something? Also, do we want to log the fact that there were no trackers?
- In Run method for loop, can we get runningUUIDs() once and reuse?
- tracker, running := d.running[c.UUID] ==> can we please rename "running" as "ok" here? With all the other "running" things, this was quite confusing.
Updated by Tom Clegg almost 8 years ago
Radhika Chippada wrote:
- can we rename Dispatcher.running as “trackers” ?
Good call -- done.
- In runningUUIDs, this comment is confusing: // API bug: ["uuid", "not in", []] does not match everything ==> The issue here is that no trackers are found? Maybe you can instead say that since ["uuid", "not in", []] does not work, use a dummy, or something. Also, instead of "X" can you use "dummy" or something? Also, do we want to log the fact that there were no trackers?
Made the comment a bit less enigmatic, and changed "X" to "this-uuid-does-not-exist". len(trackers) == 0 just means nothing is running right now, which is normal, so I don't think we need to log it.
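For illustration, a hedged guess at what that ends up looking like (struct and field names are assumptions, not the actual sdk/go/dispatch code):

package dispatch

import "sync"

type tracker struct{}

type Dispatcher struct {
    mtx      sync.Mutex
    trackers map[string]*tracker
}

// runningUUIDs returns the UUIDs of containers this dispatcher is tracking.
// API bug: ["uuid", "not in", []] does not match everything, so when nothing
// is being tracked (which is normal), return a placeholder UUID that no real
// container can have instead of an empty list.
func (d *Dispatcher) runningUUIDs() []string {
    d.mtx.Lock()
    defer d.mtx.Unlock()
    if len(d.trackers) == 0 {
        return []string{"this-uuid-does-not-exist"}
    }
    uuids := []string{}
    for uuid := range d.trackers {
        uuids = append(uuids, uuid)
    }
    return uuids
}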
- In Run method for loop, can we get runningUUIDs() once and reuse?
Sure thing. I thought it would be easier to follow if we used the most up-to-date list each time, but I don't suppose it matters much.
- tracker, running := d.running[c.UUID] ==> can we please rename “running” as “ok” here? With all the other "running" things, this was quite confusing.
Renamed to "alreadyTracking".
Updated by Tom Clegg almost 8 years ago
- Status changed from In Progress to Resolved
- % Done changed from 50 to 100
Applied in changeset arvados|commit:c86cbaa6f286e50900dae3203a42044449e042f7.