Bug #10700
[Crunch2] crunch-dispatch-slurm pileup
Status: Closed
Description
Crunch-dispatch-slurm got stuck in a loop with a job it couldn't fulfill. This is not reported to the user, so the user has no way of knowing what the problem is.
As a secondary issue, this condition also results in a resource leak of file descriptors. When this happens, things start to pile up.
Finally, when it gets into this condition, a TERM signal stops processing of containers but it deadlocks and does not shut down gracefully.
2016-12-08_22:40:56.21879 2016/12/08 22:40:56 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr started
2016-12-08_22:40:56.28660 2016/12/08 22:40:56 Error submitting container 9tee4-dz642-5s5vdaeeu98qhbr to slurm: Container submission failed: [sbatch --share --workdir=/tmp --job-name=9tee4-dz642-5s5vdaeeu98qhbr --mem-per-cpu=100000 --cpus-per-task=1]: exit status 1 (stderr: "sbatch: error: Batch job submission failed: Requested node configuration is not available\n")
2016-12-08_22:40:56.59802 2016/12/08 22:40:56 About to submit queued container 9tee4-dz642-5s5vdaeeu98qhbr
2016-12-08_22:40:56.59806 2016/12/08 22:40:56 sbatch starting: ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-5s5vdaeeu98qhbr" "--mem-per-cpu=100000" "--cpus-per-task=1"]
2016-12-08_22:40:56.71585 2016/12/08 22:40:56 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr finished
2016-12-08_22:40:56.88625 2016/12/08 22:40:56 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr started
2016-12-08_22:40:57.04602 2016/12/08 22:40:57 About to submit queued container 9tee4-dz642-5s5vdaeeu98qhbr
2016-12-08_22:40:57.04605 2016/12/08 22:40:57 sbatch starting: ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-5s5vdaeeu98qhbr" "--mem-per-cpu=100000" "--cpus-per-task=1"]
2016-12-08_22:40:57.09960 2016/12/08 22:40:57 Error submitting container 9tee4-dz642-5s5vdaeeu98qhbr to slurm: Container submission failed: [sbatch --share --workdir=/tmp --job-name=9tee4-dz642-5s5vdaeeu98qhbr --mem-per-cpu=100000 --cpus-per-task=1]: exit status 1 (stderr: "sbatch: error: Batch job submission failed: Requested node configuration is not available\n")
2016-12-08_22:40:57.20721 2016/12/08 22:40:57 Error locking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/lock: dial tcp 10.100.32.5:443: socket: too many open files"
2016-12-08_22:40:57.20723 2016/12/08 22:40:57 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr finished
2016-12-08_22:40:57.22284 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files"
2016-12-08_22:40:57.48667 2016/12/08 22:40:57 Error running squeue: fork/exec /usr/bin/squeue: too many open files
2016-12-08_22:40:57.63356 2016/12/08 22:40:57 Error locking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/lock: dial tcp 10.100.32.5:443: socket: too many open files"
2016-12-08_22:40:57.64150 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files"
2016-12-08_22:40:57.92863 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files"
2016-12-08_22:40:57.92868 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files
2016-12-08_22:40:57.92870 2016/12/08 22:40:57 Error submitting container 9tee4-dz642-5s5vdaeeu98qhbr to slurm: Container submission failed: [sbatch --share --workdir=/tmp --job-name=9tee4-dz642-5s5vdaeeu98qhbr --mem-per-cpu=100000 --cpus-per-task=1]: exit status 1 (stderr: "sbatch: error: Batch job submission failed: Requested node configuration is not available\n")
2016-12-08_22:40:57.92897 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files"
2016-12-08_22:40:57.98657 2016/12/08 22:40:57 Error creating stderr pipe for squeue: pipe2: too many open files
2016-12-08_22:40:58.04552 2016/12/08 22:40:58 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com"
2016-12-08_22:40:58.04554 2016/12/08 22:40:58 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com
...
2016-12-08_22:43:15.19059 2016/12/08 22:43:15 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com"
2016-12-08_22:43:15.87004 2016/12/08 22:43:15 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com"
2016-12-09_17:18:27.86690 2016/12/09 17:18:27 Caught signal: terminated
2016-12-09_19:40:16.28603 Stopping crunch-dispatch-slurm
Updated by Tom Morris about 8 years ago
- Assigned To set to Tom Morris
- Target version set to Arvados Future Sprints
Updated by Peter Amstutz about 8 years ago
- Target version changed from Arvados Future Sprints to 2017-01-04 sprint
Updated by Tom Morris almost 8 years ago
- Assigned To changed from Tom Morris to Peter Amstutz
Updated by Peter Amstutz almost 8 years ago
- Target version changed from 2017-01-04 sprint to 2017-01-18 sprint
Updated by Peter Amstutz almost 8 years ago
- Target version changed from 2017-01-18 sprint to 2017-02-01 sprint
Updated by Tom Clegg almost 8 years ago
- Assigned To changed from Peter Amstutz to Tom Clegg
Updated by Tom Clegg almost 8 years ago
- 10701 (leaking fds) → 318dd88 10701: Remove unneeded complexity in squeue invocation. (this is an improvement that might be related, but not convincing as a bug fix; the exec.Cmd pattern is sketched at the end of this comment)
- 10702 (buggy error handling) → e356309 10703: Do not catch signals in crunch-dispatch-slurm. Simplify "stop dispatcher loop" API. (simplifying the dispatcher loop fixes multiple concurrent attempts of the same container and resulting log confusion)
- 10703 (deadlock) → e356309 again (SIGTERM just exits immediately now)
- 10704 (sbatch loops too fast) → ae5eb12 10704: Rate-limit startup attempts per container.
- 10705 (better error message for "Unlock unlocked container") → not addressed yet
A couple of other cleanup things are included too, as separate commits.
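For illustration, here is a minimal sketch of the kind of exec.Cmd simplification 318dd88 is about. The helper name (submitBatchScript) is hypothetical, not the dispatcher's real API; the point is that letting exec.Cmd own stdin and stderr leaves no hand-built pipes open when sbatch fails.

package sketch

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// submitBatchScript is a hypothetical helper, not the dispatcher's real code.
// It feeds the batch script to sbatch on stdin and captures stderr in a
// buffer, so a failed submission leaves no file descriptors behind.
func submitBatchScript(script string, args ...string) error {
	cmd := exec.Command("sbatch", args...)
	cmd.Stdin = strings.NewReader(script)
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("sbatch %v: %v (stderr: %q)", args, err, stderr.String())
	}
	return nil
}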
Updated by Peter Amstutz almost 8 years ago
Tom Clegg wrote:
10700-dispatch @ e34a5060cfc1cc4821b431e8aa6778a31898e0eb
- 10701 (leaking fds) → 318dd88 10701: Remove unneeded complexity in squeue invocation. (this is an improvement that might be related, but not convincing as a bug fix)
Much nicer. I have to confess I haven't been writing enough Go code lately and have been missing some of the subtle tricks.
- 10702 (buggy error handling) → e356309 10703: Do not catch signals in crunch-dispatch-slurm. Simplify "stop dispatcher loop" API. (simplifying the dispatcher loop fixes multiple concurrent attempts of the same container and resulting log confusion)
- 10703 (deadlock) → e356309 again (SIGTERM just exits immediately now)
This seems to be the primary bug. It explains the flip-flopping between monitored / unmonitored and the resulting pile-up of redundant squeue invocations, which is consistent with my initial analysis.
- 10704 (sbatch loops too fast) → ae5eb12 10704: Rate-limit startup attempts per container.
IIRC the Go "Ticker" sends a heartbeat every N seconds so for N=10 and it takes 9 seconds to do something, you'll only get 1 second of sleep before getting another tick. So squeue could still get called twice in < PollPeriod. Suggest MinRetryPeriod should have the same default as PollPeriod, instead of being disabled by default.
- 10705 (better error message for "Unlock unlocked container") → not addressed yet
A couple of other cleanup things are included too, as separate commits.
683f537 → this is just an argument about code style. In the pursuit of reducing indentation, the conditional has been transformed from "do this if X is true" to "don't do this if X is not true", which is much harder to reason about. Maybe at least clarify it with comments?
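To make the Ticker point above concrete, here is a toy demonstration. The 10-second PollPeriod and the 9 seconds of simulated work are made-up example values, not dispatcher code; it just shows the standard-library behavior: ticks arrive on a fixed schedule, so slow work leaves only the remainder of the period before the next iteration.

package main

import (
	"log"
	"time"
)

func main() {
	const pollPeriod = 10 * time.Second // example value, not a real default
	ticker := time.NewTicker(pollPeriod)
	defer ticker.Stop()
	for range ticker.C {
		start := time.Now()
		time.Sleep(9 * time.Second) // stand-in for ~9s of real work
		// Only about 1 second remains before the next tick fires.
		log.Printf("work took %v; next tick in about %v",
			time.Since(start), pollPeriod-time.Since(start))
	}
}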
Updated by Tom Clegg almost 8 years ago
Peter Amstutz wrote:
This seems to be the primary bug. It explains the flip-flopping between monitored / unmonitored and the resulting pile-up of redundant squeue invocations, which is consistent with my initial analysis.
I don't remember seeing anything about redundant invocations. Is this written down somewhere? AFAICT the squeue goroutine can't run more than one squeue at a time.
I see I wrote "squeue" in the commit message about simplifying exec.Cmd usage, but "sbatch" was the thing I actually fixed. So... did you also really mean "sbatch" here?
- 10704 (sbatch loops too fast) → ae5eb12 10704: Rate-limit startup attempts per container.
IIRC the Go "Ticker" sends a heartbeat every N seconds so for N=10 and it takes 9 seconds to do something, you'll only get 1 second of sleep before getting another tick. So squeue could still get called twice in < PollPeriod. Suggest MinRetryPeriod should have the same default as PollPeriod, instead of being disabled by default.
MinRetryPeriod only affects how soon a container will be resubmitted with sbatch after disappearing from the slurm queue, and doesn't use a ticker.
I don't think it's a problem for squeue to be called twice in < PollPeriod -- are you seeing something I'm not?
I do see squeue.go could use some of the same exec.Cmd simplifying treatment noted above, though. I'll do that. (I don't see an FD leak there either, unfortunately.)
683f537 → this is just an argument about code style. In the pursuit of reducing indentation, the conditional has been transformed from "do this if X is true" to "don't do this if X is not true", which is much harder to reason about. Maybe at least clarify it with comments?
Fair point. Backed this out halfway -- now it's "if A && (B || C)" instead of two nested ifs.
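For reference, a toy version of the two shapes under discussion; A, B, C, and doThing are placeholders, not the actual identifiers in 683f537.

package sketch

// nested is the original "do this if X is true" shape.
func nested(A, B, C bool, doThing func()) {
	if A {
		if B || C {
			doThing()
		}
	}
}

// flattened is the compromise shape: same behavior, one condition.
func flattened(A, B, C bool, doThing func()) {
	if A && (B || C) {
		doThing()
	}
}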
Updated by Tom Clegg almost 8 years ago
Refactored the squeue stuff; (I think) it's much simpler now.
I didn't see any leaking FDs in the old code, but I figure it can't hurt to get rid of all the pipes and stuff.
10700-dispatch @ a19ffad966b25b3869e666f749f7c6da187bef68
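As a rough illustration of the direction of that refactor (the helper name and squeue flags here are assumptions, not the actual code on the branch): reading the queue through exec.Cmd's Output() leaves no pipe descriptors to clean up when squeue fails.

package sketch

import (
	"fmt"
	"os/exec"
	"strings"
)

// queuedJobNames is an illustrative helper: it returns the job names
// currently known to slurm, one per line of squeue output.
func queuedJobNames() ([]string, error) {
	out, err := exec.Command("squeue", "--all", "--format=%j", "--noheader").Output()
	if err != nil {
		return nil, fmt.Errorf("squeue: %v", err)
	}
	return strings.Fields(string(out)), nil
}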
Updated by Peter Amstutz almost 8 years ago
Tom Clegg wrote:
Peter Amstutz wrote:
This seems to be the primary bug. It explains the flip-flopping between monitored / unmonitored and the resulting pile-up of redundant squeue invocations, which is consistent with my initial analysis.
I don't remember seeing anything about redundant invocations. Is this written down somewhere? AFAICT the squeue goroutine can't run more than one squeue at a time.
I see I wrote "squeue" in the commit message about simplifying exec.Cmd usage, but "sbatch" was the thing I actually fixed. So... did you also really mean "sbatch" here?
Yes, I meant sbatch. I have to review the logs, but I believe it was getting into a situation of running overlapping sbatch commands for the same container, which would explain the FD leak if there were a bunch of stuck goroutines with pipes open.
- 10704 (sbatch loops too fast) → ae5eb12 10704: Rate-limit startup attempts per container.
IIRC the Go "Ticker" sends a heartbeat every N seconds so for N=10 and it takes 9 seconds to do something, you'll only get 1 second of sleep before getting another tick. So squeue could still get called twice in < PollPeriod. Suggest MinRetryPeriod should have the same default as PollPeriod, instead of being disabled by default.
MinRetryPeriod only affects how soon a container will be resubmitted with sbatch after disappearing from the slurm queue, and doesn't use a ticker.
Same confusion; I meant sbatch. If the intended default behavior is to rate-limit sbatch invocations to at least PollPeriod, then with throttling disabled the actual time between two invocations could be less than PollPeriod, because sbatch could run at the end of one period and again at the start of the next. Setting throttle == PollPeriod (instead of 0) ensures that PollPeriod is a floor instead of a ceiling.
Rest LGTM.
Updated by Tom Clegg almost 8 years ago
Peter Amstutz wrote:
Yes, I meant sbatch. I have to review the logs, but I believe it was getting into a situation of running overlapping sbatch commands for the same container, which would explain the FD leak if there were a bunch of stuck goroutines with pipes open.
OK. I looked into this on #10701. I didn't find that sbatch was getting stuck. It was possible for monitorSubmitOrCancel() to return before it even had a chance to check the slurm queue, though.
MinRetryPeriod only affects how soon a container will be resubmitted with sbatch after disappearing from the slurm queue, and doesn't use a ticker.
Same confusion; I meant sbatch. If the intended default behavior is to rate-limit sbatch invocations to at least PollPeriod, then with throttling disabled the actual time between two invocations could be less than PollPeriod, because sbatch could run at the end of one period and again at the start of the next. Setting throttle == PollPeriod (instead of 0) ensures that PollPeriod is a floor instead of a ceiling.
Ah.
The intended default behavior is merely to behave correctly, i.e., avoid overlapping sbatch calls. In the default case, MinRetryPeriod is 0 and there is no effort (or need, afaik) to enforce a minimum time between two consecutive attempts. PollPeriod just establishes a minimum average time between attempts.
If you do want a minimum interval between sbatch attempts, you can use MinRetryPeriod.
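A hedged sketch of that idea (the type and method names are assumptions for illustration, not the dispatcher's actual implementation): remember the last sbatch attempt per container and decline to resubmit until MinRetryPeriod has elapsed.

package sketch

import (
	"sync"
	"time"
)

// retryThrottle is an illustrative type, not the dispatcher's real one.
type retryThrottle struct {
	mu          sync.Mutex
	minInterval time.Duration        // e.g. the configured MinRetryPeriod
	lastAttempt map[string]time.Time // keyed by container UUID
}

// ok reports whether minInterval has elapsed since the last attempt for
// this container, and records the new attempt when it has.
func (t *retryThrottle) ok(uuid string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.lastAttempt == nil {
		t.lastAttempt = map[string]time.Time{}
	}
	if time.Since(t.lastAttempt[uuid]) < t.minInterval {
		return false
	}
	t.lastAttempt[uuid] = time.Now()
	return true
}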
Updated by Tom Clegg almost 8 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|commit:c3d4f8a585202ec58df5506934b698039c200b68.