Bug #10700


[Crunch2] crunch-dispatch-slurm pileup

Added by Peter Amstutz over 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: -
Release:
Release relationship: Auto

Description

Crunch-dispatch-slurm got stuck in a loop, resubmitting a container that slurm could not fulfill (sbatch kept rejecting it with "Requested node configuration is not available"). The failure is not reported to the user, so the user has no way of knowing what the problem is.
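
For illustration, here is a minimal Go sketch of how a permanent sbatch rejection could be detected and surfaced to the user instead of being retried forever. The sbatch flags mirror the log below; submitOnce and cancelWithError are hypothetical stand-ins, not the real crunch-dispatch-slurm code.

package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// submitOnce runs sbatch once and returns slurm's stderr on failure.
// Hypothetical sketch; the flags mirror the log excerpt below.
func submitOnce(uuid string) error {
	cmd := exec.Command("sbatch", "--share", "--workdir=/tmp",
		"--job-name="+uuid, "--mem-per-cpu=100000", "--cpus-per-task=1")
	var stderr strings.Builder
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("sbatch: %v (stderr: %q)", err, stderr.String())
	}
	return nil
}

// cancelWithError stands in for an API call that would move the container
// to Cancelled and record the reason, so the user can see why it never ran.
func cancelWithError(uuid string, reason error) {
	log.Printf("cancelling %s: %v", uuid, reason)
}

func main() {
	uuid := "9tee4-dz642-5s5vdaeeu98qhbr"
	if err := submitOnce(uuid); err != nil {
		// Instead of looping on the same submission forever, treat this
		// slurm error as permanent and surface it to the user.
		if strings.Contains(err.Error(), "Requested node configuration is not available") {
			cancelWithError(uuid, err)
			return
		}
		log.Printf("transient submit error, will retry later: %v", err)
	}
}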

As a secondary issue, this condition also results in a resource leak of file descriptors. Once the process exhausts its descriptor limit, API requests, squeue polls, and further sbatch submissions all start failing with "too many open files", and the errors pile up as shown in the log below.
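
The dispatcher's own descriptor leak is tracked separately in #10701. Purely for context, here is a generic Go sketch, not the real code, of how a per-poll exec such as squeue can leak one descriptor per call when cmd.Wait is skipped on an error path, eventually producing errors like the ones in the log.

package main

import (
	"io"
	"log"
	"os/exec"
)

// runSqueue illustrates the leak pattern generically: the pipe returned by
// StdoutPipe is closed by Wait, so every code path after a successful Start
// must reach Wait, or one descriptor leaks per call.
func runSqueue() ([]byte, error) {
	cmd := exec.Command("squeue", "--all", "--format=%j")
	out, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err // e.g. "pipe2: too many open files" once already exhausted
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	buf, readErr := io.ReadAll(out)
	if readErr != nil {
		cmd.Wait() // reap the child and close the pipe even on the error path
		return nil, readErr
	}
	return buf, cmd.Wait()
}

func main() {
	if _, err := runSqueue(); err != nil {
		log.Printf("Error running squeue: %v", err)
	}
}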

Finally, once it is in this condition, a TERM signal stops the processing of containers, but the dispatcher deadlocks and does not shut down gracefully.
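
For reference, a generic Go sketch of the shutdown behavior one would expect on TERM: the signal cancels a context that every per-container monitoring goroutine selects on, so the main loop can drain and exit instead of deadlocking. The structure and names here are assumptions for illustration, not the dispatcher's actual implementation.

package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// Cancel the shared context on SIGTERM/SIGINT instead of blocking on
	// channels the monitoring goroutines may never read from again.
	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		s := <-sigc
		log.Printf("Caught signal: %v", s)
		cancel()
	}()

	var wg sync.WaitGroup
	for _, uuid := range []string{"container-a", "container-b"} { // hypothetical containers
		wg.Add(1)
		go func(uuid string) {
			defer wg.Done()
			ticker := time.NewTicker(time.Second)
			defer ticker.Stop()
			for {
				select {
				case <-ctx.Done():
					log.Printf("Monitoring container %s finished", uuid)
					return
				case <-ticker.C:
					// poll slurm / the API for this container here
				}
			}
		}(uuid)
	}

	wg.Wait() // returns promptly once every monitor has observed ctx.Done()
	log.Print("Stopping crunch-dispatch-slurm")
}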

2016-12-08_22:40:56.21879 2016/12/08 22:40:56 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr started
2016-12-08_22:40:56.28660 2016/12/08 22:40:56 Error submitting container 9tee4-dz642-5s5vdaeeu98qhbr to slurm: Container submission failed: [sbatch --share --workdir=/tmp --job-name=9tee4-dz642-5s5vdaeeu98qhbr --mem-per-cpu=100000 --cpus-per-task=1]: exit status 1 (stderr: "sbatch: error: Batch job submission failed: Requested node configuration is not available\n")
2016-12-08_22:40:56.59802 2016/12/08 22:40:56 About to submit queued container 9tee4-dz642-5s5vdaeeu98qhbr
2016-12-08_22:40:56.59806 2016/12/08 22:40:56 sbatch starting: ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-5s5vdaeeu98qhbr" "--mem-per-cpu=100000" "--cpus-per-task=1"]
2016-12-08_22:40:56.71585 2016/12/08 22:40:56 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr finished
2016-12-08_22:40:56.88625 2016/12/08 22:40:56 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr started
2016-12-08_22:40:57.04602 2016/12/08 22:40:57 About to submit queued container 9tee4-dz642-5s5vdaeeu98qhbr
2016-12-08_22:40:57.04605 2016/12/08 22:40:57 sbatch starting: ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-5s5vdaeeu98qhbr" "--mem-per-cpu=100000" "--cpus-per-task=1"]
2016-12-08_22:40:57.09960 2016/12/08 22:40:57 Error submitting container 9tee4-dz642-5s5vdaeeu98qhbr to slurm: Container submission failed: [sbatch --share --workdir=/tmp --job-name=9tee4-dz642-5s5vdaeeu98qhbr --mem-per-cpu=100000 --cpus-per-task=1]: exit status 1 (stderr: "sbatch: error: Batch job submission failed: Requested node configuration is not available\n")
2016-12-08_22:40:57.20721 2016/12/08 22:40:57 Error locking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/lock: dial tcp 10.100.32.5:443: socket: too many open files" 
2016-12-08_22:40:57.20723 2016/12/08 22:40:57 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr finished
2016-12-08_22:40:57.22284 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files" 
2016-12-08_22:40:57.48667 2016/12/08 22:40:57 Error running squeue: fork/exec /usr/bin/squeue: too many open files
2016-12-08_22:40:57.63356 2016/12/08 22:40:57 Error locking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/lock: dial tcp 10.100.32.5:443: socket: too many open files" 
2016-12-08_22:40:57.64150 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files" 
2016-12-08_22:40:57.92863 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files" 
2016-12-08_22:40:57.92868 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files
2016-12-08_22:40:57.92870 2016/12/08 22:40:57 Error submitting container 9tee4-dz642-5s5vdaeeu98qhbr to slurm: Container submission failed: [sbatch --share --workdir=/tmp --job-name=9tee4-dz642-5s5vdaeeu98qhbr --mem-per-cpu=100000 --cpus-per-task=1]: exit status 1 (stderr: "sbatch: error: Batch job submission failed: Requested node configuration is not available\n")
2016-12-08_22:40:57.92897 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files" 
2016-12-08_22:40:57.98657 2016/12/08 22:40:57 Error creating stderr pipe for squeue: pipe2: too many open files
2016-12-08_22:40:58.04552 2016/12/08 22:40:58 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com" 
2016-12-08_22:40:58.04554 2016/12/08 22:40:58 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com
.
.
.
2016-12-08_22:43:15.19059 2016/12/08 22:43:15 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com"
2016-12-08_22:43:15.87004 2016/12/08 22:43:15 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com"
2016-12-09_17:18:27.86690 2016/12/09 17:18:27 Caught signal: terminated
2016-12-09_19:40:16.28603 Stopping crunch-dispatch-slurm

Files

current (810 KB), Peter Amstutz, 12/09/2016 08:15 PM

Subtasks 1 (0 open, 1 closed)

Task #10991: Review 10700-dispatch (Resolved, Peter Amstutz, 01/27/2017)

Related issues

Related to Arvados - Bug #10701: [Crunch2] crunch-dispatch-slurm leaks file descriptors (Resolved, Tom Clegg, 01/31/2017)
Related to Arvados - Bug #10702: [Crunch2] crunch-dispatch-slurm buggy error handling (Resolved, Tom Clegg, 02/01/2017)
Related to Arvados - Bug #10703: [Crunch2] crunch-dispatch-slurm deadlocks instead of graceful shutdown (Resolved, Tom Clegg, 01/31/2017)
Related to Arvados - Bug #10704: [Crunch2] sbatch submit failures not reported to user, loop forever (Resolved, Tom Clegg, 01/31/2017)
Related to Arvados - Bug #10705: [Crunch2] [API] return a more specific 422 error message when a client calls containers#unlock without having the lock (New)
Related to Arvados - Bug #10729: [Crunch2] Propagate error messages if sbatch command succeeds but crunch-run can't run (or can't log to the Arvados API) (New)
