Bug #10700

Updated by Peter Amstutz over 7 years ago

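crunch-dispatch-slurm got stuck in a loop resubmitting a container whose resource request Slurm could not fulfill (sbatch fails with "Requested node configuration is not available"). The failure is not reported to the user, so the user has no way of knowing what the problem is.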
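
One possible way to break the loop, sketched here under assumptions rather than taken from the crunch-dispatch-slurm source: treat certain sbatch errors as permanent, stop resubmitting, and surface the reason to the user. The helper names (@isPermanentSbatchError@, @userMessage@) and the list of permanent errors are hypothetical; only the stderr text comes from the log in this report.

```go
package main

import (
	"fmt"
	"strings"
)

// permanentSbatchErrors lists stderr fragments that, by assumption, mean
// resubmitting the same request cannot succeed. The single entry is the
// message seen in this report's log.
var permanentSbatchErrors = []string{
	"Requested node configuration is not available",
}

// isPermanentSbatchError reports whether sbatch's stderr indicates a
// request that no retry will satisfy.
func isPermanentSbatchError(stderr string) bool {
	for _, frag := range permanentSbatchErrors {
		if strings.Contains(stderr, frag) {
			return true
		}
	}
	return false
}

// userMessage builds the text a dispatcher could attach to the container
// so the user can see why it never ran.
func userMessage(uuid, stderr string) string {
	return fmt.Sprintf("container %s cannot be scheduled: %s", uuid, strings.TrimSpace(stderr))
}

func main() {
	uuid := "9tee4-dz642-5s5vdaeeu98qhbr"
	stderr := "sbatch: error: Batch job submission failed: Requested node configuration is not available\n"
	if isPermanentSbatchError(stderr) {
		// A real dispatcher would cancel the container here and record the message.
		fmt.Println(userMessage(uuid, stderr))
	} else {
		fmt.Println("transient failure, will retry")
	}
}
```

Matching on stderr text is fragile; it is used here only because the log shows nothing else to key on.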

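 As a secondary issue, the same condition leaks file descriptors. Once the process reaches its open-file limit, further operations fail with "too many open files": the container lock and unlock API calls cannot open sockets, and squeue can no longer be run.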

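 Finally, once it is in this condition, a TERM signal stops the processing of containers, but the process then deadlocks instead of shutting down gracefully. In the log below, the signal is caught at 17:18:27, but "Stopping crunch-dispatch-slurm" does not appear until 19:40:16.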

 <pre> 
 2016-12-08_22:40:56.21879 2016/12/08 22:40:56 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr started 
 2016-12-08_22:40:56.28660 2016/12/08 22:40:56 Error submitting container 9tee4-dz642-5s5vdaeeu98qhbr to slurm: Container submission failed: [sbatch --share --workdir=/tmp --job-name=9tee4-dz642-5s5vdaeeu98qhbr --mem-per-cpu=100000 --cpus-per-task=1]: exit status 1 (stderr: "sbatch: error: Batch job submission failed: Requested node configuration is not available\n")
 2016-12-08_22:40:56.59802 2016/12/08 22:40:56 About to submit queued container 9tee4-dz642-5s5vdaeeu98qhbr 
 2016-12-08_22:40:56.59806 2016/12/08 22:40:56 sbatch starting: ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-5s5vdaeeu98qhbr" "--mem-per-cpu=100000" "--cpus-per-task=1"]
 2016-12-08_22:40:56.71585 2016/12/08 22:40:56 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr finished 
 2016-12-08_22:40:56.88625 2016/12/08 22:40:56 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr started 
 2016-12-08_22:40:57.04602 2016/12/08 22:40:57 About to submit queued container 9tee4-dz642-5s5vdaeeu98qhbr 
 2016-12-08_22:40:57.04605 2016/12/08 22:40:57 sbatch starting: ["sbatch" "--share" "--workdir=/tmp" "--job-name=9tee4-dz642-5s5vdaeeu98qhbr" "--mem-per-cpu=100000" "--cpus-per-task=1"]
 2016-12-08_22:40:57.09960 2016/12/08 22:40:57 Error submitting container 9tee4-dz642-5s5vdaeeu98qhbr to slurm: Container submission failed: [sbatch --share --workdir=/tmp --job-name=9tee4-dz642-5s5vdaeeu98qhbr --mem-per-cpu=100000 --cpus-per-task=1]: exit status 1 (stderr: "sbatch: error: Batch job submission failed: Requested node configuration is not available\n")
 2016-12-08_22:40:57.20721 2016/12/08 22:40:57 Error locking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/lock: dial tcp 10.100.32.5:443: socket: too many open files" 
 2016-12-08_22:40:57.20723 2016/12/08 22:40:57 Monitoring container 9tee4-dz642-5s5vdaeeu98qhbr finished 
 2016-12-08_22:40:57.22284 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files" 
 2016-12-08_22:40:57.48667 2016/12/08 22:40:57 Error running squeue: fork/exec /usr/bin/squeue: too many open files 
 2016-12-08_22:40:57.63356 2016/12/08 22:40:57 Error locking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/lock: dial tcp 10.100.32.5:443: socket: too many open files" 
 2016-12-08_22:40:57.64150 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files" 
 2016-12-08_22:40:57.92863 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files" 
 2016-12-08_22:40:57.92868 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files 
 2016-12-08_22:40:57.92870 2016/12/08 22:40:57 Error submitting container 9tee4-dz642-5s5vdaeeu98qhbr to slurm: Container submission failed: [sbatch --share --workdir=/tmp --job-name=9tee4-dz642-5s5vdaeeu98qhbr --mem-per-cpu=100000 --cpus-per-task=1]: exit status 1 (stderr: "sbatch: error: Batch job submission failed: Requested node configuration is not available\n") 
 2016-12-08_22:40:57.92897 2016/12/08 22:40:57 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "Post https://9tee4.arvadosapi.com/arvados/v1/containers/9tee4-dz642-5s5vdaeeu98qhbr/unlock: dial tcp 10.100.32.5:443: socket: too many open files" 
 2016-12-08_22:40:57.98657 2016/12/08 22:40:57 Error creating stderr pipe for squeue: pipe2: too many open files 
 2016-12-08_22:40:58.04552 2016/12/08 22:40:58 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com" 
 2016-12-08_22:40:58.04554 2016/12/08 22:40:58 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com 
 . 
 . 
 . 
 2016-12-08_22:43:15.19059 2016/12/08 22:43:15 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com"
 2016-12-08_22:43:15.87004 2016/12/08 22:43:15 Error unlocking container 9tee4-dz642-5s5vdaeeu98qhbr: "arvados API server error: #<ArvadosModel::InvalidStateTransitionError: ArvadosModel::InvalidStateTransitionError> (422: 422 Unprocessable Entity) returned by 9tee4.arvadosapi.com"
 2016-12-09_17:18:27.86690 2016/12/09 17:18:27 Caught signal: terminated 
 2016-12-09_19:40:16.28603 Stopping crunch-dispatch-slurm 
 </pre> 
