Bug #13142
opencrunch-dispatch-slurm uses a surprising number of file descriptors, and stops dispatching when it gets a "too many open files" error
Description
For about 3 minutes this morning (between 09:04:40 and 09:07:11), our crunch-dispatch-slurm reported 7565 "too many open files" errors in its log:
# grep 'arvados-master-ncucu crunch-dispatch-slurm' /var/log/syslog | grep 'too many' | wc -l
7565
# grep 'arvados-master-ncucu crunch-dispatch-slurm' /var/log/syslog | grep 'too many' | cut -d" " -f8- | sort | uniq -c | sort -rg | head -n 10
   5059 scancel: /usr/bin/scancel: open /dev/null: too many open files ("")
    472 scancel: /usr/bin/scancel: pipe2: too many open files ("")
     98 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: open /dev/null: too many open files ""
     21 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: pipe2: too many open files ""
      6 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: fork/exec /usr/bin/squeue: too many open files ""
      6 2 times: [ 2018/02/28 09:05:23 scancel: /usr/bin/scancel: open /dev/null: too many open files ("")]
      1 Error updating container ncucu-dz642-zwq2f3i5rj6r5wx to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zwq2f3i5rj6r5wx: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-zhsu1zqrsizyaxg to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zhsu1zqrsizyaxg: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-zeftzdg5kifnvcp to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zeftzdg5kifnvcp: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-yrv1f4husp85ivo to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-yrv1f4husp85ivo: dial tcp 10.101.0.179:443: socket: too many open files
The process has the default nofile limits: a soft limit of 1024 and a hard limit of 4096.
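For reference, this can be confirmed on the running process via /proc (the `pidof` lookup below assumes a single crunch-dispatch-slurm process; the output line shows the soft and hard limits):

# grep 'open files' /proc/$(pidof crunch-dispatch-slurm)/limits
Max open files            1024                 4096                 files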
We installed crunch-dispatch-slurm according to the instructions (http://doc.arvados.org/install/crunch2-slurm/install-dispatch.html) and it is being run by systemd using the unit installed by the apt package.
Version:
# dpkg -l | grep crunch
ii  crunch-dispatch-slurm  0.1.20171102145827.cc6f86f-1  amd64  Dispatch Crunch containers to a SLURM cluster
Between 09:07:11 and 17:17:27, crunch-dispatch-slurm continued to run and logged:
- 8594 "Done monitoring container <UUID>" lines.
- 37 "Error getting final container state for <UUID>: arvados API server error: Path not found (404: 404 Not Found) returned by arvados-api-ncucu.hgi.sanger.ac.uk" lines.
- 32 "Error updating container <UUID> to state "Cancelled": arvados API server error: State cannot change from Complete to Cancelled (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk" lines.
Between 17:17:27 and 19:27:46, crunch-dispatch-slurm logged nothing (I think nothing was running at that point).
At 19:27:46, I ran `systemctl restart crunch-dispatch-slurm`; it logged "crunch-dispatch-slurm dev started" and then proceeded to submit 662 batch jobs.
It appears that the end result of running into the "too many open files" errors was that crunch-dispatch-slurm continued to run but stopped processing any new work. It would be better if, on reaching such an errored state, it terminated so that systemd could restart it automatically.
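(Whether systemd would in fact restart it after an error exit depends on the unit's Restart= setting; I haven't checked what the packaged unit sets, but it can be inspected with something like:

# systemctl show crunch-dispatch-slurm -p Restart
)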
It would also be good if the systemd unit raised the nofile limit for crunch-dispatch-slurm to an appropriate level.
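One possible way to do that locally (the 65536 value is only an illustration; the right number presumably depends on how many containers and slurm commands can be in flight at once) is a drop-in created with `systemctl edit crunch-dispatch-slurm` containing:

[Service]
LimitNOFILE=65536

followed by `systemctl daemon-reload` and a restart of the service. Ideally the unit shipped in the package would set something like this by default.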