Bug #13142
opencrunch-dispatch-slurm uses a surprising number of file descriptors, and stops dispatching when it gets a "too many open files" error
Description
For about 3 minutes this morning (between 09:04:40 and 09:07:11), our crunch-dispatch-slurm reported 7565 "too many open files" errors in its log:
# grep 'arvados-master-ncucu crunch-dispatch-slurm' /var/log/syslog | grep 'too many' | wc -l
7565
# grep 'arvados-master-ncucu crunch-dispatch-slurm' /var/log/syslog | grep 'too many' | cut -d" " -f8- | sort | uniq -c | sort -rg | head -n 10
   5059 scancel: /usr/bin/scancel: open /dev/null: too many open files ("")
    472 scancel: /usr/bin/scancel: pipe2: too many open files ("")
     98 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: open /dev/null: too many open files ""
     21 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: pipe2: too many open files ""
      6 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: fork/exec /usr/bin/squeue: too many open files ""
      6 2 times: [ 2018/02/28 09:05:23 scancel: /usr/bin/scancel: open /dev/null: too many open files ("")]
      1 Error updating container ncucu-dz642-zwq2f3i5rj6r5wx to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zwq2f3i5rj6r5wx: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-zhsu1zqrsizyaxg to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zhsu1zqrsizyaxg: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-zeftzdg5kifnvcp to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zeftzdg5kifnvcp: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-yrv1f4husp85ivo to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-yrv1f4husp85ivo: dial tcp 10.101.0.179:443: socket: too many open files
The process has the default nofile limits: a soft limit of 1024 and a hard limit of 4096.
We installed crunch-dispatch-slurm according to the instructions (http://doc.arvados.org/install/crunch2-slurm/install-dispatch.html) and it is being run by systemd using the unit installed by the apt package.
Version:
# dpkg -l |grep crunch
ii  crunch-dispatch-slurm    0.1.20171102145827.cc6f86f-1    amd64    Dispatch Crunch containers to a SLURM cluster
Between 09:07:11 and 17:17:27, crunch-dispatch-slurm continued to run and logged:
- 8594 "Done monitoring container <UUID>" lines.
- 37 "Error getting final container state for <UUID>: arvados API server error: Path not found (404: 404 Not Found) returned by arvados-api-ncucu.hgi.sanger.ac.uk" lines.
- 32 "Error updating container <UUID> to state "Cancelled": arvados API server error: State cannot change from Complete to Cancelled (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk" lines.
Between 17:17:27 and 19:27:46, crunch-dispatch-slurm logged nothing (I think nothing was running at that point).
At 19:27:46, I ran `systemctl restart crunch-dispatch-slurm` and it logged "crunch-dispatch-slurm dev started" and it then proceeded to submit 662 batch jobs.
It appears that the end result of running into the "too many open files" errors was that crunch-dispatch-slurm continued to run but no longer processed any new work. It would be better if, when it gets into an errored state like this, it terminated so that systemd could restart it automatically.
It would also be good if the systemd unit raised the crunch-dispatch-slurm nofile limit to an appropriate level.
Updated by Joshua Randall almost 7 years ago
I have manually raised max open files by creating a systemd drop-in:
# printf 'a\n[Service]\nLimitNOFILE=1000000\n.\nw\nq' | EDITOR=ed systemctl edit crunch-dispatch-slurm
# systemctl restart crunch-dispatch-slurm
# cat /proc/$(ps auxwww |grep crunch-dispatch-slurm | grep -v grep | awk '{print $2}')/limits | grep files
Max open files            1000000              1000000              files
Updated by Tom Clegg over 6 years ago
We should check for leaking FDs. We might be relying on garbage collection to clean up abandoned exec.Command fds.
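To illustrate the concern, here is a minimal Go sketch (not the dispatcher's actual code): a child started with exec.Command whose Wait is never reached keeps its pipe file descriptors open in the parent until the runtime's finalizers eventually close them, whereas reaching Wait (or using a helper such as Output that calls it) releases them deterministically. The squeue invocation is just a stand-in.

package main

import (
	"log"
	"os/exec"
)

// leakyInvoke starts squeue but never reaches Wait: the child is never reaped
// and the parent's end of the stdout pipe stays open until the garbage
// collector finalizes it, so repeated calls can pile up file descriptors.
func leakyInvoke() {
	cmd := exec.Command("squeue", "--all")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Print(err)
		return
	}
	if err := cmd.Start(); err != nil {
		log.Print(err)
		return
	}
	_ = stdout // read a little, then drop the reference without calling Wait
}

// safeInvoke uses Output, which calls Wait internally, so the pipe FDs are
// closed as soon as the command finishes.
func safeInvoke() {
	out, err := exec.Command("squeue", "--all").Output()
	if err != nil {
		log.Printf("squeue: %v", err)
		return
	}
	log.Printf("squeue returned %d bytes", len(out))
}

func main() {
	leakyInvoke()
	safeInvoke()
}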
Updated by Tom Clegg over 6 years ago
- we call Wait (or something that calls Wait) every time we use exec.Command
- we use one arvados client object for all api calls
- a dispatcher that's been running jobs all day has 4 FDs open
...and we only run one scancel at a time, so I'm guessing this happens during a busy time (lots of sbatch invocations).
I suppose we could rate-limit sbatch (a rough sketch of that idea is below the branch reference), but either way 1024 is a silly limit for a program like this, so:
13142-dispatch-nfiles @ 580949047505722b11956f09177a8537a1fef015
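For reference, a minimal sketch of the rate-limiting idea mentioned above, purely as an illustration and not the approach taken in the branch: a buffered channel used as a counting semaphore caps how many sbatch processes (and therefore how many sets of pipe FDs) are in flight at once. The slot count of 8 and the sbatch arguments are made up for this example.

package main

import (
	"log"
	"os/exec"
)

// sbatchSlots is a counting semaphore: at most 8 slurm commands (and their
// pipe FDs) may be in flight at once. The limit of 8 is arbitrary here.
var sbatchSlots = make(chan struct{}, 8)

// runSbatch runs one sbatch invocation, blocking while too many are in flight.
// The arguments are placeholders for illustration only.
func runSbatch(args ...string) error {
	sbatchSlots <- struct{}{}        // acquire a slot
	defer func() { <-sbatchSlots }() // release it when done

	out, err := exec.Command("sbatch", args...).CombinedOutput()
	if err != nil {
		return err
	}
	log.Printf("sbatch: %s", out)
	return nil
}

func main() {
	done := make(chan error)
	for i := 0; i < 20; i++ {
		go func() { done <- runSbatch("--wrap=/bin/true") }()
	}
	for i := 0; i < 20; i++ {
		if err := <-done; err != nil {
			log.Print(err)
		}
	}
}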
Updated by Lucas Di Pentima over 6 years ago
Not much to review :) this LGTM, thanks.
Updated by Tom Clegg over 6 years ago
- Subject changed from crunch-dispatch-slurm systemd unit does not set nofile limit, and crunch-dispatch-slurm does not handle "too many open files" errors well to crunch-dispatch-slurm uses a surprising number of file descriptors, and stops dispatching when it gets a "too many open files" error
- Status changed from In Progress to New
- Assigned To deleted (Tom Clegg)
- Target version changed from 2018-04-11 Sprint to Arvados Future Sprints
Updated by Peter Amstutz over 3 years ago
- Target version deleted (Arvados Future Sprints)