Project

General

Profile

Actions

Bug #13142

open

crunch-dispatch-slurm uses a surprising number of file descriptors, and stops dispatching when it gets a "too many open files" error

Added by Joshua Randall about 6 years ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Assigned To:
-
Category:
Crunch
Target version:
Story points:
-
Release:
Release relationship:
Auto

Description

For about 3 minutes this morning (between 09:04:40 and 09:07:11), our crunch-dispatch-slurm reported 7565 "too many open files errors" in its log:

#  grep 'arvados-master-ncucu crunch-dispatch-slurm' /var/log/syslog | grep 'too many' | wc -l
7565
# grep 'arvados-master-ncucu crunch-dispatch-slurm' /var/log/syslog | grep 'too many' | cut -d" " -f8- | sort | uniq -c | sort -rg | head -n 10
   5059 scancel: /usr/bin/scancel: open /dev/null: too many open files ("")
    472 scancel: /usr/bin/scancel: pipe2: too many open files ("")
     98 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: open /dev/null: too many open files "" 
     21 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: pipe2: too many open files "" 
      6 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: fork/exec /usr/bin/squeue: too many open files "" 
      6 2 times: [ 2018/02/28 09:05:23 scancel: /usr/bin/scancel: open /dev/null: too many open files ("")]
      1 Error updating container ncucu-dz642-zwq2f3i5rj6r5wx to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zwq2f3i5rj6r5wx: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-zhsu1zqrsizyaxg to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zhsu1zqrsizyaxg: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-zeftzdg5kifnvcp to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zeftzdg5kifnvcp: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-yrv1f4husp85ivo to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-yrv1f4husp85ivo: dial tcp 10.101.0.179:443: socket: too many open files

The process has the default nofile limits, which is 1024 soft limit and 4096 hard limit.

We installed crunch-dispatch-slurm according to the instructions (http://doc.arvados.org/install/crunch2-slurm/install-dispatch.html) and it is being run by systemd using the unit installed by the apt package.
Version:

# dpkg -l |grep crunch
ii  crunch-dispatch-slurm            0.1.20171102145827.cc6f86f-1      amd64        Dispatch Crunch containers to a SLURM cluster

Between 09:07:11 and 17:17:27, crunch-dispatch-slurm continued to run and logged:
- 8594 "Done monitoring container <UUID>" lines.
- 37 "Error getting final container state for <UUID>: arvados API server error: Path not found (404: 404 Not Found) returned by arvados-api-ncucu.hgi.sanger.ac.uk" lines.
- 32 "Error updating container <UUID> to state "Cancelled": arvados API server error: State cannot change from Complete to Cancelled (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk" lines.

Between 17:17:27 and 19:27:46, crunch-dispatch-slurm logged nothing (I think there was at that point nothing running).

At 19:27:46, I ran `systemctl restart crunch-dispatch-slurm` and it logged "crunch-dispatch-slurm dev started" and it then proceeded to submit 662 batch jobs.

It appears that the end result of running into the "too many files error" was that crunch-dispatch-slurm was going to continue to run but not process any new data. It would be better if it is in an errored state if it would terminate so that systemd can restart it automatically.

It would also be good if the systemd unit raises the crunch-dispatch-slurm nofile limit to an appropriate level.


Subtasks 1 (0 open1 closed)

Task #13293: Review 13142-dispatch-nfilesResolvedTom Clegg04/09/2018Actions
Actions

Also available in: Atom PDF