Bug #13142

crunch-dispatch-slurm uses a surprising number of file descriptors, and stops dispatching when it gets a "too many open files" error

Added by Joshua Randall 10 months ago. Updated 8 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version: -
Start date: 04/09/2018
Due date: -
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

For about 3 minutes this morning (between 09:04:40 and 09:07:11), our crunch-dispatch-slurm reported 7565 "too many open files" errors in its log:

#  grep 'arvados-master-ncucu crunch-dispatch-slurm' /var/log/syslog | grep 'too many' | wc -l
7565
# grep 'arvados-master-ncucu crunch-dispatch-slurm' /var/log/syslog | grep 'too many' | cut -d" " -f8- | sort | uniq -c | sort -rg | head -n 10
   5059 scancel: /usr/bin/scancel: open /dev/null: too many open files ("")
    472 scancel: /usr/bin/scancel: pipe2: too many open files ("")
     98 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: open /dev/null: too many open files "" 
     21 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: pipe2: too many open files "" 
      6 Error running "/usr/bin/squeue" ["squeue" "--all" "--format=%j %y %Q"]: fork/exec /usr/bin/squeue: too many open files "" 
      6 2 times: [ 2018/02/28 09:05:23 scancel: /usr/bin/scancel: open /dev/null: too many open files ("")]
      1 Error updating container ncucu-dz642-zwq2f3i5rj6r5wx to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zwq2f3i5rj6r5wx: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-zhsu1zqrsizyaxg to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zhsu1zqrsizyaxg: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-zeftzdg5kifnvcp to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-zeftzdg5kifnvcp: dial tcp 10.101.0.179:443: socket: too many open files
      1 Error updating container ncucu-dz642-yrv1f4husp85ivo to state "Cancelled": Put https://arvados-api-ncucu.hgi.sanger.ac.uk/arvados/v1/containers/ncucu-dz642-yrv1f4husp85ivo: dial tcp 10.101.0.179:443: socket: too many open files

The process has the default nofile limits: a soft limit of 1024 and a hard limit of 4096.

We installed crunch-dispatch-slurm according to the instructions (http://doc.arvados.org/install/crunch2-slurm/install-dispatch.html) and it is being run by systemd using the unit installed by the apt package.
Version:

# dpkg -l |grep crunch
ii  crunch-dispatch-slurm            0.1.20171102145827.cc6f86f-1      amd64        Dispatch Crunch containers to a SLURM cluster

Between 09:07:11 and 17:17:27, crunch-dispatch-slurm continued to run and logged:
- 8594 "Done monitoring container <UUID>" lines.
- 37 "Error getting final container state for <UUID>: arvados API server error: Path not found (404: 404 Not Found) returned by arvados-api-ncucu.hgi.sanger.ac.uk" lines.
- 32 "Error updating container <UUID> to state "Cancelled": arvados API server error: State cannot change from Complete to Cancelled (422: 422 Unprocessable Entity) returned by arvados-api-ncucu.hgi.sanger.ac.uk" lines.

Between 17:17:27 and 19:27:46, crunch-dispatch-slurm logged nothing (I think there was at that point nothing running).

At 19:27:46, I ran `systemctl restart crunch-dispatch-slurm` and it logged "crunch-dispatch-slurm dev started" and it then proceeded to submit 662 batch jobs.

It appears that the end result of running into the "too many open files" errors was that crunch-dispatch-slurm continued to run but did not process any new work. It would be better if it terminated on reaching this errored state, so that systemd could restart it automatically.

It would also be good if the systemd unit raised the crunch-dispatch-slurm nofile limit to an appropriate level.
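
As a sketch, a systemd drop-in could raise the limit (comment #1 below creates the same drop-in via `systemctl edit`); the value here is illustrative, not a recommendation from this ticket:

```ini
# /etc/systemd/system/crunch-dispatch-slurm.service.d/override.conf
# Illustrative value; pick a limit appropriate for the expected job volume.
[Service]
LimitNOFILE=1000000
```

After adding the drop-in, `systemctl daemon-reload` followed by a restart of the service applies the new limit.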


Subtasks

Task #13293: Review 13142-dispatch-nfiles (Resolved; assigned to Tom Clegg)

Associated revisions

Revision 86a14d03
Added by Tom Clegg 8 months ago

Merge branch '13142-dispatch-nfiles'

refs #13142

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Joshua Randall 10 months ago

I have manually raised max open files by creating a systemd drop-in:

# printf 'a\n[Service]\nLimitNOFILE=1000000\n.\nw\nq' | EDITOR=ed systemctl edit crunch-dispatch-slurm
# systemctl restart crunch-dispatch-slurm
# cat /proc/$(ps auxwww |grep crunch-dispatch-slurm | grep -v grep | awk '{print $2}')/limits | grep files
Max open files            1000000              1000000              files

#2 Updated by Tom Morris 9 months ago

  • Target version set to 2018-04-11 Sprint

#3 Updated by Tom Clegg 9 months ago

We should check for leaking FDs. We might be relying on garbage collection to clean up abandoned exec.Command fds.

#4 Updated by Tom Clegg 9 months ago

  • Assigned To set to Tom Clegg

#5 Updated by Tom Clegg 9 months ago

Tried to find a leak, but it looks like
  • we call Wait (or something that calls Wait) every time we use exec.Command
  • we use one arvados client object for all api calls
  • a dispatcher that's been running jobs all day has 4 FDs open

...and we only run one scancel at a time, so I'm guessing this happens during a busy time (lots of sbatch invocations).

I suppose we could rate-limit sbatch, but either way, 1024 is a silly limit for a program like this, so:

13142-dispatch-nfiles @ 580949047505722b11956f09177a8537a1fef015

#6 Updated by Tom Clegg 8 months ago

  • Status changed from New to In Progress

#7 Updated by Lucas Di Pentima 8 months ago

Not much to review :) this LGTM, thanks.

#8 Updated by Tom Clegg 8 months ago

  • Subject changed from crunch-dispatch-slurm systemd unit does not set nofile limit, and crunch-dispatch-slurm does not handle "too many open files" errors well to crunch-dispatch-slurm uses a surprising number of file descriptors, and stops dispatching when it gets a "too many open files" error
  • Status changed from In Progress to New
  • Assigned To deleted (Tom Clegg)
  • Target version changed from 2018-04-11 Sprint to Arvados Future Sprints
