Feature #8163

[FUSE] arv-mount should detect and log any files/dirs that are still open after unmounting

Added by Tom Clegg over 8 years ago. Updated 2 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: FUSE
Target version:
Story points: -
Release:
Release relationship: Auto

Description

Background

If you wrap a program in "arv-mount --exec" (e.g., by running a job on a compute node) and the wrapped program exits but some other process has started using the mount (e.g., by reading a file or having cwd in the mount), the FUSE mount will detach but arv-mount will stay alive until all open files/dirs are released.

In practice, this means a background process like "updatedb" can start while a crunch job task is running, and prevent arv-mount from exiting when the task container exits. This is difficult to debug: there's no mount point any more, so the usual "fuser" and "lsof" tools can't help you find the offending process even if you realize this is happening.

Demonstration

terminal 1:
$ date; arv-mount MNT --exec sh -c 'sleep 5; date'; date
Thu Jan  7 15:35:03 EST 2016
Thu Jan  7 15:35:08 EST 2016
Thu Jan  7 15:35:14 EST 2016

terminal 2:
$ date; (cd MNT/home; sleep 10); date
Thu Jan  7 15:35:04 EST 2016
Thu Jan  7 15:35:14 EST 2016

Proposed improvement 1

When unmounting or receiving SIGUSR1, print (on stderr) a list of processes that still have open files/directories.
  • An llfuse.RequestContext has a "pid" field that (I hope) will make this information relatively easy to track and report.
  • If it turns out to be much easier to print a message the first time a given PID does some operation after unmounting, or the first time any PID does some operation after we receive SIGUSR1, those options would be nearly as good.

This won't fix the problem but it will make it possible for a user/sysadmin to [a] figure out that this is why a job task isn't exiting even though its docker container has exited, and [b] track down which process is responsible for keeping arv-mount alive.
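A minimal sketch of the bookkeeping improvement 1 would need, assuming the FUSE operation handlers can obtain the caller's PID (for example from the llfuse request context, which the proposal itself only hopes is available). The class name OpenHandleTracker, its methods, and the report format are hypothetical and not part of arv-mount; the handlers would call opened() from open/opendir and released() from release/releasedir.

```python
import collections
import signal
import sys
import threading


class OpenHandleTracker:
    """Remember which PID opened each file/dir handle that is still open.

    Hypothetical helper: nothing here is existing arv-mount code.
    """

    def __init__(self):
        # RLock so that a signal handler running in the main thread can
        # re-enter safely if the main thread already holds the lock.
        self._lock = threading.RLock()
        self._handles = {}  # file handle -> (pid, path)

    def opened(self, fh, pid, path):
        """Record that `pid` opened `path` and was given handle `fh`."""
        with self._lock:
            self._handles[fh] = (pid, path)

    def released(self, fh):
        """Forget a handle once the kernel releases it."""
        with self._lock:
            self._handles.pop(fh, None)

    def by_pid(self):
        """Return {pid: sorted list of paths still open by that pid}."""
        with self._lock:
            snapshot = list(self._handles.values())
        grouped = collections.defaultdict(list)
        for pid, path in snapshot:
            grouped[pid].append(path)
        return {pid: sorted(paths) for pid, paths in grouped.items()}

    def report(self, out=None):
        """Print one line per PID that still holds open files/dirs."""
        out = sys.stderr if out is None else out
        for pid, paths in sorted(self.by_pid().items()):
            print("pid %d still has open: %s" % (pid, ", ".join(paths)),
                  file=out)


def install_sigusr1_report(tracker):
    """Make SIGUSR1 print the report (call this from the main thread)."""
    signal.signal(signal.SIGUSR1, lambda signum, frame: tracker.report())
```

The same report() call could also run once at unmount time. Keying on the file handle rather than on the PID means a process that opens several files and closes only some of them keeps appearing in the report until its last handle is released.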

Proposed improvement 2

When using --exec, after the child exits, return IO errors for all operations. At least in some cases (like updatedb), this will have the desired effect of causing the intruding process to give up reasonably quickly so arv-mount can exit. (It should be possible to control this behavior with a command line switch, though: there might be some use cases where the current behavior is actually desired.)
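One way improvement 2 could be structured is a small gate that wraps the filesystem operation handlers and starts failing them with EIO once the --exec child has exited. This is only a sketch: ExecGate, guard(), and the fail_after_exit parameter (standing in for the proposed command line switch) are hypothetical names, and a real llfuse handler would presumably raise llfuse.FUSEError(errno.EIO) instead of OSError.

```python
import errno
import functools
import threading


class ExecGate:
    """Fail guarded operations with EIO after the --exec child exits.

    Hypothetical helper: nothing here is existing arv-mount code.
    """

    def __init__(self, fail_after_exit=True):
        # fail_after_exit=False keeps the current behavior, where other
        # processes can keep using the mount after the child exits.
        self.fail_after_exit = fail_after_exit
        self._child_exited = threading.Event()

    def child_exited(self):
        """Call once the wrapped program has exited."""
        self._child_exited.set()

    def guard(self, func):
        """Wrap an operation handler so it raises EIO after child exit."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if self.fail_after_exit and self._child_exited.is_set():
                raise OSError(errno.EIO, "mount is shutting down")
            return func(*args, **kwargs)
        return wrapper
```

Handlers that give resources back (release, releasedir, and similar) should probably stay unguarded, since arv-mount can only exit once those calls have gone through.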


Related issues

Related to Arvados - Bug #8288: arv-mount / crunchstat in a crunch job fails to exit because reasons (Resolved, Tom Clegg, 01/23/2016)