Feature #8163
open
[FUSE] arv-mount should detect and log any files/dirs that are still open after unmounting
Description
Background
If you wrap a program in "arv-mount --exec" (e.g., by running a job on a compute node) and the wrapped program exits but some other process has started using the mount (e.g., by reading a file or having cwd in the mount), the FUSE mount will detach but arv-mount will stay alive until all open files/dirs are released.
In practice, this means a background process like "updatedb" can start while a crunch job task is running, and prevent arv-mount from exiting when the task container exits. This is difficult to debug: there's no mount point any more, so the usual "fuser" and "lsof" tools can't help you find the offending process even if you realize this is happening.
Demonstration
terminal 1:
$ date; arv-mount MNT --exec sh -c 'sleep 5; date'; date
Thu Jan 7 15:35:03 EST 2016
Thu Jan 7 15:35:08 EST 2016
Thu Jan 7 15:35:14 EST 2016

terminal 2:
$ date; (cd MNT/home; sleep 10); date
Thu Jan 7 15:35:04 EST 2016
Thu Jan 7 15:35:14 EST 2016
Proposed improvement 1
When unmounting or receiving SIGUSR1, print (on stderr) a list of processes that still have open files/directories.
- An llfuse.RequestContext has a "pid" field that (I hope) will make this information relatively easy to track and report.
- If it turns out to be much easier to print a message the first time a given PID does some operation after unmounting, or the first time any PID does some operation after we receive SIGUSR1, those options would be nearly as good.
This won't fix the problem but it will make it possible for a user/sysadmin to [a] figure out that this is why a job task isn't exiting even though its docker container has exited, and [b] track down which process is responsible for keeping arv-mount alive.
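One way the bookkeeping for this could look is sketched below. It is only an illustration, not arv-mount code: the class name OpenHandleTracker and its methods are hypothetical, and the sketch assumes the FUSE open/opendir and release/releasedir handlers can pass in a file handle, the caller's PID (for example from the request context's "pid" field), and a path.

```python
import collections
import sys
import threading


class OpenHandleTracker:
    """Hypothetical helper: remember which PID opened each handle."""

    def __init__(self):
        self._lock = threading.Lock()
        self._handles = {}  # file handle -> (pid, path)

    def opened(self, fh, pid, path):
        # Would be called from the open/opendir handlers (assumption).
        with self._lock:
            self._handles[fh] = (pid, path)

    def released(self, fh):
        # Would be called from the release/releasedir handlers (assumption).
        with self._lock:
            self._handles.pop(fh, None)

    def by_pid(self):
        # Group the paths that are still open by the PID that opened them.
        with self._lock:
            snapshot = list(self._handles.values())
        grouped = collections.defaultdict(list)
        for pid, path in snapshot:
            grouped[pid].append(path)
        return {pid: sorted(paths) for pid, paths in grouped.items()}

    def report(self, out=sys.stderr):
        # One line per process that still holds something open.
        for pid, paths in sorted(self.by_pid().items()):
            out.write("pid %d still has open: %s\n" % (pid, ", ".join(paths)))
```

In this sketch, report() is what the unmount path or a SIGUSR1 handler would call. It only sees explicit open/release pairs, so a process that merely has its cwd inside the mount may not show up this way and could need separate handling.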
Proposed improvement 2
When using --exec, after the child exits, return IO errors for all operations. At least in some cases (like updatedb), this will have the desired effect of causing the intruding process to give up reasonably quickly so arv-mount can exit. (It should be possible to control this behavior with a command line switch, though: there might be some use cases where the current behavior is actually desired.)
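A rough sketch of that gating logic follows. The names ChildExitGate, child_exited, and guard are made up for illustration, and OSError stands in for whatever exception the FUSE binding expects (with llfuse this would presumably be llfuse.FUSEError(errno.EIO)). The fail_after_exit flag plays the role of the proposed command line switch.

```python
import errno
import functools
import threading


class ChildExitGate:
    """Hypothetical helper: fail operations once the --exec child has exited."""

    def __init__(self, fail_after_exit=True):
        # fail_after_exit=False keeps the current behavior (assumed switch).
        self.fail_after_exit = fail_after_exit
        self._exited = threading.Event()

    def child_exited(self):
        # Would be called once the wrapped command has been reaped.
        self._exited.set()

    def guard(self, handler):
        # Wrap a filesystem operation so it raises EIO after the child exits.
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            if self.fail_after_exit and self._exited.is_set():
                raise OSError(errno.EIO, "wrapped command has exited")
            return handler(*args, **kwargs)
        return wrapper
```

Wrapping each operation handler with guard() keeps the check in one place. Operations behave normally until child_exited() is called, after which every guarded call fails unless the switch is turned off.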