Bug #12306

[arv-mount] --unmount should work on an unresponsive mount

Added by Tom Clegg over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 09/22/2017
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Description

Currently, if an arv-mount process is in some deadlocked/stuck state, running arv-mount --unmount PATH just hangs instead of unmounting.

When this happens, echo 1 > /sys/fs/fuse/connections/NNN/abort revives the stuck unmount command.
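For reference, a rough sketch of how NNN could be looked up without touching the stuck mount itself. It assumes the connection directory name equals the minor device number in the third field of /proc/self/mountinfo (FUSE mounts normally sit on an anonymous device with major 0), and that the mount point contains no escaped characters. fuse_abort_file is a hypothetical helper, not part of arvados_fuse, and the sample line is made up.

```python
import os


def fuse_abort_file(mountinfo_text, mount_path,
                    control_dir="/sys/fs/fuse/connections"):
    """Return the abort file for the FUSE mount at mount_path, or None.

    Only parses text; it never calls stat() on the mount point.
    """
    for line in mountinfo_text.splitlines():
        fields = line.split(" ")
        try:
            # Optional fields end at a lone "-"; the fs type follows it.
            sep = fields.index("-", 6)
        except ValueError:
            continue
        if len(fields) <= sep + 1:
            continue
        fstype = fields[sep + 1]
        if fstype != "fuse" and not fstype.startswith("fuse."):
            continue
        if fields[4] != mount_path:
            continue
        major, minor = fields[2].split(":")
        if int(major) != 0:
            # Assumption: FUSE connections use anonymous (major 0) devices.
            continue
        return os.path.join(control_dir, minor, "abort")
    return None


# Made-up sample line in /proc/self/mountinfo format.
SAMPLE = ("95 25 0:48 / /data-sdd/foo/keep rw,nosuid,nodev,relatime "
          "shared:50 - fuse foo rw,user_id=1000,group_id=1000")
print(fuse_abort_file(SAMPLE, "/data-sdd/foo/keep"))
# prints /sys/fs/fuse/connections/48/abort
```

Writing "1" to the returned file would then be the equivalent of the echo command above.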

It looks like arv-mount --unmount attempts to lstat() all mount points in /proc/self/mounts and lstat(stuck_mount_path) hangs.

This seems to be the fault of realpath() in source:services/fuse/arvados_fuse/unmount.py:

    while True:
        mounted = False
        for m in mountinfo():
            if m.is_fuse and (mnttype is None or mnttype == m.mnttype):
                try:
                    if os.path.realpath(m.path) == path:

On the shell node where this happened, where /home and /home/foo are both symlinks, arv-mount /home/foo/keep results in /data-sdd/foo/keep appearing in /proc/self/mountinfo, which means realpath() is superfluous here. (Is that true on all systems?)
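If the mount point recorded in mountinfo is already the resolved path, the comparison can be done on strings alone. A minimal sketch of that idea follows, assuming the standard /proc/self/mountinfo layout (mount point in the fifth field, filesystem type right after the lone "-" separator, and space/tab/newline/backslash written as octal escapes). is_fuse_mounted and unescape_mount_path are hypothetical names; the real unmount.py uses its own mountinfo() helper, which is not reproduced here.

```python
import re


def unescape_mount_path(field):
    """Decode octal escapes such as \\040 (space) used in mountinfo paths."""
    return re.sub(r"\\([0-7]{3})",
                  lambda m: chr(int(m.group(1), 8)), field)


def is_fuse_mounted(mountinfo_text, path, mnttype=None):
    """Return True if a FUSE mount is listed at exactly `path`.

    `path` must already be absolute and symlink-free; nothing here
    calls stat(), lstat() or realpath() on any mount point.
    """
    for line in mountinfo_text.splitlines():
        fields = line.split(" ")
        try:
            sep = fields.index("-", 6)
        except ValueError:
            continue
        if len(fields) <= sep + 1:
            continue
        fstype = fields[sep + 1]
        is_fuse = fstype == "fuse" or fstype.startswith("fuse.")
        if not is_fuse or (mnttype is not None and mnttype != fstype):
            continue
        if unescape_mount_path(fields[4]) == path:
            return True
    return False
```

In real use the text would come from reading /proc/self/mountinfo, which does not block on a hung FUSE daemon the way lstat() on the mount point does.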


Subtasks

Task #12564: Review 12306-dont-stat-mounts (Resolved, Peter Amstutz)


Related issues

Related to Arvados - Bug #11994: [arv-mount] Do not crash if /sys/fs/fuse/connections is empty (Resolved, 07/19/2017)

Related to Arvados - Bug #12538: crunch-run failing to terminate after complete (Resolved, 11/06/2017)

Associated revisions

Revision 0af05308
Added by Tom Clegg over 1 year ago

Merge branch '12306-dont-stat-mounts'

fixes #12306

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Tom Clegg over 1 year ago

These stuck mounts come up occasionally on Jenkins. When they do, all builds get stuck ("UnmountTest" -- presumably because of this bug), until someone clears the stuck mounts manually using ".../connections/NNN/abort" or "fusermount -u -z".

#2 Updated by Tom Morris over 1 year ago

  • Target version set to 2017-11-08 Sprint

#3 Updated by Tom Morris over 1 year ago

  • Assigned To set to Tom Morris

#4 Updated by Tom Morris over 1 year ago

  • Status changed from New to In Progress
  • Assigned To changed from Tom Morris to Tom Clegg

#5 Updated by Tom Clegg over 1 year ago

#6 Updated by Peter Amstutz over 1 year ago

So following symlinks to mounts seems weird and not something you would normally do. However, the other thing that realpath() does is turn a relative path into an absolute path, which is probably what we were really trying to use it for. So how about adding this back in?

    path = os.path.abspath(path)

(abspath doesn't use stat(); it only calls os.getcwd()).
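A quick illustration of that point, using a made-up relative path that does not exist on disk. abspath() only joins the argument onto os.getcwd() and normalizes the result, so it cannot hang on a stuck mount; it also does not resolve symlinks.

```python
import os

# "no-such-dir/keep" is a hypothetical path; it does not need to exist
# because abspath() never looks at the filesystem entry itself.
relative = "no-such-dir/keep"
absolute = os.path.abspath(relative)

print(absolute == os.path.join(os.getcwd(), "no-such-dir", "keep"))
# prints True

# A trailing slash, as in "arv-mount --unmount keep/", is normalized away.
print(os.path.abspath("keep/") == os.path.join(os.getcwd(), "keep"))
# prints True
```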

#7 Updated by Tom Clegg over 1 year ago

Peter Amstutz wrote:

So following symlinks to mounts seems weird and not something you would normally do

On our shell nodes $HOME is typically /home/username where /home is a symlink, so ~/keep doesn't appear in mountinfo but realpath(~/keep) does.

I wonder if it's worth implementing a more careful realpath() that can resolve ~/keep in such situations without calling lstat() on ~/keep itself. Seems like a bit of a rabbit hole, though.

(abspath doesn't use stat(); it only calls os.getcwd()).

Indeed, one less opportunity to fall into the realpath() hole. Added.

12306-dont-stat-mounts @ aabf1ca0e99701550f9af785e9f1fee098b0020a

#8 Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote:

Peter Amstutz wrote:

So following symlinks to mounts seems weird and not something you would normally do

On our shell nodes $HOME is typically /home/username where /home is a symlink, so ~/keep doesn't appear in mountinfo but realpath(~/keep) does.

Got it. But does that mean arv-mount --unmount won't actually work in this case, when you have a stuck mount which you are trying to unmount on a symlink path?

I wonder if it's worth implementing a more careful realpath() that can resolve ~/keep in such situations without calling lstat() on ~/keep itself. Seems like a bit of a rabbit hole, though.

How about calling realpath() on the parent directory and then joining it with the mount point?
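A minimal sketch of that suggestion, not the code that was eventually merged. realpath_via_parent is a hypothetical name. It resolves symlinks in the parent directory only, so the last path component (the mount point) is never passed to lstat(). It would still hang if the parent itself sits inside another stuck mount, and it does not help if the mount point itself is a symlink.

```python
import os


def realpath_via_parent(path):
    """Resolve symlinks in the parent of `path`, leaving the last
    component untouched so the mount point itself is never stat()ed."""
    parent, name = os.path.split(os.path.abspath(path))
    return os.path.join(os.path.realpath(parent), name)
```

With /home a symlink as described above, realpath_via_parent("/home/foo/keep") would resolve /home/foo and append "keep", giving a path that can be compared with the mountinfo entry.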

#9 Updated by Tom Clegg over 1 year ago

Indeed, the previous version would have ended up calling realpath() on ~/keep on a system where $HOME contains symlinks.

I think I made it back from the rabbit hole with a version that avoids calling realpath in those cases.

12306-dont-stat-mounts @ 08a4ebba0e5bfbc179103ac5e6916164bc8083fa

#10 Updated by Peter Amstutz over 1 year ago

Tom Clegg wrote:

Indeed, the previous version would have ended up calling realpath() on ~/keep on a system where $HOME contains symlinks.

I think I made it back from the rabbit hole with a version that avoids calling realpath in those cases.

12306-dont-stat-mounts @ 08a4ebba0e5bfbc179103ac5e6916164bc8083fa

Tentatively, safer_realpath seems to work.

I just noticed that arv-mount --unmount requires an unnecessary API token:

$ arv-mount --unmount keep/
2017-11-08 09:49:38 arvados.arv-mount[7740] ERROR: Missing environment: 'ARVADOS_API_TOKEN'

Unmounting an arv-mount which is stuck with SIGSTOP does remove the mount but doesn't kill the daemon:

  1. arv-mount
  2. SIGSTOP
  3. arv-mount --unmount (works)
  4. SIGCONT
  5. arv-mount is still there

Could be a problem if it is occupying a lot of memory and refusing to go away on its own.

#11 Updated by Peter Amstutz over 1 year ago

My preferred method to bring the hammer down:

  1. abort if available
  2. sigkill
  3. fusermount -u -z
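The sequence above could be sketched roughly as follows. This is only an outline under assumptions: the caller already knows the FUSE connection number and the daemon's PID (neither lookup is shown), and abort_connection and bring_down are hypothetical names rather than anything in arv-mount.

```python
import os
import signal
import subprocess


def abort_connection(conn_id, control_dir="/sys/fs/fuse/connections"):
    """Step 1: write "1" to the connection's abort file, if it exists.

    Returns True if the abort file was written, False if unavailable.
    """
    abort_file = os.path.join(control_dir, str(conn_id), "abort")
    if not os.path.exists(abort_file):
        return False
    with open(abort_file, "w") as f:
        f.write("1")
    return True


def bring_down(conn_id, pid, mount_path):
    """Abort if available, then SIGKILL the daemon, then lazy-unmount."""
    abort_connection(conn_id)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # daemon already gone
    # -u unmounts, -z makes it lazy so it works even if the mount is busy.
    subprocess.call(["fusermount", "-u", "-z", mount_path])
```

Killing the daemon before the lazy unmount would also cover the SIGSTOP case above, where the mount goes away but the process stays around.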

#12 Updated by Peter Amstutz over 1 year ago

Otherwise, the main goal of this bugfix (don't get stuck on realpath()) seems to be accomplished, so declare victory and merge.

LGTM

#13 Updated by Anonymous over 1 year ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:0af053088c83d1107866cb06fd6c5736d9065eee.
