Bug #11209

stuck keep fuse mounts not cleared by crunch-job

Added by Joshua Randall 9 months ago. Updated 6 months ago.

Status: Resolved
Start date: 03/02/2017
Priority: Normal
Due date:
Assignee: Tom Clegg
% Done: 100%
Category: FUSE
Target version: -
Story points: -
Remaining (hours): 0.00 hour
Velocity based estimate: -

Description

crunch-job attempts to unmount any fuse filesystems mounted under $CRUNCH_TMP, but it does so using only fusermount. On our system this often fails, and a "umount -f <mount_point>" is required to make the node work again.
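
For illustration, the selection step of that cleanup (the `mount -t fuse,fuse.keep | awk ...` pipeline visible in the log further down) can be sketched in Python 3. This is only a reading of the logged shell command, not code from crunch-job; the function name `fuse_mounts_under` and the sample text are made up for the example.

```python
def fuse_mounts_under(mount_output, prefix, fstypes=("fuse", "fuse.keep")):
    """Return mount points of the given types whose path starts with prefix.

    Hypothetical helper mirroring the logged pipeline:
    mount -t fuse,fuse.keep | awk '(index($3, prefix) == 1){print $3}'
    """
    selected = []
    for line in mount_output.splitlines():
        fields = line.split()
        # Expected shape: DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)
        if len(fields) < 5 or fields[1] != "on" or fields[3] != "type":
            continue
        mountpoint, fstype = fields[2], fields[4]
        # Plain string-prefix match, like awk's index(...) == 1.
        if fstype in fstypes and mountpoint.startswith(prefix):
            selected.append(mountpoint)
    return selected


# Illustrative sample in the format printed by mount(8).
sample = """\
/dev/sda6 on / type ext4 (rw,relatime)
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-07.10.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
/dev/fuse on /home/user/keep type fuse (rw,nosuid,nodev)
"""
print(fuse_mounts_under(sample, "/data/crunch-tmp"))
```

Like the awk version, this splits on whitespace, so a mount point containing spaces would not be handled.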

In addition, this often happens on multiple nodes at the same time - and by the time we have three nodes with wedged fuse mounts, they will rapidly fail all pending jobs. There seems to be no mechanism by which crunch dispatch can decide to stop trying to dispatch to a node that is broken.

Here is the log from a job that suffered from this issue.

dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-05-07 z8ta6-7ekkf-sa1q59632vhxov6 {"total_cpu_cores":32,"total_ram_mb":257867,"total_scratch_mb":788561}
2017-02-28_17:23:33 salloc: Granted job allocation 17536
2017-02-28_17:23:33 58397  Sanity check is `/usr/bin/docker ps -q`
2017-02-28_17:23:33 58397  sanity check: start
2017-02-28_17:23:33 58397  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker','ps','-q']
2017-02-28_17:23:33 58397  sanity check: exit 0
2017-02-28_17:23:33 58397  Sanity check OK
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397  running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397  check slurm allocation
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397  node humgen-05-07 - 10 slots
2017-02-28_17:23:33 z8ta6-8i9sb-8mp2qww92moa644 58397  start
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  clean work dirs: start
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  stderr starting: ['srun','--nodelist=humgen-05-07','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  stderr fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-07.10.keep: Invalid argument
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  stderr srun: error: humgen-05-07: task 0: Exited with exit code 123
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  clean work dirs: exit 123
2017-02-28_17:23:34 salloc: Relinquishing job allocation 17536
dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-04-02 z8ta6-7ekkf-ekzlxvozts92sqm {"total_cpu_cores":40,"total_ram_mb":193289,"total_scratch_mb":68302106}
2017-02-28_17:23:35 salloc: error: Unable to allocate resources: Requested nodes are busy
2017-02-28_17:23:35 salloc: Job allocation 17539 has been revoked.
dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-05-03 z8ta6-7ekkf-1i1v5zotflg26jn {"total_cpu_cores":32,"total_ram_mb":257867,"total_scratch_mb":788561}
2017-02-28_17:23:36 salloc: Granted job allocation 17540
2017-02-28_17:23:36 58715  Sanity check is `/usr/bin/docker ps -q`
2017-02-28_17:23:36 58715  sanity check: start
2017-02-28_17:23:36 58715  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker','ps','-q']
2017-02-28_17:23:36 58715  sanity check: exit 0
2017-02-28_17:23:36 58715  Sanity check OK
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  check slurm allocation
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  node humgen-05-03 - 10 slots
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  start
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  clean work dirs: start
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  stderr starting: ['srun','--nodelist=humgen-05-03','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  stderr fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-03.4.keep: Invalid argument
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  stderr srun: error: humgen-05-03: task 0: Exited with exit code 123
2017-02-28_17:23:38 z8ta6-8i9sb-8mp2qww92moa644 58715  clean work dirs: exit 123
2017-02-28_17:23:38 salloc: Relinquishing job allocation 17540
2017-02-28_17:23:38 close failed in file object destructor:
2017-02-28_17:23:38 sys.excepthook is missing
2017-02-28_17:23:38 lost sys.stderr
dispatching job z8ta6-8i9sb-8mp2qww92moa644 {"docker_image"=>"mercury/gatk-3.5", "min_nodes"=>1, "max_tasks_per_node"=>10, "keep_cache_mb_per_task"=>1280} to humgen-04-02 z8ta6-7ekkf-ekzlxvozts92sqm {"total_cpu_cores":40,"total_ram_mb":193289,"total_scratch_mb":68302106}
2017-02-28_17:23:40 salloc: Granted job allocation 17544
2017-02-28_17:23:40 58985  Sanity check is `/usr/bin/docker ps -q`
2017-02-28_17:23:40 58985  sanity check: start
2017-02-28_17:23:40 58985  stderr starting: ['srun','--nodes=1','--ntasks-per-node=1','/usr/bin/docker','ps','-q']
2017-02-28_17:23:40 58985  sanity check: exit 0
2017-02-28_17:23:40 58985  Sanity check OK
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  running from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/arvados-cli-0.1.20170217221854/bin/crunch-job with arvados-cli Gem version(s) 0.1.20170217221854, 0.1.20161017193526, 0.1.20160503204200, 0.1.20151207150126, 0.1.20151023190001
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  check slurm allocation
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  node humgen-04-02 - 10 slots
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  start
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  clean work dirs: start
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  stderr starting: ['srun','--nodelist=humgen-04-02','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  stderr fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-04-02.9.keep: Invalid argument
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  stderr srun: error: humgen-04-02: task 0: Exited with exit code 123
2017-02-28_17:23:41 z8ta6-8i9sb-8mp2qww92moa644 58985  clean work dirs: exit 123
2017-02-28_17:23:41 salloc: Relinquishing job allocation 17544
2017-02-28_17:23:41 close failed in file object destructor:
2017-02-28_17:23:41 sys.excepthook is missing
2017-02-28_17:23:41 lost sys.stderr


Subtasks

Task #11377: Honor subtype arg (Resolved, Tom Clegg)

Task #11378: Warn that most users don't want --unmount-all (Resolved, Tom Clegg)

Task #11292: Review 11209-unmount-replace (Resolved, Lucas Di Pentima)

Task #11376: Review 11209-unmount-subtype (Resolved, Lucas Di Pentima)

Task #11353: use arv-mount --unmount-all in crunch-job (Resolved, Tom Clegg)

Task #11504: review 11209-crunch-unmount-all (Resolved, Lucas Di Pentima)

Associated revisions

Revision fe0751fd
Added by Tom Clegg 8 months ago

Merge branch '11209-unmount-replace'

refs #11209

Revision fbc867e0
Added by Tom Clegg 8 months ago

11209: Restore missing import.

refs #11209

Revision 2f52a6e7
Added by Tom Clegg 8 months ago

Merge branch '11209-unmount-subtype'

refs #11209

Revision 9cab6a09
Added by Tom Clegg 7 months ago

Merge branch '11209-crunch-unmount-all'

refs #11209

Revision fc2eaa20
Added by Tom Clegg 7 months ago

Fix crunch-run tests.

refs #11209

History

#1 Updated by Tom Clegg 9 months ago

Normally crunch-job frees up mount points using fusermount -u -z but for some reason it isn't working here:

2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  clean work dirs: start
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  stderr starting: ['srun','--nodelist=humgen-05-07','-D','/data/crunch-tmp','bash','-ec','-o','pipefail','mount -t fuse,fuse.keep | awk "(index(\\$3, \\"$CRUNCH_TMP\\") == 1){print \\$3}" | xargs -r -n 1 fusermount -u -z; sleep 1; rm -rf $JOB_WORK $CRUNCH_INSTALL $CRUNCH_TMP/task $CRUNCH_TMP/src* $CRUNCH_TMP/*.cid']
2017-02-28_17:23:34 z8ta6-8i9sb-8mp2qww92moa644 58397  stderr fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-07.10.keep: Invalid argument

Could this be https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=632258 ? (Looks similar, seems to have been fixed by upgrading fuse from 2.8.5-3 to 2.9.2-4.)

On Debian jessie and Ubuntu xenial test systems:
  • writing 1 to /sys/fs/fuse/connections/ZZZ/abort (where ZZZ is the device minor number from /proc/self/mountinfo) kills arv-mount and puts the mountpoint in "transport endpoint is not connected" state, but has no effect at all on a mountpoint that's in that state already. (The fuse docs claim this is the way to kill a mount that "always works".)
  • "umount", "umount -l", "umount -f" all fail with EPERM
  • "fusermount -z -u" always works
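
As a rough sketch of the first bullet, the abort control file for a mount point can be located by reading the device minor number out of /proc/self/mountinfo. The helper name `fuse_abort_path` is hypothetical, the code is Python 3, and it only computes the path; it does not write to it.

```python
def fuse_abort_path(mountinfo_text, mountpoint):
    """Return the fusectl abort path for a fuse mount point, or None.

    Hypothetical helper. mountinfo fields are: mount ID, parent ID,
    major:minor, root, mount point, options, ..., "-", fstype, source, ...
    """
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or "-" not in fields:
            continue
        sep = fields.index("-")
        if sep + 1 >= len(fields):
            continue
        fstype = fields[sep + 1]
        # Accept "fuse" and "fuse.<subtype>", but not "fusectl".
        is_fuse = fstype == "fuse" or fstype.startswith("fuse.")
        if is_fuse and fields[4] == mountpoint:
            minor = fields[2].split(":")[1]
            return "/sys/fs/fuse/connections/{}/abort".format(minor)
    return None


# Illustrative mountinfo text with one fusectl line and one fuse mount.
sample = (
    "23 17 0:17 / /sys/fs/fuse/connections rw,relatime - fusectl none rw\n"
    "43 39 0:31 / /mnt/keep rw,nosuid,nodev,relatime - fuse /dev/fuse rw,max_read=131072\n"
)
print(fuse_abort_path(sample, "/mnt/keep"))
```

Writing 1 to the returned path is the abort mechanism described above. If the mount point has no mountinfo entry, the function returns None and there is nothing to abort this way.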

If "umount" needs root and "fusermount" doesn't work, I'm not sure what we should do. We could use a different mount point, but that would cause zombie mountpoints to accumulate over time, which could eventually put the system in an even worse state (although at least it would take longer to get there).

#2 Updated by Tom Clegg 9 months ago

  • Category set to FUSE
  • Status changed from New to In Progress
  • Assignee set to Tom Clegg

#3 Updated by Joshua Randall 8 months ago

When I run `umount -f` to clear the problem, it has always been as root. Never tried running it as any other user.

#4 Updated by Tom Clegg 8 months ago

The fuse bug seems to be related to a double-mounted mount point. Perhaps the trick is to avoid getting into this state by waiting for the mount to detach (perhaps by calling stat until it works) after calling "fusermount -u -z".

(This problem is occurring on systems with fuse≥2.9.2-4, where supposedly that bug is fixed -- but this seems like good race-prevention behavior anyway.)
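
A minimal sketch of that "call stat until it works" idea, in Python 3. It assumes that a mount point which has not finished detaching fails stat with ENOTCONN ("transport endpoint is not connected"); the function name and its parameters are invented for the example and are not arv-mount's API.

```python
import errno
import os
import time


def wait_until_detached(path, timeout=10.0, interval=0.1,
                        stat=os.stat, sleep=time.sleep):
    """Poll stat(path) until it stops failing with ENOTCONN.

    Returns True once stat succeeds, False if the timeout expires.
    Always tries at least once, even with timeout=0. The stat and sleep
    arguments exist only so the loop can be exercised without a real mount.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            stat(path)
            return True
        except OSError as e:
            # Anything other than "not connected" is a different problem.
            if e.errno != errno.ENOTCONN:
                raise
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

This would be called right after "fusermount -u -z". A successful stat only shows that the path is usable again; it does not by itself prove that nothing is mounted there.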

#5 Updated by Tom Clegg 8 months ago

  • Target version set to 2017-03-29 sprint

#6 Updated by Tom Clegg 8 months ago

11209-unmount-replace @ 5752685c137c5e37e13845f5328e9a3930fa3100

This should let us replace the "mount|awk|grep|xargs fusermount;sleep" script in crunch-job with "arv-mount --unmount $CRUNCH_TMP/..." and ensure we don't try to proceed any further until all fuse mounts are detached.

#7 Updated by Lucas Di Pentima 8 months ago

  • File services/fuse/arvados_fuse/command.py
    • Line 14: Can this line be eliminated because of line 15?
    • Shouldn’t self.args.replace have the same semantics as self.args.unmount regarding the unmount_all() feature?
  • Reusing self.args.unmount_timeout in unmount()/unmount_all() may be problematic, because it seems to have a different meaning in __exit__: there, unmount_timeout=0 appears to mean "no timeout", whereas the rest of the code seems to treat unmount_timeout=0 as "don't wait". Is that right?
  • Using an "unmount_timeout < 0" would always produce a timeout exception without trying at least once to unmount.
  • Should these new flags have their related tests?

#8 Updated by Tom Clegg 8 months ago

Lucas Di Pentima wrote:

  • File services/fuse/arvados_fuse/command.py
    • Line 14: Can this line be eliminated because of line 15?

Sure, don't see why not.

  • Shouldn’t self.args.replace have the same semantics as self.args.unmount regarding the unmount_all() feature?

The only difference is that "/path/..." means "/path and any mountpoint below it" in unmount_all(). So the question is about what should happen if someone runs

arv-mount --replace /path/...

I figure since we'll try to mount at the literal path "/path/..." we have to assume "/path/..." really means just "/path/..." and only unmount whatever we find at that specific path, not "everything under /path".

Does this make sense?

  • Reusing self.args.unmount_timeout in unmount()/unmount_all() may be problematic, because it seems to have a different meaning in __exit__: there, unmount_timeout=0 appears to mean "no timeout", whereas the rest of the code seems to treat unmount_timeout=0 as "don't wait". Is that right?
  • Using an "unmount_timeout < 0" would always produce a timeout exception without trying at least once to unmount.

Ah, yes, unmount(timeout=0) means "raise exception" which seems useless. Fixed so it always tries at least once.

  • Should these new flags have their related tests?

I'm dreading finding new ways for threads/processes to deadlock and leave fuse in weird states ... but yes, it should be possible to make a test case that runs some arv-mount child processes and unmounts them with another.

#9 Updated by Tom Clegg 8 months ago

11209-unmount-replace @ b7a664f09052ac048e506bed9bb48b54bc2a9bd4
  • remove superfluous crunchstat import
  • unmount(timeout=0) tries unmount 1x
  • test cases for --unmount and --replace
  • fix missing import so --unmount and --replace actually work (thanks, new test cases!)

#10 Updated by Lucas Di Pentima 8 months ago

Tom Clegg wrote:

I figure since we'll try to mount at the literal path "/path/..." we have to assume "/path/..." really means just "/path/..." and only unmount whatever we find at that specific path, not "everything under /path".
Does this make sense?

It makes sense, and in that case it raises another question: if we treat "/path/..." as a literal in the args.replace case, shouldn't we check whether "/path/..." exists when using args.unmount before assuming we're trying to unmount all mounted dirs below "/path/"? Or maybe, if this convention is too confusing, use an additional flag for the recursive unmount feature?

I'm dreading finding new ways for threads/processes to deadlock and leave fuse in weird states ... but yes, it should be possible to make a test case that runs some arv-mount child processes and unmounts them with another.

I've run them on my local machine, and got some errors, for example:

======================================================================
ERROR: test_replace (tests.test_unmount.UnmountTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lucas/arvados_local/services/fuse/tests/test_unmount.py", line 29, in test_replace
    '--exec', 'true'])
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['arv-mount', '--subtype', 'test', '--replace', '--unmount-timeout', '10', '/tmp/tmp1_nFm2', '--exec', 'true']' returned non-zero exit status 1

======================================================================
ERROR: test_replace (tests.test_unmount.UnmountTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lucas/arvados_local/services/fuse/tests/test_unmount.py", line 15, in tearDown
    super(UnmountTest, self).tearDown()
  File "/home/lucas/arvados_local/services/fuse/tests/integration_test.py", line 66, in tearDown
    os.rmdir(self.mnt)
OSError: [Errno 16] Device or resource busy: '/tmp/tmp1_nFm2'

#11 Updated by Tom Clegg 8 months ago

Fixed a race condition in the tests, and a problem with the refactored "standalone mode" code (evidently it's critical to do DaemonContext() before subscribing to websocket). That might have caused the unmount tests to fail unreliably in b7a66.

"--unmount /path/..." is now "--unmount-all /path"

11209-unmount-replace @ 8b4d5991f9d5691b9fa2898d6f60eef8dbfdf987
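
To make the difference between the two flags concrete, here is a small Python 3 sketch of "exactly this path" versus "this path and any mountpoint below it". The helper names are hypothetical and this is not the code in the branch; it only illustrates the semantics discussed above.

```python
import os.path


def at_or_below(mountpoint, root):
    """True if mountpoint is root itself or lies under it (by path component)."""
    root = os.path.normpath(root)
    mountpoint = os.path.normpath(mountpoint)
    return mountpoint == root or mountpoint.startswith(root.rstrip("/") + "/")


def select_mounts(mountpoints, path, recursive):
    """Pick the mount points a non-recursive or recursive unmount would target."""
    if recursive:
        return [m for m in mountpoints if at_or_below(m, path)]
    return [m for m in mountpoints
            if os.path.normpath(m) == os.path.normpath(path)]


mounts = ["/data/crunch-tmp/crunch-job/task/a.keep", "/data/crunch-tmp2/b.keep"]
print(select_mounts(mounts, "/data/crunch-tmp", recursive=True))
print(select_mounts(mounts, "/data/crunch-tmp", recursive=False))
```

Matching on whole path components keeps a sibling directory such as /data/crunch-tmp2 out of a recursive unmount of /data/crunch-tmp, which a plain string-prefix test would not.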

#12 Updated by Lucas Di Pentima 8 months ago

LGTM. All tests passing now.

#13 Updated by Tom Clegg 8 months ago

  • Target version changed from 2017-03-29 sprint to 2017-04-12 sprint

#14 Updated by Tom Clegg 8 months ago

#15 Updated by Lucas Di Pentima 8 months ago

LGTM.

#16 Updated by Tom Clegg 8 months ago

  • Target version changed from 2017-04-12 sprint to 2017-04-26 sprint

#17 Updated by Tom Clegg 7 months ago

11209-crunch-unmount-all @ d64ed33e94700f8204ec8089c7b235cff918f9f7

2017-04-14_17:20:15 4xphq-8i9sb-5fhfjo3g28krpw5 1564  clean work dirs: start
2017-04-14_17:20:15 4xphq-8i9sb-5fhfjo3g28krpw5 1564  stderr starting: ['srun','--nodelist=compute1','-D','/tmp','bash','-ec',' arv-mount --unmount-timeout 10 --unmount-all ${CRUNCH_TMP} rm -rf ${JOB_WORK} ${CRUNCH_INSTALL} ${CRUNCH_TMP}/task ${CRUNCH_TMP}/src* ${CRUNCH_TMP}/*.cid     ']
2017-04-14_17:20:16 4xphq-8i9sb-5fhfjo3g28krpw5 1564  clean work dirs: exit 0

-- https://workbench.4xphq.arvadosapi.com/jobs/4xphq-8i9sb-5fhfjo3g28krpw5#Log

#19 Updated by Joshua Randall 7 months ago

I now have a wedged arv-mount on one of my compute nodes on which I have the new arv-mount.

Unfortunately, the new `--unmount-all` option does not appear to clear the stuck mount:

root@humgen-05-13:~# arv-mount --version
/usr/bin/arv-mount 0.1.20170407172413
root@humgen-05-13:~# mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-13.1.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
root@humgen-05-13:~# arv-mount --unmount-all /data/crunch-tmp/crunch-job/task/humgen-05-13.1.keep
root@humgen-05-13:~# mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-13.1.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)

#20 Updated by Joshua Randall 7 months ago

I've just been looking through the code to try to figure this out and it looks like the issue is that the wedged mount is not showing up in /proc/self/mountinfo (neither for root nor for the crunch user):

root@humgen-05-13:~# mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-13.1.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
root@humgen-05-13:~# cat /proc/self/mountinfo | grep fuse
23 17 0:17 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
crunch@humgen-05-13:/$ cat /proc/self/mountinfo | grep fuse
23 17 0:17 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
crunch@humgen-05-13:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-13.1.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)

However, running with `--replace` does seem to work:

crunch@humgen-05-13:/$ arv-mount /data/crunch-tmp/crunch-job/task/humgen-05-13.1.keep
mount: according to mtab, /dev/fuse is already mounted on /data/crunch-tmp/crunch-job/task/humgen-05-13.1.keep

2017-04-19 00:10:46 arvados.arv-mount[20686] ERROR: arv-mount: exception during mount: fuse_mount failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 365, in _run_standalone
    with self:
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 133, in __enter__
    llfuse.init(self.operations, self.args.mountpoint, self._fuse_options())
  File "llfuse/fuse_api.pxi", line 253, in llfuse.capi.init (src/llfuse/capi_linux.c:24362)
RuntimeError: fuse_mount failed
crunch@humgen-05-13:/$ arv-mount --replace /data/crunch-tmp/crunch-job/task/humgen-05-13.1.keep
crunch@humgen-05-13:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-13.1.keep type fuse (rw,nosuid,nodev,max_read=131072,user=crunch)
crunch@humgen-05-13:/$ ls /data/crunch-tmp/crunch-job/task/humgen-05-13.1.keep
by_id  by_tag  home  README  shared

#21 Updated by Joshua Randall 7 months ago

Another wedged node; this time arv-mount --unmount did work:

crunch@humgen-02-02:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-02-02.4.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-02-02:/$ ls /data/crunch-tmp/crunch-job/task/humgen-02-02.4.keep
crunch@humgen-02-02:/$ arv-mount /data/crunch-tmp/crunch-job/task/humgen-02-02.4.keep
mount: according to mtab, /dev/fuse is already mounted on /data/crunch-tmp/crunch-job/task/humgen-02-02.4.keep

2017-04-19 00:23:39 arvados.arv-mount[9218] ERROR: arv-mount: exception during mount: fuse_mount failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 365, in _run_standalone
    with self:
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 133, in __enter__
    llfuse.init(self.operations, self.args.mountpoint, self._fuse_options())
  File "llfuse/fuse_api.pxi", line 253, in llfuse.capi.init (src/llfuse/capi_linux.c:24362)
RuntimeError: fuse_mount failed
crunch@humgen-02-02:/$ arv-mount --unmount /data/crunch-tmp/crunch-job/task/humgen-02-02.4.keep
crunch@humgen-02-02:/$ ls /data/crunch-tmp/crunch-job/task/humgen-02-02.4.keep
crunch@humgen-02-02:/$ mount -t fuse
crunch@humgen-02-02:/$ arv-mount /data/crunch-tmp/crunch-job/task/humgen-02-02.4.keep
crunch@humgen-02-02:/$ ls /data/crunch-tmp/crunch-job/task/humgen-02-02.4.keep
by_id  by_tag  home  README  shared
crunch@humgen-02-02:/$ arv-mount --version
/usr/bin/arv-mount 0.1.20170407172413

#22 Updated by Joshua Randall 7 months ago

Another wedged node. On this one, neither --unmount-all nor --unmount worked until after I attempted to mount at the wedged mount point. After that attempt, the --unmount worked:

crunch@humgen-05-03:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-03:/$ ls /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep
crunch@humgen-05-03:/$ ps auxwww|grep arv-mount
crunch    7047  0.0  0.0   9388   912 pts/2    S+   00:27   0:00 grep arv-mount
crunch@humgen-05-03:/$ arv-mount --unmount-all /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep
crunch@humgen-05-03:/$ ps auxwww|grep arv-mount
crunch    7140  0.0  0.0   9388   912 pts/2    S+   00:28   0:00 grep arv-mount
crunch@humgen-05-03:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-03:/$ arv-mount --unmount /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep
crunch@humgen-05-03:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-03:/$ arv-mount /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep
mount: according to mtab, /dev/fuse is already mounted on /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep

2017-04-19 00:29:12 arvados.arv-mount[7367] ERROR: arv-mount: exception during mount: fuse_mount failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 365, in _run_standalone
    with self:
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 133, in __enter__
    llfuse.init(self.operations, self.args.mountpoint, self._fuse_options())
  File "llfuse/fuse_api.pxi", line 253, in llfuse.capi.init (src/llfuse/capi_linux.c:24362)
RuntimeError: fuse_mount failed
crunch@humgen-05-03:/$ arv-mount --unmount /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep
crunch@humgen-05-03:/$ mount -t fuse
crunch@humgen-05-03:/$ arv-mount /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep
crunch@humgen-05-03:/$ ls /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep
by_id  by_tag  home  README  shared
crunch@humgen-05-03:/$ arv-mount --unmount /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep
crunch@humgen-05-03:/$ ls /data/crunch-tmp/crunch-job/task/humgen-05-03.5.keep
crunch@humgen-05-03:/$ mount -t fuse

#23 Updated by Joshua Randall 7 months ago

Found one last node that is wedged, and managed to do some more diagnosing. It looks like when it is wedged on our systems:
- There is an entry in /etc/mtab
- There is initially no entry in /proc/self/mountinfo so `arv-mount --unmount` and `arv-mount --unmount-all` fail
- After attempting (and failing) to mount over the existing mountpoint, the entry appears in /proc/self/mountinfo after which the `arv-mount --unmount` succeeds

crunch@humgen-05-16:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-16:/$ cat /proc/self/mountinfo | grep fuse
23 17 0:17 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
crunch@humgen-05-16:/$ ls /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep/
crunch@humgen-05-16:/$ ps auxwww|grep arv-m
crunch   31545  0.0  0.0   9388   912 pts/2    S+   00:33   0:00 grep arv-m
crunch@humgen-05-16:/$ arv-mount --unmount-all /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep
crunch@humgen-05-16:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-16:/$ arv-mount --unmount /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep
crunch@humgen-05-16:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-16:/$ cat /proc/self/mountinfo | grep humgen-05-16.2.keep
crunch@humgen-05-16:/$ ls -l /sys/fs/fuse/connections/
total 0
crunch@humgen-05-16:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-16:/$ fusermount -u -z /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep
fusermount: failed to unmount /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep: Invalid argument
crunch@humgen-05-16:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-16:/$ python
Python 2.7.3 (default, Oct 26 2016, 21:01:49)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from arvados_fuse.unmount import unmount
>>> unmount(path='/data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep', subtype='', timeout=2.0, recursive=False)
False
>>> quit()
crunch@humgen-05-16:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-16:/$ python
Python 2.7.3 (default, Oct 26 2016, 21:01:49)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from arvados_fuse.unmount import unmount
>>> unmount(path='/data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep', subtype=None, timeout=2.0, recursive=False)
False
>>> quit()
crunch@humgen-05-16:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-16:/$ python
Python 2.7.3 (default, Oct 26 2016, 21:01:49)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from arvados_fuse.unmount import unmount
>>> unmount(path='/data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep', timeout=2.0)
False
>>> quit()
crunch@humgen-05-16:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-16:/$ cat /proc/self/mountinfo | grep humgen-05-16.2.keep
crunch@humgen-05-16:/$ arv-mount /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep
mount: according to mtab, /dev/fuse is already mounted on /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep

2017-04-19 00:45:00 arvados.arv-mount[2074] ERROR: arv-mount: exception during mount: fuse_mount failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 365, in _run_standalone
    with self:
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 133, in __enter__
    llfuse.init(self.operations, self.args.mountpoint, self._fuse_options())
  File "llfuse/fuse_api.pxi", line 253, in llfuse.capi.init (src/llfuse/capi_linux.c:24362)
RuntimeError: fuse_mount failed
crunch@humgen-05-16:/$ cat /proc/self/mountinfo | grep humgen-05-16.2.keep
43 39 0:31 / /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep rw,nosuid,nodev,relatime - fuse /dev/fuse rw,user_id=15324,group_id=1593,max_read=131072
crunch@humgen-05-16:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-16:/$ arv-mount --unmount /data/crunch-tmp/crunch-job/task/humgen-05-16.2.keep
crunch@humgen-05-16:/$ mount -t fuse
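
The wedged state described at the top of this comment (an entry in /etc/mtab with no matching entry in /proc/self/mountinfo) could be detected with something like the following Python 3 sketch. It assumes /etc/mtab is a regular file maintained by mount, which the "according to mtab" message above suggests; the function name is made up for the example.

```python
def stale_fuse_mtab_entries(mtab_text, mountinfo_text):
    """Return fuse mount points listed in mtab but absent from mountinfo.

    Hypothetical helper. mtab lines are "source mountpoint fstype options
    dump pass"; in mountinfo the mount point is the fifth field.
    """
    live = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            live.add(fields[4])
    stale = []
    for line in mtab_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mountpoint, fstype = fields[1], fields[2]
        # Accept "fuse" and "fuse.<subtype>", but not "fusectl".
        is_fuse = fstype == "fuse" or fstype.startswith("fuse.")
        if is_fuse and mountpoint not in live:
            stale.append(mountpoint)
    return stale


# Illustrative inputs: mtab still lists a fuse mount the kernel no longer reports.
mtab = (
    "none /sys/fs/fuse/connections fusectl rw 0 0\n"
    "/dev/fuse /mnt/keep fuse rw,nosuid,nodev,allow_other 0 0\n"
)
mountinfo = "23 17 0:17 / /sys/fs/fuse/connections rw,relatime - fusectl none rw\n"
print(stale_fuse_mtab_entries(mtab, mountinfo))
```

On a real node the two inputs would be the contents of /etc/mtab and /proc/self/mountinfo.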

#24 Updated by Joshua Randall 7 months ago

kernel and system versions:

root@humgen-05-13:~# uname -a
Linux humgen-05-13 3.13.0-85-generic #129~precise1-Ubuntu SMP Fri Mar 18 17:38:08 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
root@humgen-05-13:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.04.5 LTS
Release:        12.04
Codename:       precise

#25 Updated by Joshua Randall 7 months ago

Another set of machines was wedged today. I did some more testing on a call with Tom:

First, on humgen-02-02 we established that `arv-mount --replace` does NOT work initially (before a failed attempt to mount):

crunch@humgen-02-02:/$ cat /proc/self/mountinfo | grep fuse
23 17 0:17 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
crunch@humgen-02-02:/$ cat /etc/mtab
/dev/sda6 / ext4 rw,relatime,errors=remount-ro,user_xattr 0 0
proc /proc proc rw,noexec,nosuid,nodev 0 0
sysfs /sys sysfs rw,noexec,nosuid,nodev 0 0
none /sys/fs/fuse/connections fusectl rw 0 0
none /sys/kernel/debug debugfs rw 0 0
none /sys/kernel/security securityfs rw 0 0
udev /dev devtmpfs rw,mode=0755 0 0
devpts /dev/pts devpts rw,noexec,nosuid,gid=5,mode=0620 0 0
tmpfs /run tmpfs rw,noexec,nosuid,size=10%,mode=0755 0 0
none /run/lock tmpfs rw,noexec,nosuid,nodev,size=5242880 0 0
none /run/shm tmpfs rw,nosuid,nodev 0 0
cgroup /sys/fs/cgroup tmpfs rw,relatime,mode=755 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
/dev/sda1 /boot ext4 rw,errors=remount-ro 0 0
/dev/mapper/data-1 /data ext4 rw 0 0
rpc_pipefs /run/rpc_pipefs rpc_pipefs rw 0 0
/dev/fuse /data/crunch-tmp/crunch-job/task/humgen-02-02.3.keep fuse rw,nosuid,nodev,allow_other,max_read=131072,user=crunch 0 0
crunch@humgen-02-02:/$ arv-mount --replace /data/crunch-tmp/crunch-job/task/humgen-02-02.3.keep
mount: according to mtab, /dev/fuse is already mounted on /data/crunch-tmp/crunch-job/task/humgen-02-02.3.keep

2017-04-20 16:26:27 arvados.arv-mount[61468] ERROR: arv-mount: exception during mount: fuse_mount failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 365, in _run_standalone
    with self:
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 133, in __enter__
    llfuse.init(self.operations, self.args.mountpoint, self._fuse_options())
  File "llfuse/fuse_api.pxi", line 253, in llfuse.capi.init (src/llfuse/capi_linux.c:24362)
RuntimeError: fuse_mount failed
crunch@humgen-02-02:/$ cat /proc/self/mountinfo | grep fuse
23 17 0:17 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
42 39 0:31 / /data/crunch-tmp/crunch-job/task/humgen-02-02.3.keep rw,nosuid,nodev,relatime - fuse /dev/fuse rw,user_id=15324,group_id=1593,max_read=131072
crunch@humgen-02-02:/$ arv-mount --replace /data/crunch-tmp/crunch-job/task/humgen-02-02.3.keep
crunch@humgen-02-02:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-02-02.3.keep type fuse (rw,nosuid,nodev,max_read=131072,user=crunch)
crunch@humgen-02-02:/$ fusermount -u /data/crunch-tmp/crunch-job/task/humgen-02-02.3.keep
crunch@humgen-02-02:/$ mount -t fuse
crunch@humgen-02-02:/$ exit
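
The mismatch seen above (a fuse entry in /etc/mtab with no matching mount in /proc/self/mountinfo) could be detected automatically. The following is a minimal sketch, not something crunch-job does today. The function names are hypothetical, and it assumes that mount points contain no whitespace (mtab octal escapes such as \040 are not decoded).

```python
def mountinfo_mountpoints(mountinfo_text):
    """Return the set of mount points listed in /proc/self/mountinfo text."""
    points = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            points.add(fields[4])  # the fifth field is the mount point
    return points


def stale_fuse_mtab_entries(mtab_text, mountinfo_text):
    """Return mount points that mtab lists as fuse but the kernel does not."""
    live = mountinfo_mountpoints(mountinfo_text)
    stale = []
    for line in mtab_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mountpoint, fstype = fields[1], fields[2]
        # Match "fuse" and subtypes such as "fuse.foo", but not "fusectl".
        is_fuse = fstype == "fuse" or fstype.startswith("fuse.")
        if is_fuse and mountpoint not in live:
            stale.append(mountpoint)
    return stale
```

To use it on a node, read /etc/mtab and /proc/self/mountinfo and pass their contents in. On humgen-02-02 before the first mount attempt, this would be expected to report the humgen-02-02.3.keep mount point.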

Then, on humgen-05-03, we discovered that you can use --subtype to mount a different fuse subtype on top of the old mountpoint. However, once that mount is gone, the original "Transport endpoint is not connected" error returns (and there is still an entry in mtab, but none in mountinfo until after an attempt to mount). Calling `arv-mount --replace` twice in a row does work (the first call fails, the second succeeds):

crunch@humgen-05-03:/$ cat /proc/self/mountinfo  | grep fuse
24 17 0:18 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
crunch@humgen-05-03:/$ clear
crunch@humgen-05-03:/$ arv-mount --subtype foo --replace /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep
crunch@humgen-05-03:/$ ls /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep
by_id  by_tag  home  README  shared
crunch@humgen-05-03:/$ mount -t fuse.foo
foo on /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep type fuse.foo (rw,nosuid,nodev,max_read=131072,user=crunch)
crunch@humgen-05-03:/$ cat /etc/mtab |grep fuse
none /sys/fs/fuse/connections fusectl rw 0 0
/dev/fuse /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep fuse rw,nosuid,nodev,allow_other,max_read=131072,user=crunch 0 0
foo /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep fuse.foo rw,nosuid,nodev,max_read=131072,user=crunch 0 0
crunch@humgen-05-03:/$ cat /proc/self/mountinfo |grep fuse
24 17 0:18 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
43 40 0:31 / /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep rw,nosuid,nodev,relatime - fuse.foo foo rw,user_id=15324,group_id=1593,max_read=131072
crunch@humgen-05-03:/$ mount -t fuse.foo
foo on /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep type fuse.foo (rw,nosuid,nodev,max_read=131072,user=crunch)
crunch@humgen-05-03:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-03:/$ arv-mount --subtype bar --replace /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep
crunch@humgen-05-03:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-03:/$ mount -t fuse.foo
crunch@humgen-05-03:/$ mount -t fuse.bar
bar on /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep type fuse.bar (rw,nosuid,nodev,max_read=131072,user=crunch)
crunch@humgen-05-03:/$ ps auxwww|grep arv-m
crunch   23232  1.5  0.0 474972 27840 ?        Sl   16:34   0:00 /usr/bin/python2.7 /usr/bin/arv-mount --subtype bar --replace /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep
crunch   23297  0.0  0.0   9388   916 pts/2    S+   16:34   0:00 grep arv-m
crunch@humgen-05-03:/$ mount -t fuse
/dev/fuse on /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep type fuse (rw,nosuid,nodev,allow_other,max_read=131072,user=crunch)
crunch@humgen-05-03:/$ arv-mount --replace /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep
mount: according to mtab, /dev/fuse is already mounted on /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep

2017-04-20 16:35:16 arvados.arv-mount[23408] ERROR: arv-mount: exception during mount: fuse_mount failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 365, in _run_standalone
    with self:
  File "/usr/lib/python2.7/dist-packages/arvados_fuse/command.py", line 133, in __enter__
    llfuse.init(self.operations, self.args.mountpoint, self._fuse_options())
  File "llfuse/fuse_api.pxi", line 253, in llfuse.capi.init (src/llfuse/capi_linux.c:24362)
RuntimeError: fuse_mount failed
crunch@humgen-05-03:/$ cat /proc/self/mountinfo | grep self
crunch@humgen-05-03:/$ cat /proc/self/mountinfo | grep fuse
24 17 0:18 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
43 40 0:31 / /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep rw,nosuid,nodev,relatime - fuse /dev/fuse rw,user_id=15324,group_id=1593,max_read=131072
crunch@humgen-05-03:/$ ls /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep
ls: cannot access /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep: Transport endpoint is not connected
crunch@humgen-05-03:/$ ls /sys/fs/fuse/connections/31/
abort                 congestion_threshold  max_background        waiting
crunch@humgen-05-03:/$ ls /sys/fs/fuse/connections/31/
abort                 congestion_threshold  max_background        waiting
crunch@humgen-05-03:/$ echo "1" | /sys/fs/fuse/connections/31/abort
-su: /sys/fs/fuse/connections/31/abort: Permission denied
crunch@humgen-05-03:/$ echo "1" > /sys/fs/fuse/connections/31/abort
crunch@humgen-05-03:/$ cat /proc/self/mountinfo | grep fuse
24 17 0:18 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
43 40 0:31 / /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep rw,nosuid,nodev,relatime - fuse /dev/fuse rw,user_id=15324,group_id=1593,max_read=131072
crunch@humgen-05-03:/$ ps auxwww|grep arv-m
crunch   24618  0.0  0.0   9388   912 pts/2    S+   16:39   0:00 grep arv-m
crunch@humgen-05-03:/$ ls /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep
ls: cannot access /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep: Transport endpoint is not connected
crunch@humgen-05-03:/$ cat /etc/mtab | grep fuse
none /sys/fs/fuse/connections fusectl rw 0 0
/dev/fuse /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep fuse rw,nosuid,nodev,allow_other,max_read=131072,user=crunch 0 0
crunch@humgen-05-03:/$ cat /etc/mtab | grep fuse
none /sys/fs/fuse/connections fusectl rw 0 0
/dev/fuse /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep fuse rw,nosuid,nodev,allow_other,max_read=131072,user=crunch 0 0
crunch@humgen-05-03:/$ fusermount -u -z /data/crunch-tmp/crunch-job/task/humgen-05-03.10.keep
crunch@humgen-05-03:/$ cat /etc/mtab | grep fuse
none /sys/fs/fuse/connections fusectl rw 0 0
crunch@humgen-05-03:/$ mount -t fuse
crunch@humgen-05-03:/$ exit
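
Since a second `arv-mount --replace` succeeds where the first fails, one possible stopgap is to retry the mount. This is only a sketch of that idea: `mount_with_retry` is a hypothetical helper, and it assumes that arv-mount exits with a nonzero status when the mount fails, which the transcripts above do not show.

```python
import subprocess


def mount_with_retry(mountpoint, attempts=2, run=subprocess.call):
    """Run `arv-mount --replace MOUNTPOINT` until it exits 0.

    Returns the number of the attempt that succeeded, or 0 if all failed.
    `run` takes an argument list and returns an exit status; it is a
    parameter so the logic can be exercised without mounting anything.
    """
    command = ["arv-mount", "--replace", mountpoint]
    for attempt in range(1, attempts + 1):
        # Assumption: a failed mount is reported as a nonzero exit status.
        if run(command) == 0:
            return attempt
    return 0
```

This papers over the stale mtab entry rather than fixing it, so it is at best a workaround.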

Finally, on humgen-05-10, we found that manually removing the offending line from /etc/mtab (as root) does work: after that, a plain arv-mount succeeds. This suggests that a race condition in updating /etc/mtab may be the underlying cause.

crunch@humgen-05-10:/$ cat /etc/mtab
/dev/sda6 / ext4 rw 0 0
proc /proc proc rw,noexec,nosuid,nodev 0 0
sysfs /sys sysfs rw,noexec,nosuid,nodev 0 0
none /sys/fs/fuse/connections fusectl rw 0 0
none /sys/kernel/debug debugfs rw 0 0
none /sys/kernel/security securityfs rw 0 0
udev /dev devtmpfs rw,mode=0755 0 0
devpts /dev/pts devpts rw,noexec,nosuid,gid=5,mode=0620 0 0
tmpfs /run tmpfs rw,noexec,nosuid,size=10%,mode=0755 0 0
none /run/lock tmpfs rw,noexec,nosuid,nodev,size=5242880 0 0
none /run/shm tmpfs rw,nosuid,nodev 0 0
cgroup /sys/fs/cgroup tmpfs rw,relatime,mode=755 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,relatime,hugetlb 0 0
/dev/sda7 /tmp ext4 rw 0 0
/dev/sda8 /data xfs rw 0 0
/dev/sda1 /boot ext4 rw,errors=remount-ro 0 0
rpc_pipefs /run/rpc_pipefs rpc_pipefs rw 0 0
/dev/fuse /data/crunch-tmp/crunch-job/task/humgen-05-10.7.keep fuse rw,nosuid,nodev,allow_other,max_read=131072,user=crunch 0 0
crunch@humgen-05-10:/$ cat /proc/self/mountinfo | grep fuse
23 17 0:17 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
crunch@humgen-05-10:/$ exit
logout
root@humgen-05-10:~# vi /etc/mtab   ### MANUALLY REMOVE MTAB ENTRY FOR /data/crunch-tmp/crunch-job/task/humgen-05-10.7.keep
root@humgen-05-10:~# su - crunch
No directory, logging in with HOME=/
crunch@humgen-05-10:/$ export ARVADOS_API_HOST=api.arvados.sanger.ac.uk
crunch@humgen-05-10:/$ export ARVADOS_API_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
crunch@humgen-05-10:/$ arv-mount /data/crunch-tmp/crunch-job/task/humgen-05-10.7.keep
crunch@humgen-05-10:/$ ls -l /data/crunch-tmp/crunch-job/task/humgen-05-10.7.keep
total 3
dr-xr-xr-x 1 crunch arvados   0 Apr 20 16:44 by_id
dr-xr-xr-x 1 crunch arvados   0 Apr 20 16:44 by_tag
dr-xr-xr-x 1 crunch arvados   0 Apr 20 16:44 home
-r--r--r-- 1 crunch arvados 512 Apr 20 16:44 README
dr-xr-xr-x 1 crunch arvados   0 Apr 20 16:44 shared
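
The manual `vi /etc/mtab` step above amounts to dropping one line by its mount point. A minimal sketch of that filtering step follows. `drop_mtab_entry` is a hypothetical name, and the function only transforms text: writing the result back to /etc/mtab would need root and could itself race with mount and fusermount.

```python
def drop_mtab_entry(mtab_text, mountpoint):
    """Return mtab text without the lines whose mount point matches."""
    kept = []
    for line in mtab_text.splitlines(True):  # True keeps the line endings
        fields = line.split()
        # The second whitespace-separated field of an mtab line is the
        # mount point; escapes such as \040 are compared as written.
        if len(fields) >= 2 and fields[1] == mountpoint:
            continue
        kept.append(line)
    return "".join(kept)
```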

#26 Updated by Joshua Randall 7 months ago

There do seem to be known race conditions when FUSE filesystems update /etc/mtab:

https://bugzilla.redhat.com/show_bug.cgi?id=651183
http://fuse.996288.n3.nabble.com/Security-Problem-in-fusermount-td12269.html

It may be a workaround for us to get rid of /etc/mtab and just symlink it to /proc/self/mounts as is done in newer systems?

#27 Updated by Tom Clegg 7 months ago

Joshua Randall wrote:

It may be a workaround for us to get rid of /etc/mtab and just symlink it to /proc/self/mounts as is done in newer systems?

Yes, this seems to be the way mtab-editing problems end up getting fixed. From https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=94076:

/etc/mtab is now a symlink to /proc/mounts. Bugs which were a result of editing /etc/mtab which make it get out of sync with the real kernel state are now no longer an issue.
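
Whether a given node already has the symlink arrangement can be checked with a few lines. This is a sketch only: `mtab_is_kernel_backed` is a hypothetical helper, and the set of accepted targets is an assumption based on the two paths named in this thread.

```python
import os

# Targets mentioned in this thread; other layouts are not considered.
KERNEL_MOUNT_TABLES = ("/proc/mounts", "/proc/self/mounts")


def mtab_is_kernel_backed(path="/etc/mtab"):
    """Return True if path is a symlink to one of the /proc mount tables."""
    if not os.path.islink(path):
        return False
    target = os.readlink(path)
    # Resolve a relative target (e.g. ../proc/self/mounts) against the
    # directory that holds the link, without following further links.
    resolved = os.path.normpath(os.path.join(os.path.dirname(path), target))
    return resolved in KERNEL_MOUNT_TABLES
```

On the Ubuntu 12.04 nodes shown above, where /etc/mtab is a regular file edited by the mount tools, this would be expected to return False.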

#28 Updated by Tom Clegg 7 months ago

  • Status changed from In Progress to Feedback

#29 Updated by Tom Clegg 7 months ago

  • Target version deleted (2017-04-26 sprint)

#30 Updated by Tom Clegg 6 months ago

  • Status changed from Feedback to Resolved
