Story #8345

[FUSE] Support llfuse 0.42+

Added by Brett Smith about 5 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
FUSE
Target version:
-
Start date:
02/03/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

run-build-packages-all-targets.log.txt (558 KB) run-build-packages-all-targets.log.txt Tom Clegg, 02/20/2016 09:50 AM
ubuntu1204-ruby-crash.txt (249 KB) ubuntu1204-ruby-crash.txt Tom Clegg, 03/01/2016 04:10 PM

Subtasks

Task #8660: Review Tom's commits on 8345-llfuse-unpin on arvados-devResolvedTom Clegg

Task #8603: Review/merge 8345-llfuse-unpin and 8345-arvados-fuse-llfuse-0.42ClosedTom Clegg

Task #8718: Solve intermittent hanging testsResolvedTom Clegg


Related issues

Related to Arvados - Bug #10805: [FUSE] Upgrade llfuse to 1.2, fix deadlock in test suiteResolved01/04/2017

Related to Arvados - Bug #12134: [arv-mount] Fix test deadlock by using new llfuse in test suiteResolved08/16/2017

Associated revisions

Revision 07c05348 (diff)
Added by Tom Clegg about 5 years ago

Fix dependencies after installing built package. refs #8345

Revision 07c05348 (diff)
Added by Tom Clegg about 5 years ago

Fix dependencies after installing built package. refs #8345

Revision 555b0396
Added by Tom Clegg about 5 years ago

Merge branch '8345-llfuse-unpin' refs #8345

Revision 555b0396
Added by Tom Clegg about 5 years ago

Merge branch '8345-llfuse-unpin' refs #8345

Revision 23391a8d (diff)
Added by Tom Clegg about 5 years ago

Fix backports dir search for PACKAGE="llfuse==0.41.1". refs #8345

Revision a0f1509b (diff)
Added by Tom Clegg about 5 years ago

Fix backports dir search for PACKAGE="llfuse==0.41.1".

Also, drop "-v 0.41.1" args from fpm-info.sh: cannot specify version
both here and in run-build-packages.sh.

refs #8345

Revision bfe7a624
Added by Tom Clegg over 4 years ago

Merge branch '8345-llfuse-1.0' closes #8345

Revision 48d81068 (diff)
Added by Tom Clegg over 4 years ago

8345: Build python27-python-contextlib2 for centos7.

refs #8345

Revision 8fc563f1
Added by Tom Clegg over 4 years ago

Merge branch '8345-revert-llfuse-to-0.41.1'

refs #8345

Revision 035a48b3 (diff)
Added by Ward Vandewege over 4 years ago

Fix run-tests.sh --only option for the workbench suite.

refs #8345

History

#1 Updated by Joshua Randall about 5 years ago

I've already implemented a version of arvados_fuse that runs against 0.42.1 (because I wanted it for the additional operations that get RequestContext). I'll split out the minimum needed to build and submit a PR.

From https://github.com/python-llfuse/python-llfuse/blob/release-0.42.1/Changes.rst:

BACKWARDS INCOMPATIBLE CHANGES:

Ducktyping the EntryAttributes class is no longer allowed, Operations methods must return instances of this class (rather than any object that provides the neccessary attributes).
The !st_Xtime attributes of the EntryAttributes have been dropped in favor of the !st_Xtime_ns attributes.
The setattr method now always receives a completely filled EntryAttributes instance. To determine which attributes should be changed, a new fields argument has been introduced.
The ~Operations.setattr method now also receives an fh parameter.
The llfuse.main function now has a workers parameter, and the single parameter was dropped.
Almost all request handlers now receive a RequestContext instance in an additional parameter.

#2 Updated by Joshua Randall about 5 years ago

A pair of pull requests implementing arvados_fuse support for llfuse >= 0.42 (including fuse backports for ubuntu1204):

https://github.com/curoverse/arvados-dev/pull/6
https://github.com/curoverse/arvados/pull/36

This has been tested and appears to be working on ubuntu1204. I also ran the jenkins script for all targets, which had some errors that seemed to be unrelated to the work I did. Not sure if it was a transient failure or what - I leave it to curoverse jenkins to test them completely (N.B. must be tested as an arvados / arvados-dev pair).

#3 Updated by Tom Clegg about 5 years ago

Thanks for doing this!

On my jessie dev system, pip upgraded my llfuse to 0.42.1, but then:

running build_ext
Traceback (most recent call last):
  File "setup.py", line 43, in <module>
    cmdclass={'egg_info': tagger},
...
  File "/home/tom/src/arvados/services/fuse/tests/test_command_args.py", line 2, in <module>
    import arvados_fuse
  File "/home/tom/src/arvados/services/fuse/arvados_fuse/__init__.py", line 77, in <module>
    llfuse.capi._notify_queue = Queue.Queue()
AttributeError: 'module' object has no attribute 'capi'

Haven't yet figured out what that's trying to tell me.

centos6 seems to have 2.8.3-4 -- does this mean centos6 won't be able to install python-arvados-fuse?

AFAIK jenkins doesn't make it easy to test a matched pair of arvados-dev and arvados branches without merging them (one of various pitfalls of having stuff split between the two repos). Will try to build locally, but probably won't get to it right away.

#4 Updated by Joshua Randall about 5 years ago

Oops - forgot to cherry-pick that change, I should have time to update it tomorrow from the market square in Krakow.

#5 Updated by Joshua Randall about 5 years ago

Regarding centos6, yes, we'll need a fuse 4.9.1+ in order for llfuse 0.42 to work. I guess I can try `fpm -s deb -t rpm` on the backported ubuntu 12.04 package i've created. I'll give that a try and see if it works...

#6 Updated by Joshua Randall about 5 years ago

I've managed to do another fuse 2.9.2 backport (this time from centos7 to centos6). Testing now, will push soon if all goes well.

#7 Updated by Joshua Randall about 5 years ago

I've complete the centos6 backport.

https://github.com/wtsi-hgi/arvados-dev/tree/8345-llfuse-unpin f46095f
https://github.com/wtsi-hgi/arvados/tree/8345-arvados-fuse-llfuse-0.42
0910614

#8 Updated by Joshua Randall about 5 years ago

Re: jenkins testing of arvados / arvados-dev repo pairs, why not include a sit submodule of arvados-dev in the arvados repo and set up jenkins to `git submodule update` in order to get the matched arvados-dev repo at exactly the submodule'd version.

#9 Updated by Joshua Randall about 5 years ago

In reviewing the python-llfuse 0.42+ support, I found an issue with the setattr operation, whose signature has changed and needs updating. I'll work on that now.

#10 Updated by Joshua Randall about 5 years ago

Everything should be working with llfuse 0.42 as of https://github.com/wtsi-hgi/arvados/tree/8345-arvados-fuse-llfuse-0.42 @ de8054d6

#11 Updated by Tom Clegg about 5 years ago

I added these to Josh's branch on arvados-dev, bringing it to arvados-dev|07c0534 on 8345-llfuse-unpin → arvados-dev|d7bda40 with master merged.
  • 07c0534 Fix dependencies after installing built package. refs #8345
  • 4640c72 8345: Add libselinux-dev build-dep missed by awk script, skip "recommended" packages

With those, everything seemed to work except workbench, which dies for seemingly unrelated reasons:

Using rails 4.1.12
Using sass-rails 5.0.1
Using piwik_analytics 1.0.2
Using themes_for_rails 0.5.1 from https://github.com/curoverse/themes_for_rails (at /arvados/apps/workbench/vendor/cache/themes_for_rails-61154877047d@6115487)
Updating files in vendor/cache
Bundle complete! 43 Gemfile dependencies, 108 gems now installed.
Bundled gems are installed into ./vendor/bundle.
rake aborted!
LoadError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.14' not found (required by /arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/oj-2.11.2/ext/oj/oj.so) - /arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/oj-2.11.2/ext/oj/oj.so
/arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/activesupport-4.1.12/lib/active_support/dependencies.rb:247:in `require'
/arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/activesupport-4.1.12/lib/active_support/dependencies.rb:247:in `block in require'
/arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/activesupport-4.1.12/lib/active_support/dependencies.rb:232:in `load_dependency'
/arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/activesupport-4.1.12/lib/active_support/dependencies.rb:247:in `require'
/arvados/apps/workbench/vendor/bundle/ruby/2.1.0/gems/oj-2.11.2/lib/oj.rb:42:in `<top (required)>'
/usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:77:in `require'
/usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:77:in `block (2 levels) in require'
/usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:72:in `each'
/usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:72:in `block in require'
/usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:61:in `each'
/usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler/runtime.rb:61:in `require'
/usr/local/rvm/gems/ruby-2.1.5/gems/bundler-1.11.2/lib/bundler.rb:99:in `require'
/arvados/apps/workbench/config/application.rb:5:in `<top (required)>'
/arvados/apps/workbench/Rakefile:5:in `require'
/arvados/apps/workbench/Rakefile:5:in `<top (required)>'
(See full trace by running task with --trace)
ERROR: Asset precompilation failed
ERROR:  build failed with exit status 1.

I suspect this has to do with the apps/workbench/vendor/cache directory being shared among the build jobs for different OS vendors/versions.

I'm giving up for now.

#12 Updated by Nico César about 5 years ago

maybe is related to the glibc update done last week?

we have 2.13-38+deb7u8 there:

ci:/home/nico# grep libc6 /var/log/dpkg.log 
2016-02-17 15:26:13 upgrade libc6-dev:amd64 2.13-38+deb7u8 2.13-38+deb7u10
2016-02-17 15:26:13 status half-configured libc6-dev:amd64 2.13-38+deb7u8
2016-02-17 15:26:13 status unpacked libc6-dev:amd64 2.13-38+deb7u8
2016-02-17 15:26:13 status half-installed libc6-dev:amd64 2.13-38+deb7u8
2016-02-17 15:26:14 status half-installed libc6-dev:amd64 2.13-38+deb7u8
2016-02-17 15:26:14 status unpacked libc6-dev:amd64 2.13-38+deb7u10
2016-02-17 15:26:14 status unpacked libc6-dev:amd64 2.13-38+deb7u10
2016-02-17 15:26:16 upgrade libc6:amd64 2.13-38+deb7u8 2.13-38+deb7u10
2016-02-17 15:26:16 status half-configured libc6:amd64 2.13-38+deb7u8
2016-02-17 15:26:16 status unpacked libc6:amd64 2.13-38+deb7u8
2016-02-17 15:26:16 status half-installed libc6:amd64 2.13-38+deb7u8
2016-02-17 15:26:16 status half-installed libc6:amd64 2.13-38+deb7u8
2016-02-17 15:26:16 status unpacked libc6:amd64 2.13-38+deb7u10
2016-02-17 15:26:16 status unpacked libc6:amd64 2.13-38+deb7u10
2016-02-17 15:26:17 configure libc6:amd64 2.13-38+deb7u10 <none>
2016-02-17 15:26:17 status unpacked libc6:amd64 2.13-38+deb7u10
2016-02-17 15:26:17 status unpacked libc6:amd64 2.13-38+deb7u10
2016-02-17 15:26:17 status half-configured libc6:amd64 2.13-38+deb7u10
2016-02-17 15:26:18 status installed libc6:amd64 2.13-38+deb7u10
2016-02-17 15:26:30 configure libc6-dev:amd64 2.13-38+deb7u10 <none>
2016-02-17 15:26:30 status unpacked libc6-dev:amd64 2.13-38+deb7u10
2016-02-17 15:26:30 status half-configured libc6-dev:amd64 2.13-38+deb7u10
2016-02-17 15:26:30 status installed libc6-dev:amd64 2.13-38+deb7u10

Why is it looking for 2.14?

#13 Updated by Brett Smith about 5 years ago

  • Target version set to 2016-03-16 sprint

#14 Updated by Tom Clegg about 5 years ago

Hm.

[NOTE]
You may have encountered a bug in the Ruby interpreter or extension libraries.

#15 Updated by Brett Smith about 5 years ago

  • Story points set to 0.5

#16 Updated by Brett Smith about 5 years ago

  • Category set to FUSE
  • Assigned To set to Tom Clegg

#17 Updated by Joshua Randall about 5 years ago

I did successfully run `run-build-packages-all-targets.sh` when I completed the work ~2 weeks ago, so this is strange. I can try running it again here to see if it is failing for me now as well.

#18 Updated by Tom Clegg about 5 years ago

Phew. Now passing all package-building and -installing tests:
  • arvados 8345-arvados-fuse-llfuse-0.42 @ 0bbcf2e (merged master, otherwise unchanged from Josh's branch at de8054d)
  • arvados-dev 8345-llfuse-unpin @ arvados-dev|8eda2e7
arvados-dev needed only a few small changes. Each one seems to have taken me a fair amount of wall clock to figure out, unfortunately.

#19 Updated by Tom Clegg about 5 years ago

  • Status changed from New to In Progress

#20 Updated by Brett Smith about 5 years ago

Reviewing arvados-dev 8345-llfuse-pin at arvados-dev|commit:8eda2e7

I reviewed with git log -p. Individually all the commits make sense to me and the code's all clear. I'm running the build now, but I won't see the results until the morning. If you want to merge, I think that's fine; if you want to wait and see the results of the sanity check of running on someone else's system, I'll let you know what happens. Thanks.

#21 Updated by Tom Clegg about 5 years ago

Huh. I'm sure this test has been passing with the st_mtime_ns change, but I just happened to try running tests with a clean --temp, and now Something Is Different:

======================================================================
ERROR: runTest (tests.test_mount.FuseSharedTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/src/arvados/services/fuse/tests/test_mount.py", line 243, in runTest
    self.assertEqual(st.st_mtime_ns, 1397493304)
AttributeError: 'posix.stat_result' object has no attribute 'st_mtime_ns'

----------------------------------------------------------------------

#22 Updated by Tom Clegg about 5 years ago

Hm, looks like the change to st_mtime_ns might be buggy? I changed the test to use st_mtime if st_mtime_ns isn't available, and I get this:

======================================================================
FAIL: runTest (tests.test_mount.FuseSharedTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/src/arvados/services/fuse/tests/test_mount.py", line 247, in runTest
    self.assertEqual(mtime, 1397493304)
AssertionError: 1.3974933040000002 != 1397493304

----------------------------------------------------------------------

I suspect we need to fix a9fa507c505ab702cae850f4ccd5aae26f6218a9 to multiply by 10^9...

#23 Updated by Brett Smith about 5 years ago

run-build-package-all-targets.sh succeeded for me, FWIW.

#24 Updated by Tom Clegg about 5 years ago

mtime issue fixed in b257d00

#25 Updated by Brett Smith about 5 years ago

Current status: I am on arvados b257d006610748cb133c352f9ce95b61d1c66e17, arvados-dev|2b9b7518a60a71315a1504bf96b3182122bec702.

Package building worked fine for me.

Running tests has been a hassle because they frequently hang. It seems like the hang can strike any test: I have seen it happen to tests.test_mount.FuseMvFileBetweenCollectionsTest, tests.test_mount.FuseWriteFileTest, and a third test that was so verbose I didn't catch the name.

% dpkg -l | grep -i fuse
ii  fuse                                                        2.9.3-15+deb8u2                      amd64        Filesystem in Userspace
ii  libfuse-dev                                                 2.9.3-15+deb8u2                      amd64        Filesystem in Userspace (development)
ii  libfuse2:amd64                                              2.9.3-15+deb8u2                      amd64        Filesystem in Userspace (library)

I have to SIGKILL the tests when this happens. SIGINT gets me nowhere.

For unrelated work, I installed the FUSE driver from this branch and tried using it to upload data, running cat newfile >~/keep/home/…/oldfile. That showed me this output, which seems at least potentially interesting:

% arv-mount --read-write --foreground ~/keep/
2016-03-09 16:50:33 arvados.arvados_fuse[27307] ERROR: Unhandled exception during FUSE operation
Traceback (most recent call last):
  File "/home/brett/.local/lib/python2.7/site-packages/arvados_fuse-0.1.20160309185157-py2.7.egg/arvados_fuse/__init__.py", line 277, in catch_exceptions_wrapper
    return orig_func(self, *args, **kwargs)
  File "/home/brett/.local/lib/python2.7/site-packages/arvados_fuse-0.1.20160309185157-py2.7.egg/arvados_fuse/__init__.py", line 397, in on_event
    parent.update()
  File "/home/brett/.local/lib/python2.7/site-packages/arvados_fuse-0.1.20160309185157-py2.7.egg/arvados_fuse/fusefile.py", line 98, in update 
    self._mtime = convertTime(obj['modified_at']) if 'modified_at' in obj else 0
TypeError: argument of type 'NoneType' is not iterable
Exception in thread WebSocketClient:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "build/bdist.linux-x86_64/egg/ws4py/websocket.py", line 427, in run
    if not self.once():
  File "build/bdist.linux-x86_64/egg/ws4py/websocket.py", line 305, in once
    if not self.process(b):
  File "build/bdist.linux-x86_64/egg/ws4py/websocket.py", line 377, in process
    self.received_message(s.message)
  File "/home/brett/.local/lib/python2.7/site-packages/arvados_python_client-0.1.20160307054726-py2.7.egg/arvados/events.py", line 46, in received_message
    self.on_event(json.loads(str(m)))
  File "/home/brett/.local/lib/python2.7/site-packages/arvados_fuse-0.1.20160309185157-py2.7.egg/arvados_fuse/__init__.py", line 290, in catch_exceptions_wrapper
    raise llfuse.FUSEError(errno.EIO)
FUSEError: Input/output error

#26 Updated by Brett Smith about 5 years ago

Apparently my write succeeded despite the unhandled exception.

#27 Updated by Brett Smith about 5 years ago

Ran the tests in arvbox on a different system and reproduced the hang, this time in tests.test_mount.FuseMvFileTest.

% sudo cat /proc/559/stack
[<ffffffff810d192c>] futex_wait_queue_me+0xcc/0x130
[<ffffffff810d1c02>] futex_wait+0x162/0x260
[<ffffffff81078454>] get_signal_to_deliver+0x104/0x5a0
[<ffffffff810d3d82>] do_futex+0xf2/0xb60
[<ffffffff81078325>] ptrace_notify+0x55/0x80
[<ffffffff81020c0f>] tracehook_report_syscall_exit+0x2f/0xa0
[<ffffffff810d485e>] SyS_futex+0x6e/0x150
[<ffffffff81012ea8>] do_notify_resume+0x78/0xa0
[<ffffffff81513ecd>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff

Stray thought: We previously saw hangs in arv-mount because llfuse had a limited message queue that could fill and block. arv-mount monkeypatches it now, but is it possible that code no longer works as intended with newer versions of llfuse?

#28 Updated by Joshua Randall about 5 years ago

I did have to change the line that you use to override the Queue, as the internals had changed and `capi` was no longer exposed. I thought we were ok since the tests passed, but maybe there isn't a test for the queue overloaded situation that required that hack in the first place?

https://github.com/curoverse/arvados/compare/master...wtsi-hgi:8345-arvados-fuse-llfuse-0.42#diff-0841ddd54175fd75cca5de647b45e6edL77

#29 Updated by Tom Clegg about 5 years ago

for _notify_queue, see #3198#note-43 (note-43) and 08284382b53f621c09c4ffc87d82fa0261a69d32

FWIW, according to those notes, the "queue full" problem required lots of activity. E.g., it seems unlikely that tests.test_mount.FuseWriteFileTest would have hit it.

So far I've seen only one hang:

runTest (tests.test_mount.FuseNoAPITest) ...

#30 Updated by Tom Clegg about 5 years ago

dd40196 fixes a bug that is surely related to #8345#note-25 (note-25).

I don't quite see why that would manifest as hanging tests, though.

#31 Updated by Peter Amstutz about 5 years ago

Going back over my email exchanges with Nikolaus Rath, I reported the Queue full problem to him and he agreed with my fix and asked me to file an issue to change it. However I can't find an issue on the llfuse Bitbucket repo so I may have failed to do so, and as a result the change never made it in.

#32 Updated by Brett Smith about 5 years ago

Tom Clegg wrote:

dd40196 fixes a bug that is surely related to #8345#note-25 (note-25).

I don't quite see why that would manifest as hanging tests, though.

Unfortunately it doesn't seem to have made any noticeable improvement in my test runs. Still hanging on a variety of tests.

#33 Updated by Brett Smith about 5 years ago

Ideas to try from here:

  • Try removing the debugging additions from the branch, so we're "just" testing llfuse 0.42 support, to try to bisect the problem.
  • Try testing in even more environments: arvbox, different distros, etc.
  • Convert tests using the old filesystem access methods to the new multiprocessing-based methods.
  • Upgrade directly to 1.0.

#34 Updated by Brett Smith about 5 years ago

  • Target version changed from 2016-03-16 sprint to Arvados Future Sprints

#35 Updated by Tom Clegg about 5 years ago

I'm not certain this is related, but I'm using a different dev box than usual, and this test keeps hanging. As with note-25 above, SIGINT and SIGTERM don't help: I have to SIGKILL. Notably, I'm at 02a2277 (from #8879), which does not have the changes from #8345 that we've been blaming these deadlocks on.

test_cache_spill (tests.test_cache.TmpCollectionTest) ... [keep0] 2016/04/06 15:46:27 [[::1]:35127] PUT 16a7cbedc0cf22673283940f5bb78f36 9 0.004485s 0.004473s 0.000013s 200 86 "OK" 
[keep1] 2016/04/06 15:46:27 [[::1]:59524] PUT 16a7cbedc0cf22673283940f5bb78f36 9 0.009449s 0.009432s 0.000016s 200 86 "OK" 
...
[keep0] 2016/04/06 15:46:30 [[::1]:35172] PUT 5b36a2183b880ca29e01bd23559e7080 6 0.003038s 0.003020s 0.000018s 200 86 "OK" 
2016-04-06 15:46:31 arvados.arv-mount[20971] WARNING: Mount.__exit__: llfuse thread still alive 0.100000s after umount -- abandoning and exiting anyway
ok
Sent SIGTERM to 20985 (/home/tom/src/2arvados/tmp/keep0.pid)
[keep0] 2016/04/06 15:46:33 caught signal: terminated
[keep0] 2016/04/06 15:46:33 keepstore exiting, pid 20985
Sent SIGTERM to 20996 (/home/tom/src/2arvados/tmp/keep1.pid)
[keep1] 2016/04/06 15:46:33 caught signal: terminated
[keep1] 2016/04/06 15:46:33 keepstore exiting, pid 20996
test_CreateCollectionWithManyFilesAndMoveAndDeleteFile (tests.performance.test_collection_performance.CreateCollectionWithManyFilesAndMoveAndDeleteFile) ... 

It looks like process 20971 ("setup.py test") is waiting for 21114 to exit, meanwhile 21114 is waiting for 20971 to write to a pipe. I'm guessing this means that the test has finished, and the multiprocessing pool worker isn't exiting properly. Perhaps llfuse/arvados_fuse's use of threads conflicts with the multiprocessing module somehow?

$ pstree -p 20971
python(20971)─┬─python(21114)
              ├─sed(20984)
              ├─sed(20995)
              ├─{python}(21006)
              ├─{python}(21007)
              ├─{python}(21011)
              ├─{python}(21012)
              ├─{python}(21055)
              ├─{python}(21056)
              ├─{python}(21071)
              ├─{python}(21072)
              ├─{python}(21077)
              ├─{python}(21078)
              ├─{python}(21083)
              ├─{python}(21084)
              ├─{python}(21089)
              ├─{python}(21090)
              ├─{python}(21095)
              ├─{python}(21096)
              ├─{python}(21101)
              ├─{python}(21102)
              ├─{python}(21115)
              ├─{python}(21116)
              ├─{python}(21117)
              ├─{python}(21121)
              ├─{python}(21122)
              ├─{python}(21123)
              └─{python}(21124)

$ ls -l /proc/*/fd/* 2>/dev/null | grep 66717
lr-x------ 1 tom tom 64 Apr  6 15:57 /proc/20971/fd/54 -> pipe:[66717]
l-wx------ 1 tom tom 64 Apr  6 15:57 /proc/20971/fd/55 -> pipe:[66717]
lr-x------ 1 tom tom 64 Apr  6 15:57 /proc/21114/fd/54 -> pipe:[66717]

$ strace -f -p 21114
Process 21114 attached - interrupt to quit
read(54, ^C <unfinished ...>
Process 21114 detached

$ strace -f -p 20971
Process 20971 attached with 26 threads - interrupt to quit
[pid 21124] read(59,  <unfinished ...>
...
[pid 20971] futex(0x4adaca0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 21115] <... select resumed> )      = 0 (Timeout)
[pid 21115] wait4(21114, 0x7fbe58e43730, WNOHANG, NULL) = 0
[pid 21115] select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
[pid 21115] wait4(21114, 0x7fbe58e43730, WNOHANG, NULL) = 0
[pid 21115] select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
... (last two lines repeat until ^C)

$ uname -a
Linux belle 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt25-1~bpo70+1 (2016-04-02) x86_64 GNU/Linux

Is it possible we're hitting an old Python problem that occurs when you use both threading and multiprocessing in the same program?

Some of these might be relevant: It might not be a coincidence that the previous test logged "llfuse thread still alive". Perhaps it goes like this:
  1. test N-1 runs
  2. test N-1 has an "llfuse thread still alive", which we log but otherwise ignore
  3. test N uses multiprocessing.Pool to start a process
  4. something goes wrong because there is >1 thread in the main process when multiprocessing.Pool creates its worker(s)
  5. test N hangs during pool.join() as a result

#36 Updated by Brett Smith about 5 years ago

  • Target version changed from Arvados Future Sprints to 2016-04-27 sprint

Tom should pursue his idea above and see if it helps.

#37 Updated by Tom Clegg about 5 years ago

Deadlock seems to occur immediately after "fusermount -u -z mounttmp":
  • This seems to affect (sometimes) any test that uses MountTestBase or IntegrationTest. Some seem to crash more than others, although this might be determined by test order rather than anything meaningful:
    • tests.test_mount.FuseHomeTest
    • tests.test_mount.FuseMagicTest
    • tests.test_mount.FuseMagicTestPDHOnly
    • tests.test_mount.FuseMagicTestPDHOnly
    • tests.test_mount.FuseMagicTestPDHOnly
    • tests.test_mount.FuseHomeTest
    • tests.test_mount.FuseTagsTest
    • tests.test_mount.FuseMountTest
    • tests.test_mount.FuseWriteFileTest
    • tests.test_mount.FuseProjectMkdirRmdirTest
    • tests.test_mount.FuseWriteFileTest
    • tests.test_mount.FuseMountTest
    • tests.test_mount.FuseWriteFileTest
    • tests.test_mount.FuseTagsTest
    • tests.test_mount.FuseProjectMkdirRmdirTest
    • tests.test_mount.FuseMountTest
    • tests.test_mount.FuseMvFileBetweenCollectionsTest
    • tests.test_mount.FuseHomeTest
    • tests.test_mount.FuseProjectMkdirRmdirTest
    • tests.test_mount.FuseProjectMkdirRmdirTest
    • tests.test_retry.RetryPUT
    • tests.test_mount.FuseProjectMkdirRmdirTest
  • Suspect multiprocessing and threading libraries just don't work together reliably. In particular, forking from a multithreaded program is known to be hard to do correctly: the forked child process can inherit mutexes which were locked by other threads when fork() happened, which means they will never be released.
  • This problem deadlocks all threads, not just our main thread. I tried running a "thread snapshot" thread that dumps a stack trace for every running thread, but the deadlock stopped that thread, too, so I couldn't get a snapshot of the deadlock itself.
Things that don't fix the deadlock-on-unmount problem:
  • Running test helpers, and the fusermount command, with a multiprocessing pool in which all workers are started before any other library gets a chance to start any other threads (see 8345-prefill-pool).
  • Shorter or longer timeout when joining the llfuse thread
  • Abandoning the llfuse thread instead of join()ing it
  • time.sleep() before join()ing the llfuse thread
  • more/fewer log messages near the fusermount call
  • time.sleep() before fusermount -u -z
  • fusermount -u instead of fusermount -u -z
Tried to determine whether this is an interaction between multiple tests (and whether it affects only certain classes) using a test that invokes an IntegrationTest 100x, and a test that invokes a MountTestBase 100x. So far,
  • Ran 2x 100x MountTestBase (with no prior IntegrationTest), no hang
  • Ran 1x 100x IntegrationTest (with no prior MountTestBase), HANG (54)
  • Ran 1x 100x IntegrationTest (with prior MountTestBase), no hang
  • Ran 1x 100x MountTestBase (with no prior IntegrationTest), no hang
  • Ran 1x 100x IntegrationTest (with prior MountTestBase), HANG (62)
  • Ran 3x 100x IntegrationTest (with no prior MountTestBase), no hang

This seems to suggest running after an IntegrationTest makes a test more likely to hang.

Replacing llfuse.listdir() with pool.apply(os.listdir, ...) (5ac2b46 on 8345-pool-listdir) seems to increase the incidence of deadlocks (it's hard to be certain). This is true even with the "don't start workers after starting threads" change in 8345-prefill-pool, which suggests the deadlock is related to using multiprocessing workers, not just starting them.

Upgrading to llfuse 1.0 seems to be a matter of dropping "with llfuse.lock" from Operations.destroy(). Presumably, we should do this and skip 0.42 and 0.43.

So, now using llfuse 1.0 but still having deadlock trouble, we're on 8345-llfuse-1.0 at 494f528

#38 Updated by Tom Clegg almost 5 years ago

This might be related.

https://ci.curoverse.com/job/arvados-api-server/2950/consoleText

2016-04-21 19:13:40 arvados.arvados_fuse[2170] ERROR: Unhandled exception during FUSE operation
Traceback (most recent call last):
  File "/data/1/jenkins/workspace/arvados-api-server/services/fuse/arvados_fuse/__init__.py", line 277, in catch_exceptions_wrapper
    return orig_func(self, *args, **kwargs)
  File "/data/1/jenkins/workspace/arvados-api-server/services/fuse/arvados_fuse/__init__.py", line 483, in forget
    ent = self.inodes[inode]
  File "/data/1/jenkins/workspace/arvados-api-server/services/fuse/arvados_fuse/__init__.py", line 215, in __getitem__
    return self._entries[item]
KeyError: 3L
   INIT: 7.18
   flags=0x00000021
   max_readahead=0x00020000
   max_write=0x00020000
   max_background=0
   congestion_threshold=0
   unique: 1, success, outsize: 40
/var/lib/jenkins/workspace/arvados-api-server/build/run-tests.sh: line 477:  2170 Segmentation fault      "${3}python" setup.py test ${testargs[$1]}

https://pythonhosted.org/llfuse/operations.html says forget() is not allowed to raise exceptions. Unfortunately our generic @catch_exceptions handler didn't tell us what the original exception was.

https://github.com/python-llfuse/python-llfuse/blob/master/developer-notes/lookup_counts.rst suggests forget() is supposed to do its own locking:

    def forget(self, inode):
        with lookup_lock:
            lookup_count[inode] -= 1
            if lookup_count[inode] > 0:
                return
            del lookup_count[inode]

        self.maybe_remove_inode(inode)

But we don't seem to do any such locking:

    @catch_exceptions
    def forget(self, inodes):
        if self._shutdown_started.is_set():
            return
        for inode, nlookup in inodes:
            ent = self.inodes[inode]
            _logger.debug("arv-mount forget: inode %i nlookup %i ref_count %i", inode, nlookup, ent.ref_count)
            if ent.dec_ref(nlookup) == 0 and ent.dead:
                self.inodes.del_entry(ent)

OTOH, it looks like it's already called from a "with lock:" block, so it seems odd that we'd need our own locking too: https://github.com/python-llfuse/python-llfuse/blob/master/src/handlers.pxi#L51-L58

cdef void fuse_forget (fuse_req_t req, fuse_ino_t ino,
                       ulong_t nlookup) with gil:
    try:
        with lock:
            operations.forget([(ino, nlookup)])
    except:
        handle_exc(NULL)
    fuse_reply_none(req)

Even if forget() is only called with llfuse.lock acquired, perhaps some other code path accesses the inode cache with llfuse lock released -- and this sort of path is anticipated by the llfuse, motivating the extra locking in the example code?

#39 Updated by Brett Smith almost 5 years ago

Tom Clegg wrote:

https://pythonhosted.org/llfuse/operations.html says forget() is not allowed to raise exceptions. Unfortunately our generic @catch_exceptions handler didn't tell us what the original exception was.

I'm not sure I follow. Why isn't the KeyError you pasted the original exception? It looks like that would've been caught by the bare except: block, and logged with _logger.exception().

#40 Updated by Tom Clegg almost 5 years ago

  • Target version deleted (2016-04-27 sprint)

#41 Updated by Tom Clegg almost 5 years ago

Brett Smith wrote:

I'm not sure I follow. Why isn't the KeyError you pasted the original exception? It looks like that would've been caught by the bare except: block, and logged with _logger.exception().

Aha, you're right. I was just looking at the code, which doesn't pass the caught exception to logger.exception. But logger.exception prefers to figure out the current exception using magic, rather than having one passed to it. So yes, KeyError: 3L really is the original exception.

#42 Updated by Brett Smith almost 5 years ago

Tom Clegg wrote:

But logger.exception prefers to figure out the current exception using magic, rather than having one passed to it.

Right. And the magic is as simple as "call sys.exc_info()," so... not that magical. :)

#43 Updated by Tom Clegg over 4 years ago

Rebased to master (now 8345-llfuse-1.0-rebased). Ran 5 test runs in a row. No deadlock.

I suspect 62d8f575a93374a235097b88499fa65eef2f56ec (from #8460) fixed the deadlock we were seeing here.

#44 Updated by Tom Clegg over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 67 to 100

Applied in changeset arvados|commit:bfe7a62407aa0a6bc078659f259468f9b4105581.

Also available in: Atom PDF