Bug #12134

closed

[arv-mount] Fix test deadlock by using new llfuse in test suite

Added by Tom Clegg over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: FUSE
Target version:
Story points: -

Description

Peter's patch for the thread-cancel bug has been merged upstream, but there is no new release yet.

Instead of waiting for a new release, we can update run-tests.sh to build its own version ("1.2.1arvados1"?) from source and install it into $VENVDIR during the dependencies/setup phase. Then "python setup.py install" in services/fuse should use that custom version instead of going to pypi, and we should see no more test suite deadlocks.

This is the relevant fix:

https://bitbucket.org/nikratio/python-llfuse/commits/8aab6579089bb0b07423a41bca84c4654b2f9b81
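A minimal sketch of the dependency-phase step described above, written as a helper that only assembles the commands. It assumes a source checkout of python-llfuse that already contains the fix and a virtualenv with Cython installed; the helper name and the checkout path are hypothetical, and the actual change to run-tests.sh may differ.

```python
import os


def llfuse_build_commands(venv_dir, src_dir):
    """Return (argv, cwd) pairs that build llfuse from source into a virtualenv.

    Assumptions (not from the ticket): src_dir is a checkout of
    python-llfuse that already contains the fix, and Cython is
    installed in the virtualenv at venv_dir.
    """
    python = os.path.join(venv_dir, "bin", "python")
    return [
        # Regenerate the C sources from the Cython files.
        ([python, "setup.py", "build_cython"], src_dir),
        # Install into the virtualenv. A later "python setup.py install"
        # in services/fuse should then find llfuse already present,
        # provided the installed version satisfies its requirement.
        ([python, "setup.py", "install"], src_dir),
    ]
```

Each pair can be passed to `subprocess.check_call(argv, cwd=cwd)`, with `venv_dir` taken from `$VENVDIR`.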


Subtasks 2 (0 open, 2 closed)

Task #12143: Exercise test suite with patched llfuse (Resolved, Tom Clegg, 08/16/2017)
Task #12187: Review 12134-llfuse-patch (Resolved, Tom Clegg, 08/16/2017)

Related issues 2 (0 open, 2 closed)

Related to Arvados - Idea #8345: [FUSE] Support llfuse 0.42+ (Resolved, Tom Clegg, 02/03/2016)
Related to Arvados - Bug #10805: [FUSE] Upgrade llfuse to 1.2, fix deadlock in test suite (Resolved, Tom Clegg, 01/04/2017)
Actions #1

Updated by Tom Clegg over 7 years ago

Even with new llfuse, test_replace (tests.test_unmount.UnmountTest) hangs sometimes, but it wakes up if I look in /proc/self/mountinfo for stray fuse mounts and hit them with "echo 1 > /sys/fs/fuse/connections/NNN/abort".
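The manual lookup above could be scripted roughly as follows. This is a sketch, not part of the test suite: the function names are hypothetical, and it assumes that the NNN directory name under /sys/fs/fuse/connections matches the minor device number shown for the mount in /proc/self/mountinfo.

```python
def find_fuse_mounts(mountinfo_text):
    """Return (connection_id, mount_point) for each FUSE mount listed.

    mountinfo_text is the content of /proc/self/mountinfo. The
    connection id is taken from the minor device number (an assumption).
    """
    mounts = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # The "-" separator follows the optional fields, so it can
            # appear no earlier than index 6.
            sep = fields.index("-", 6)
        except ValueError:
            continue  # malformed or empty line
        fstype = fields[sep + 1]
        if fstype != "fuse" and not fstype.startswith("fuse."):
            continue  # skips fusectl and non-FUSE filesystems
        _major, minor = fields[2].split(":")
        mounts.append((int(minor), fields[4]))
    return mounts


def abort_path(connection_id):
    """Path of the control file that aborts the given FUSE connection."""
    return "/sys/fs/fuse/connections/%d/abort" % connection_id
```

Writing "1" to the path returned by `abort_path` is the scripted equivalent of the `echo` command quoted above.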

Actions #2

Updated by Tom Clegg over 7 years ago

  • Status changed from New to In Progress
Actions #3

Updated by Tom Clegg over 7 years ago

Tests hung on test_tmp_rewrite (tests.test_tmp_collection.TmpCollectionTest). Destroying fuse mounts had no effect, but SIGKILLing the innermost python program revived the test run.

Again in test_tmp_snapshots (tests.test_tmp_collection.TmpCollectionTest).

14694 tom       20   0   11896   3872   2900 S   0.0  0.1   0:00.08                          `- run-tests.sh
 9248 tom       20   0 2728408 202196  15776 S   0.3  5.0   2:43.15                              `- python              
 9404 tom       20   0   16956   1044    916 S   0.0  0.0   0:00.01                                  `- sed
 9405 tom       20   0  497612  11512   5820 S   0.0  0.3   0:02.04                                  `- keepstore
 9416 tom       20   0   16956   1080    948 S   0.0  0.0   0:00.00                                  `- sed
 9417 tom       20   0  503440   9632   5928 S   0.0  0.2   0:01.86                                  `- keepstore
 4799 tom       20   0 2497948 186832   2860 S   0.0  4.6   0:00.00                                  `- python

9396 (one of 9248's threads) is doing waitpid(4799, ...), sleep 100ms, repeat.

4799 is waiting in futex().

Killing 4799 just resulted in a new process in the same place doing the same thing, whether or not I had aborted the fuse mounts. Had to kill 9248 to abort the test suite. Judging by strace, 9396 was the multiprocessing module maintaining its 1-worker pool. I suspect this behavior (multiprocessing keeping an idle worker pool alive) is fine; the problem is just that some other thread is deadlocked. (But it is possible that the way multiprocessing keeps a worker pool alive is helping that other thread get deadlocked.)
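For illustration, the waitpid/sleep pattern seen in strace looks roughly like the loop below. This is a sketch of the polling technique only, with a hypothetical function name; it is not the multiprocessing module's own code, which additionally starts a replacement worker when one exits.

```python
import os
import time


def wait_for_exit(pid, interval=0.1):
    """Poll a child process until it exits, sleeping between checks.

    Returns the exit code, or the negated signal number if the child
    was killed by a signal.
    """
    while True:
        # WNOHANG makes waitpid return (0, 0) immediately while the
        # child is still running, instead of blocking.
        reaped, status = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return -os.WTERMSIG(status)
        time.sleep(interval)
```

A thread running such a loop stays busy in strace even when nothing is wrong, which fits the suspicion that the real deadlock is in another thread.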

Actions #4

Updated by Tom Clegg over 7 years ago

----------------------------------------------------------------------
Ran 83 tests in 292.678s

OK
Exception in thread Thread-83:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 366, in _handle_tasks
    debug('task handler got sentinel')
TypeError: 'NoneType' object is not callable

Exception in thread Thread-82:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 330, in _handle_workers
    debug('worker handler exiting')
TypeError: 'NoneType' object is not callable

/home/tom/src/arvados/build/run-tests.sh: line 574: 26482 Segmentation fault      "${3}python" setup.py ${short:+--short-tests-only} test ${testargs[$1]}
Actions #5

Updated by Tom Clegg over 7 years ago

at commit:7c92bc77b, ran fuse tests 50 times in a row with no deadlocks, interrupted by this (non-deadlocking) failure:

======================================================================
FAIL: runTest (tests.test_mount.FuseModifyFileTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/src/arvados/services/fuse/tests/test_mount.py", line 348, in runTest
    self.pool.apply(fuseModifyFileTestHelperReadEndContents, (self.mounttmp,))
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 244, in apply
    return self.apply_async(func, args, kwds).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
AssertionError: 'plnp' != 'blub'

----------------------------------------------------------------------
Ran 83 tests in 317.388s

12134-llfuse-patch @ 99c19e1539aabb8053ee9221f62744bf76d63737

Actions #6

Updated by Tom Clegg over 7 years ago

  • Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint
Actions #7

Updated by Lucas Di Pentima over 7 years ago

Tried to run the tests repeatedly on my VM, but llfuse requires Cython 0.24+ and Debian Jessie doesn't have it.
The updates LGTM, but I don't know if this would create problems for the rest of the team. For example, I know Peter runs tests using arvbox.

Actions #8

Updated by Tom Clegg over 7 years ago

  • Status changed from In Progress to Feedback
Actions #9

Updated by Tom Clegg over 7 years ago

We'll have to update arvbox to Debian stretch for it to pass the tests.

Actions #10

Updated by Tom Clegg over 7 years ago

  • Status changed from Feedback to Resolved

No test deadlocks since merge.
