Project

General

Profile

Actions

Bug #9986

closed

[FUSE] [Testing] arv-mount tests deadlock frequently

Added by Tom Clegg over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
FUSE
Target version:
Story points:
-

Subtasks 1 (0 open1 closed)

Task #10017: Review 9986-fuse-test-deadlockResolvedTom Clegg09/13/2016Actions
Actions #1

Updated by Tom Clegg over 7 years ago

It seems that when the llfuse thread doesn't end (which we notice and log), the next test case will/might deadlock.

Actions #2

Updated by Tom Clegg over 7 years ago

  • Category set to FUSE
Actions #3

Updated by Tom Clegg over 7 years ago

Work in progress, 9986-fuse-test-deadlock @ ed8f444
  • avoids calling multiprocessing.Pool.terminate()
  • abandons the entire test suite with "kill -9 self" (rather than risk deadlock) when the llfuse thread doesn't join in a reasonable time after a test
  • uses one shared multiprocessing.Pool(1, maxtasksperchild=1) for all integration tests instead of creating a new pool for each test case
  • sometimes prints a bunch of harmless stderr1 when exiting tests, seemingly due to a multiprocessing shutdown race (this is a bit ugly, but preferable to deadlock)

1

Exception TypeError: TypeError("'NoneType' object does not support item deletion",) in <Finalize object, dead> ignored

Actions #4

Updated by Lucas Di Pentima over 7 years ago

Some questions/observations:

  • File mount_test_base.py:
    • Why adding code to wait for additional 10 secs instead of just waiting up to 11 secs from the beginning? (line 76)
    • If additional 10 secs wait is needed separated from the initial 1 sec: wouldn't be more correct to put the is_alive() check on the same block as the join() call?
  • File integration_test.py:
    • Is the conditional if m.llfuse_thread.is_alive() ever reached by being after the return statement on the context manager block? Maybe I’m missing something about the decorators behaviour. (line 81)
Actions #5

Updated by Tom Clegg over 7 years ago

Lucas Di Pentima wrote:

  • Why adding code to wait for additional 10 secs instead of just waiting up to 11 secs from the beginning? (line 76)

Mainly to expose whether it ever takes >1s to join (as opposed to "always either joins fast, or deadlocks"). But doing one join() and printing the actual time taken would make more sense. Updated.

  • Is the conditional if m.llfuse_thread.is_alive() ever reached by being after the return statement on the context manager block? Maybe I’m missing something about the decorators behaviour. (line 81)

Oops, you're right. Moved this to a "finally". Also fixed the IntegrationTest __enter__ so the "with ... as m" construct actually works as intended.

4cb8583

Actions #6

Updated by Lucas Di Pentima over 7 years ago

LGTM, please merge.

Actions #7

Updated by Tom Clegg over 7 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:503c686bc80825d00980a970af69ec60f9e6ce9b.

Actions #8

Updated by Tom Clegg over 7 years ago

  • Status changed from Resolved to In Progress
  • Assigned To set to Tom Clegg

Reopening to do "retry after deadlock+sigkill".

Actions #9

Updated by Tom Clegg over 7 years ago

9986-fuse-retry-if-killed @ e089b17

Actions #10

Updated by Tom Clegg over 7 years ago

  • Target version set to 2016-09-28 sprint
Actions #11

Updated by Lucas Di Pentima over 7 years ago

Looks good to me. If this is permanent, maybe the function should be called "do_test_once_sometimes()" :)

Actions #12

Updated by Tom Clegg over 7 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:3bf898db1a6f0db043060cd601131b17bd6ef82d.

Actions

Also available in: Atom PDF