Bug #9986

[FUSE] [Testing] arv-mount tests deadlock frequently

Added by Tom Clegg over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
FUSE
Target version:
Start date:
09/13/2016
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Subtasks

Task #10017: Review 9986-fuse-test-deadlockResolvedTom Clegg

Associated revisions

Revision 503c686b
Added by Tom Clegg over 4 years ago

Merge branch '9986-fuse-test-deadlock' closes #9986

Revision 3bf898db
Added by Tom Clegg over 4 years ago

Merge branch '9986-fuse-retry-if-killed' closes #9986

History

#1 Updated by Tom Clegg over 4 years ago

It seems that when the llfuse thread doesn't end (which we notice and log), the next test case will/might deadlock.

#2 Updated by Tom Clegg over 4 years ago

  • Category set to FUSE

#3 Updated by Tom Clegg over 4 years ago

Work in progress, 9986-fuse-test-deadlock @ ed8f444
  • avoids calling multiprocessing.Pool.terminate()
  • abandons the entire test suite with "kill -9 self" (rather than risk deadlock) when the llfuse thread doesn't join in a reasonable time after a test
  • uses one shared multiprocessing.Pool(1, maxtasksperchild=1) for all integration tests instead of creating a new pool for each test case
  • sometimes prints a bunch of harmless stderr1 when exiting tests, seemingly due to a multiprocessing shutdown race (this is a bit ugly, but preferable to deadlock)

1

Exception TypeError: TypeError("'NoneType' object does not support item deletion",) in <Finalize object, dead> ignored

#4 Updated by Lucas Di Pentima over 4 years ago

Some questions/observations:

  • File mount_test_base.py:
    • Why adding code to wait for additional 10 secs instead of just waiting up to 11 secs from the beginning? (line 76)
    • If additional 10 secs wait is needed separated from the initial 1 sec: wouldn't be more correct to put the is_alive() check on the same block as the join() call?
  • File integration_test.py:
    • Is the conditional if m.llfuse_thread.is_alive() ever reached by being after the return statement on the context manager block? Maybe I’m missing something about the decorators behaviour. (line 81)

#5 Updated by Tom Clegg over 4 years ago

Lucas Di Pentima wrote:

  • Why adding code to wait for additional 10 secs instead of just waiting up to 11 secs from the beginning? (line 76)

Mainly to expose whether it ever takes >1s to join (as opposed to "always either joins fast, or deadlocks"). But doing one join() and printing the actual time taken would make more sense. Updated.

  • Is the conditional if m.llfuse_thread.is_alive() ever reached by being after the return statement on the context manager block? Maybe I’m missing something about the decorators behaviour. (line 81)

Oops, you're right. Moved this to a "finally". Also fixed the IntegrationTest __enter__ so the "with ... as m" construct actually works as intended.

4cb8583

#6 Updated by Lucas Di Pentima over 4 years ago

LGTM, please merge.

#7 Updated by Tom Clegg over 4 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:503c686bc80825d00980a970af69ec60f9e6ce9b.

#8 Updated by Tom Clegg over 4 years ago

  • Status changed from Resolved to In Progress
  • Assigned To set to Tom Clegg

Reopening to do "retry after deadlock+sigkill".

#9 Updated by Tom Clegg over 4 years ago

9986-fuse-retry-if-killed @ e089b17

#10 Updated by Tom Clegg over 4 years ago

  • Target version set to 2016-09-28 sprint

#11 Updated by Lucas Di Pentima over 4 years ago

Looks good to me. If this is permanent, maybe the function should be called "do_test_once_sometimes()" :)

#12 Updated by Tom Clegg over 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:3bf898db1a6f0db043060cd601131b17bd6ef82d.

Also available in: Atom PDF