Bug #11925

[Nodemanager] Fix unit tests

Added by Peter Amstutz over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 07/28/2017
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -

Subtasks

Task #12001: Review 11925-nodemanager-watchdog-test (Resolved, Peter Amstutz)

Associated revisions

Revision 0413abf9
Added by Peter Amstutz over 3 years ago

Merge branch '11925-nodemanager-watchdog-test' refs #11925

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <>

History

#1 Updated by Tom Morris over 3 years ago

  • Assigned To set to Peter Amstutz

#2 Updated by Tom Morris over 3 years ago

  • Target version changed from 2017-07-19 sprint to 2017-08-02 sprint

#3 Updated by Peter Amstutz over 3 years ago

11925-nodemanager-watchdog-test @ 97bb15198ff6071d656d461b27e1055d84826d36

#4 Updated by Radhika Chippada over 3 years ago

The WatchdogActorTest still fails for me in my dev env, as before (this is the only nodemanager test that always fails for me locally):

======================================================================
FAIL: test_time_timout (tests.test_failure.WatchdogActorTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/tmp.ZsPoCMCUKb/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "/home/rc/arvados/services/nodemanager/tests/test_failure.py", line 58, in test_time_timout
    self.assertTrue(kill_mock.called)
AssertionError: False is not true

#5 Updated by Peter Amstutz over 3 years ago

11925-nodemanager-watchdog-test @ f313294f95e55f595ace70e2a614557c0428f2da

The fix (adding an extra wait to the test) is a bit of a hack but it is the only thing I've tried that seems to work.
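
For illustration only (this is not the code on the branch): one common alternative to a fixed extra wait is to poll for the expected condition with a deadline, so the test returns as soon as the mock is called and only fails after the full timeout. The helper name `wait_until` and the `threading.Timer` stand-in for the watchdog are assumptions made for this sketch, and it uses Python 3's `unittest.mock` rather than the `mock` package seen in the traceback.

```python
import threading
import time
from unittest import mock


def wait_until(predicate, timeout=2.0, interval=0.01):
    """Poll predicate() until it returns true or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())  # one final check at the deadline


# Hypothetical stand-in for the watchdog: another thread calls the mock
# a little later, as the actor under test would.
kill_mock = mock.Mock()
worker = threading.Timer(0.05, kill_mock)
worker.start()

# Instead of a fixed sleep followed by assertTrue(kill_mock.called):
called_in_time = wait_until(lambda: kill_mock.called)
worker.join()
```

A polling wait is still time-based, so it narrows the race rather than removing it; it mainly avoids paying the full delay on every run.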

#6 Updated by Lucas Di Pentima over 3 years ago

I don't know if it's related, but I'm seeing this test fail almost every run:

======================================================================
ERROR: test_arvados_node_not_cleaned_after_shutdown_cancelled (tests.test_computenode_dispatch_slurm.SLURMComputeNodeShutdownActorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "/home/lucas/arvados_local/services/nodemanager/tests/test_computenode_dispatch.py", line 244, in test_arvados_node_not_cleaned_after_shutdown_cancelled
    self.check_success_flag(False, 2)
  File "/home/lucas/arvados_local/services/nodemanager/tests/test_computenode_dispatch.py", line 200, in check_success_flag
    last_flag = self.shutdown_actor.success.get(self.TIMEOUT)
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/threading.py", line 52, in get
    compat.reraise(*self._data['exc_info'])
  File "/home/lucas/arvados_local/tmp/VENVDIR/local/lib/python2.7/site-packages/pykka/compat.py", line 12, in reraise
    exec('raise tp, value, tb')
  File "<string>", line 1, in <module>
ActorDeadError: ComputeNodeShutdownActor (urn:uuid:21137c1c-a77b-4d58-afd0-749333685eba) stopped before handling the message

#7 Updated by Peter Amstutz over 3 years ago

Some structural reasons for failing node manager tests:

  1. Tests were written with the assumption that certain communication between actors (threads) was synchronous, which provided some sequencing. That assumption stopped holding in #8543, which changed the majority of messaging from synchronous to asynchronous.
  2. Code that relies on changing the behavior of mocks on the fly has to be carefully synchronized to ensure that the change takes effect without racing with the code that's about to call the mock.
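
A minimal sketch of both points, assuming a toy actor built from the standard library instead of pykka. `TinyActor`, `cloud`, and `destroy_node` are hypothetical names used only for this illustration. The asynchronous `tell()` returns immediately, so the test waits on a barrier before it changes the mock's behavior or checks results.

```python
import queue
import threading
from unittest import mock


class TinyActor(threading.Thread):
    """Hypothetical stand-in for an actor: one thread draining an inbox in order."""

    def __init__(self, handler):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.handler = handler

    def run(self):
        while True:
            message, done = self.inbox.get()
            if message is None:       # stop sentinel
                done.set()
                return
            self.handler(message)
            done.set()                # signal that this message was handled

    def tell(self, message):
        """Asynchronous send: returns immediately with an Event to wait on."""
        done = threading.Event()
        self.inbox.put((message, done))
        return done

    def stop(self):
        self.tell(None).wait(2)


cloud = mock.Mock()
cloud.destroy_node.return_value = False
results = []
actor = TinyActor(lambda node: results.append(cloud.destroy_node(node)))
actor.start()

first = actor.tell("node-1")
# Barrier: wait until the first message has been handled before touching the
# mock, so the behavior change cannot race with the call already in flight.
assert first.wait(2)
cloud.destroy_node.return_value = True
assert actor.tell("node-2").wait(2)
actor.stop()
```

Without the first `wait()`, the actor thread could handle "node-1" either before or after `return_value` is changed, which is the kind of intermittent failure described above.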

#8 Updated by Peter Amstutz over 3 years ago

11925-nodemanager-watchdog-test @ 0ac98ea67157ab1a6d92b02e59b8491d90dd1f79

Fixes flaky tests in test_computenode_dispatch_slurm (passed 30 times in a row with no failures).

#9 Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2017-08-02 sprint to 2017-08-16 sprint

#10 Updated by Peter Amstutz over 3 years ago

  • Status changed from New to In Progress

#11 Updated by Peter Amstutz over 3 years ago

  • Subject changed from [Nodemanager] Fix watchdog test to [Nodemanager] Fix unit tests

======================================================================
ERROR: test_arvados_node_not_cleaned_after_shutdown_cancelled (tests.test_computenode_dispatch.ComputeNodeShutdownActorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/src/arvados/services/nodemanager/tests/test_computenode_dispatch.py", line 245, in test_arvados_node_not_cleaned_after_shutdown_cancelled
    self.check_success_flag(False, 2)
  File "/usr/src/arvados/services/nodemanager/tests/test_computenode_dispatch.py", line 200, in check_success_flag
    last_flag = self.shutdown_actor.success.get(self.TIMEOUT)
  File "/var/lib/arvados/test/VENVDIR/local/lib/python2.7/site-packages/pykka/threading.py", line 52, in get
    compat.reraise(*self._data['exc_info'])
  File "/var/lib/arvados/test/VENVDIR/local/lib/python2.7/site-packages/pykka/compat.py", line 12, in reraise
    exec('raise tp, value, tb')
  File "/var/lib/arvados/test/VENVDIR/local/lib/python2.7/site-packages/pykka/actor.py", line 431, in ask
    self.tell(message)
  File "/var/lib/arvados/test/VENVDIR/local/lib/python2.7/site-packages/pykka/actor.py", line 398, in tell
    raise ActorDeadError('%s not found' % self)
ActorDeadError: ComputeNodeShutdownActor (urn:uuid:2495c712-d4fe-4801-8f7d-2a17afef3d25) not found

======================================================================
ERROR: test_arvados_node_not_cleaned_after_shutdown_cancelled (tests.test_computenode_dispatch_slurm.SLURMComputeNodeShutdownActorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/arvados/test/VENVDIR/local/lib/python2.7/site-packages/mock/mock.py", line 1305, in patched
    return func(*args, **keywargs)
  File "/usr/src/arvados/services/nodemanager/tests/test_computenode_dispatch.py", line 244, in test_arvados_node_not_cleaned_after_shutdown_cancelled
    self.shutdown_actor.ping().get(self.TIMEOUT)
  File "/var/lib/arvados/test/VENVDIR/local/lib/python2.7/site-packages/pykka/threading.py", line 52, in get
    compat.reraise(*self._data['exc_info'])
  File "/var/lib/arvados/test/VENVDIR/local/lib/python2.7/site-packages/pykka/compat.py", line 12, in reraise
    exec('raise tp, value, tb')
  File "<string>", line 1, in <module>
ActorDeadError: ComputeNodeShutdownActor (urn:uuid:04717cf8-b0f1-4def-87ed-a7ff786a83c4) stopped before handling the message

#12 Updated by Peter Amstutz over 3 years ago

Another systemic issue:

Python mocks seem to be unreliable when mock.patch() shadows a builtin function (like time.time()) that is then accessed across threads. The patched name is sometimes arbitrarily reset to its original value, even though the mock teardown shouldn't have executed yet. The solution seems to be to pass an explicit mock function into the code under test instead of relying on mock.patch().
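
As a rough sketch of that approach (not the actual node manager code): the class below takes its clock and kill function as constructor arguments, so a test can hand in mocks directly without patching `time.time` globally. The `Watchdog` class and its parameter names are hypothetical, and the example uses Python 3's `unittest.mock`.

```python
import time
from unittest import mock


class Watchdog:
    """Hypothetical watchdog: calls killfunc when no ping arrived for `timeout` seconds."""

    def __init__(self, timeout, killfunc, timefunc=time.time):
        self.timeout = timeout
        self.killfunc = killfunc
        self.timefunc = timefunc      # injected clock; defaults to the real one
        self.last_ping = timefunc()

    def ping(self):
        self.last_ping = self.timefunc()

    def check(self):
        if self.timefunc() - self.last_ping > self.timeout:
            self.killfunc()
            return True
        return False


# The test owns both mocks; nothing global is patched, so no other thread
# can observe the real time.time() by accident.
fake_time = mock.Mock(return_value=0)
kill_mock = mock.Mock()
dog = Watchdog(timeout=10, killfunc=kill_mock, timefunc=fake_time)

fake_time.return_value = 5
fired_early = dog.check()   # 5 seconds since last ping: below the timeout
fake_time.return_value = 11
fired_late = dog.check()    # 11 seconds since last ping: past the timeout
```

Because the object holds its own reference to the fake clock, the result no longer depends on when a patch is installed or torn down relative to the other threads.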

#13 Updated by Lucas Di Pentima over 3 years ago

Updates @ 597b742a6 LGTM.

Some tests in test_daemon.py still fail once in a while; the rest seem to be reliable.

#14 Updated by Tom Morris over 3 years ago

  • Target version changed from 2017-08-16 sprint to 2017-08-30 Sprint

#15 Updated by Peter Amstutz over 3 years ago

  • Status changed from In Progress to Resolved
