Bug #12061

[Node manager] Fix flaky integration tests

Added by Tom Clegg about 2 years ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Tests
Target version:
Start date:
08/02/2017
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release:
Release relationship:
Auto

Description

Fix flaky nodemanager tests. Some examples below:

test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/515/consoleText
test_single_node_gce failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/517/consoleText
test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/521/consoleText
test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/512/consoleText
test_single_node_aws failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/522/consoleText
test_single_node_aws failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/526/consoleText

Also this from #9984:

https://ci.curoverse.com/job/run-tests-remainder/102/consoleFull

FAIL: test_shutdown_declined_when_idle_and_job_queued (tests.test_daemon.NodeManagerDaemonActorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/1/jenkins/workspace/run-tests-remainder/services/nodemanager/tests/test_daemon.py", line 499, in test_shutdown_declined_when_idle_and_job_queued
self.assertFalse(self.node_shutdown.start.called)
AssertionError: True is not false


Subtasks

Task #12065: Review 12061-nm-integration-testsClosedTom Clegg


Related issues

Related to Arvados - Bug #9984: [Testing] test_shutdown_declined_when_idle_and_job_queued flakyClosed

Associated revisions

Revision 08f0ba55
Added by Tom Clegg about 1 year ago

Merge branch '12061-flaky-tests'

refs #12061

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 78352324
Added by Tom Clegg about 1 year ago

Merge branch '12061-nm-integration-tests'

refs #12061

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

History

#1 Updated by Peter Amstutz about 2 years ago

  • Target version changed from 2017-08-02 sprint to 2017-08-16 sprint

#2 Updated by Peter Amstutz about 2 years ago

  • Assigned To set to Peter Amstutz

#3 Updated by Tom Morris about 2 years ago

  • Target version changed from 2017-08-16 sprint to 2017-08-30 Sprint

#4 Updated by Peter Amstutz almost 2 years ago

  • Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint

#5 Updated by Peter Amstutz almost 2 years ago

  • Target version changed from 2017-09-13 Sprint to Arvados Future Sprints

#6 Updated by Tom Morris almost 2 years ago

  • Assigned To deleted (Peter Amstutz)

#7 Updated by Tom Morris over 1 year ago

  • Description updated (diff)
  • Target version changed from Arvados Future Sprints to 2018-03-28 Sprint

#8 Updated by Tom Morris over 1 year ago

  • Description updated (diff)

#9 Updated by Tom Morris over 1 year ago

  • Related to Bug #9984: [Testing] test_shutdown_declined_when_idle_and_job_queued flaky added

#10 Updated by Peter Amstutz over 1 year ago

  • Assigned To set to Peter Amstutz

#11 Updated by Peter Amstutz over 1 year ago

  • Assigned To changed from Peter Amstutz to Lucas Di Pentima

#12 Updated by Lucas Di Pentima over 1 year ago

  • Target version changed from 2018-03-28 Sprint to 2018-04-11 Sprint

#13 Updated by Lucas Di Pentima over 1 year ago

  • Status changed from New to In Progress

#14 Updated by Lucas Di Pentima over 1 year ago

  • Target version changed from 2018-04-11 Sprint to 2018-04-25 Sprint

#15 Updated by Lucas Di Pentima over 1 year ago

  • Status changed from In Progress to New

#16 Updated by Tom Morris over 1 year ago

  • Target version changed from 2018-04-25 Sprint to 2018-05-09 Sprint

#17 Updated by Lucas Di Pentima over 1 year ago

  • Status changed from New to In Progress

#18 Updated by Lucas Di Pentima over 1 year ago

This is what I've observed:

  • Lastest sunday reports don’t show a clear pattern in nodemanager's test failures. Integration tests weren't the main failing kind of tests lately so I focused on unit tests.
  • I ran nodemanager’s unit tests locally and got mixed failure modes, sometimes they run ok 30 times in a row, sometimes they fail after a few runs.
  • The most observed piece of code that fails seems to be the busywait() call on test_daemon.py
    • At first, tried different sleep values to see if that made a difference, but didn’t.
    • From these failures, it seems to me that the ones that fail the most are those that pass paired_monitor_count() to busywait()
    • paired_monitor_count() calls monitored_arvados_nodes(False) which ignores ActorDeadError exceptions, so I thought that could be hiding some error but when I ran the tests without that try:except: code, it didn’t seem to make a difference.
  • My suspicion is that monitored_arvados_nodes(False) is not working reliably as of the 41 busywait() calls on test_daemon.py, only 12 use it indirectly via paired_monitor_count(), but most of the failures come from tests that call busywait() with it as an argument.

#19 Updated by Lucas Di Pentima over 1 year ago

  • Target version changed from 2018-05-09 Sprint to 2018-05-23 Sprint

#20 Updated by Lucas Di Pentima over 1 year ago

  • Target version changed from 2018-05-23 Sprint to 2018-06-06 Sprint

#21 Updated by Lucas Di Pentima over 1 year ago

  • Assigned To changed from Lucas Di Pentima to Tom Clegg

#22 Updated by Tom Clegg about 1 year ago

The integration tests have never been reliable, are deathly slow, and don't report failures usefully. I propose we admit that it was a mistake to insert them into our test pipeline, and skip them.

The regular tests seem reliable after this fix:

#23 Updated by Tom Clegg about 1 year ago

Well, nearly reliable. ~1/300 failure rate:

======================================================================
FAIL: test_nonfatal_error (tests.test_failure.ActorUnhandledExceptionTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/arvados/services/nodemanager/tests/test_failure.py", line 55, in test_nonfatal_error
    self.assertEqual(1, status.tracker.get('actor_exceptions'))
AssertionError: 1 != 0

I expect this is a separate bug.

#24 Updated by Lucas Di Pentima about 1 year ago

Fix at d8dfc75ec5c6cead3da0f3907466ce1b89373b69 lgtm, what a tricky hidden bug!

#25 Updated by Tom Clegg about 1 year ago

12061-flaky-tests @ 2f9761bd08293de1847d3dbd0ab16482eacc414d
  • fixes sporadic test_nonfatal_error failures.

#26 Updated by Tom Clegg about 1 year ago

12061-nm-integration-tests @ 32568bbde8ad549452ccb57f10bb54672fda6fa6

With a few timeout adjustments, the integration tests complete in about 35 seconds (when they don't fail randomly). The biggest slowdowns were
  • 1-minute shutdown window closure. 0 is unworkable because nodes become eligible for shutdown immediately before they can be marked busy. Updated config to allow fractional window sizes like 0.05 minutes.
  • Needlessly long 5-second poll interval in tests. Likewise, fractions are allowed now, tests use 0.5 seconds.
  • Occasional long delay between SIGTERM and actual exit. Now uses SIGKILL instead.

They're still unreliable and fragile, though. I've disabled them in run-tests; you can still run them with --only services/nodemanager_integration.

Suggested requirements for making this suite useful:
  • Use the Python test framework, so they can be invoked the usual way, and --test-suite=... can be used to select individual tests, etc.
  • Name the "action" handlers according to what they do (e.g., a func that changes all nodes to idle state should not be called node_busy()).
  • When starting a nodemanager process, use the nodemanager version in the current test dir, not the last version that got installed in $PATH.
  • Fix underlying racy behavior. The basic problem seems to be an assumption that "NM logs something; test program updates something; NM sees the new environment; NM makes next decision" can work, vs. the reality that NM has many threads whose timing is independent of one another and of the test program's stub updates. For example, shutdown-because-state=down "shouldn't happen" according to the tests, but of course it does sometimes -- the test program doesn't always have time to see "shutdown success" and update the sinfo stub before nodemanager invokes it again, and when nodemanager wins this race, it correctly decides to reattempt shutdown. It seems like there's an endless supply of similar test-killing races here.

#27 Updated by Lucas Di Pentima about 1 year ago

Just.. wow! the speed improvement is great! Very clever approach, this LGTM! thanks so much!

#28 Updated by Tom Clegg about 1 year ago

  • Status changed from In Progress to Resolved

#29 Updated by Tom Morris about 1 year ago

  • Release set to 13

Also available in: Atom PDF