Bug #12061
closed[Node manager] Fix flaky integration tests
Description
Fix flaky nodemanager tests. Some examples below:
test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/515/consoleText
test_single_node_gce failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/517/consoleText
test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/521/consoleText
test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/512/consoleText
test_single_node_aws failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/522/consoleText
test_single_node_aws failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/526/consoleText
Also this from #9984:
https://ci.curoverse.com/job/run-tests-remainder/102/consoleFull
FAIL: test_shutdown_declined_when_idle_and_job_queued (tests.test_daemon.NodeManagerDaemonActorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/1/jenkins/workspace/run-tests-remainder/services/nodemanager/tests/test_daemon.py", line 499, in test_shutdown_declined_when_idle_and_job_queued
self.assertFalse(self.node_shutdown.start.called)
AssertionError: True is not false
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-08-02 sprint to 2017-08-16 sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-08-16 sprint to 2017-08-30 Sprint
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-09-13 Sprint to Arvados Future Sprints
Updated by Tom Morris almost 7 years ago
- Description updated (diff)
- Target version changed from Arvados Future Sprints to 2018-03-28 Sprint
Updated by Tom Morris almost 7 years ago
- Related to Bug #9984: [Testing] test_shutdown_declined_when_idle_and_job_queued flaky added
Updated by Peter Amstutz almost 7 years ago
- Assigned To changed from Peter Amstutz to Lucas Di Pentima
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-03-28 Sprint to 2018-04-11 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-04-11 Sprint to 2018-04-25 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Status changed from In Progress to New
Updated by Tom Morris over 6 years ago
- Target version changed from 2018-04-25 Sprint to 2018-05-09 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 6 years ago
This is what I've observed:
- Lastest sunday reports don’t show a clear pattern in nodemanager's test failures. Integration tests weren't the main failing kind of tests lately so I focused on unit tests.
- I ran nodemanager’s unit tests locally and got mixed failure modes, sometimes they run ok 30 times in a row, sometimes they fail after a few runs.
- The most observed piece of code that fails seems to be the
busywait()
call ontest_daemon.py
- At first, tried different sleep values to see if that made a difference, but didn’t.
- From these failures, it seems to me that the ones that fail the most are those that pass
paired_monitor_count()
tobusywait()
paired_monitor_count()
callsmonitored_arvados_nodes(False)
which ignoresActorDeadError
exceptions, so I thought that could be hiding some error but when I ran the tests without thattry:except:
code, it didn’t seem to make a difference.
- My suspicion is that
monitored_arvados_nodes(False)
is not working reliably as of the 41busywait()
calls ontest_daemon.py
, only 12 use it indirectly viapaired_monitor_count()
, but most of the failures come from tests that callbusywait()
with it as an argument.
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-05-09 Sprint to 2018-05-23 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Target version changed from 2018-05-23 Sprint to 2018-06-06 Sprint
Updated by Lucas Di Pentima over 6 years ago
- Assigned To changed from Lucas Di Pentima to Tom Clegg
Updated by Tom Clegg over 6 years ago
The integration tests have never been reliable, are deathly slow, and don't report failures usefully. I propose we admit that it was a mistake to insert them into our test pipeline, and skip them.
The regular tests seem reliable after this fix:- 12061-flaky-tests @ d8dfc75ec5c6cead3da0f3907466ce1b89373b69
Updated by Tom Clegg over 6 years ago
Well, nearly reliable. ~1/300 failure rate:
====================================================================== FAIL: test_nonfatal_error (tests.test_failure.ActorUnhandledExceptionTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/tom/arvados/services/nodemanager/tests/test_failure.py", line 55, in test_nonfatal_error self.assertEqual(1, status.tracker.get('actor_exceptions')) AssertionError: 1 != 0
I expect this is a separate bug.
Updated by Lucas Di Pentima over 6 years ago
Fix at d8dfc75ec5c6cead3da0f3907466ce1b89373b69 lgtm, what a tricky hidden bug!
Updated by Tom Clegg over 6 years ago
- fixes sporadic test_nonfatal_error failures.
Updated by Tom Clegg over 6 years ago
12061-nm-integration-tests @ 32568bbde8ad549452ccb57f10bb54672fda6fa6
With a few timeout adjustments, the integration tests complete in about 35 seconds (when they don't fail randomly). The biggest slowdowns were- 1-minute shutdown window closure. 0 is unworkable because nodes become eligible for shutdown immediately before they can be marked busy. Updated config to allow fractional window sizes like 0.05 minutes.
- Needlessly long 5-second poll interval in tests. Likewise, fractions are allowed now, tests use 0.5 seconds.
- Occasional long delay between SIGTERM and actual exit. Now uses SIGKILL instead.
They're still unreliable and fragile, though. I've disabled them in run-tests; you can still run them with --only services/nodemanager_integration
.
- Use the Python test framework, so they can be invoked the usual way, and
--test-suite=...
can be used to select individual tests, etc. - Name the "action" handlers according to what they do (e.g., a func that changes all nodes to idle state should not be called node_busy()).
- When starting a nodemanager process, use the nodemanager version in the current test dir, not the last version that got installed in $PATH.
- Fix underlying racy behavior. The basic problem seems to be an assumption that "NM logs something; test program updates something; NM sees the new environment; NM makes next decision" can work, vs. the reality that NM has many threads whose timing is independent of one another and of the test program's stub updates. For example, shutdown-because-state=down "shouldn't happen" according to the tests, but of course it does sometimes -- the test program doesn't always have time to see "shutdown success" and update the sinfo stub before nodemanager invokes it again, and when nodemanager wins this race, it correctly decides to reattempt shutdown. It seems like there's an endless supply of similar test-killing races here.
Updated by Lucas Di Pentima over 6 years ago
Just.. wow! the speed improvement is great! Very clever approach, this LGTM! thanks so much!
Updated by Tom Clegg over 6 years ago
- Status changed from In Progress to Resolved