[Node manager] Fix flaky integration tests
Fix flaky nodemanager tests. Some examples below:
test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/515/consoleText
test_single_node_gce failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/517/consoleText
test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/521/consoleText
test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/512/consoleText
test_single_node_aws failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/522/consoleText
test_single_node_aws failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/526/consoleText
Also this from #9984:
FAIL: test_shutdown_declined_when_idle_and_job_queued (tests.test_daemon.NodeManagerDaemonActorTestCase)
Traceback (most recent call last):
File "/data/1/jenkins/workspace/run-tests-remainder/services/nodemanager/tests/test_daemon.py", line 499, in test_shutdown_declined_when_idle_and_job_queued
AssertionError: True is not false
#18 Updated by Lucas Di Pentima over 1 year ago
This is what I've observed:
- Lastest sunday reports don’t show a clear pattern in nodemanager's test failures. Integration tests weren't the main failing kind of tests lately so I focused on unit tests.
- I ran nodemanager’s unit tests locally and got mixed failure modes, sometimes they run ok 30 times in a row, sometimes they fail after a few runs.
- The most observed piece of code that fails seems to be the
- At first, tried different sleep values to see if that made a difference, but didn’t.
- From these failures, it seems to me that the ones that fail the most are those that pass
ActorDeadErrorexceptions, so I thought that could be hiding some error but when I ran the tests without that
try:except:code, it didn’t seem to make a difference.
- My suspicion is that
monitored_arvados_nodes(False)is not working reliably as of the 41
test_daemon.py, only 12 use it indirectly via
paired_monitor_count(), but most of the failures come from tests that call
busywait()with it as an argument.
#22 Updated by Tom Clegg over 1 year ago
The integration tests have never been reliable, are deathly slow, and don't report failures usefully. I propose we admit that it was a mistake to insert them into our test pipeline, and skip them.The regular tests seem reliable after this fix:
- 12061-flaky-tests @ d8dfc75ec5c6cead3da0f3907466ce1b89373b69
#23 Updated by Tom Clegg over 1 year ago
Well, nearly reliable. ~1/300 failure rate:
====================================================================== FAIL: test_nonfatal_error (tests.test_failure.ActorUnhandledExceptionTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/tom/arvados/services/nodemanager/tests/test_failure.py", line 55, in test_nonfatal_error self.assertEqual(1, status.tracker.get('actor_exceptions')) AssertionError: 1 != 0
I expect this is a separate bug.
#26 Updated by Tom Clegg over 1 year ago
12061-nm-integration-tests @ 32568bbde8ad549452ccb57f10bb54672fda6fa6With a few timeout adjustments, the integration tests complete in about 35 seconds (when they don't fail randomly). The biggest slowdowns were
- 1-minute shutdown window closure. 0 is unworkable because nodes become eligible for shutdown immediately before they can be marked busy. Updated config to allow fractional window sizes like 0.05 minutes.
- Needlessly long 5-second poll interval in tests. Likewise, fractions are allowed now, tests use 0.5 seconds.
- Occasional long delay between SIGTERM and actual exit. Now uses SIGKILL instead.
They're still unreliable and fragile, though. I've disabled them in run-tests; you can still run them with
- Use the Python test framework, so they can be invoked the usual way, and
--test-suite=...can be used to select individual tests, etc.
- Name the "action" handlers according to what they do (e.g., a func that changes all nodes to idle state should not be called node_busy()).
- When starting a nodemanager process, use the nodemanager version in the current test dir, not the last version that got installed in $PATH.
- Fix underlying racy behavior. The basic problem seems to be an assumption that "NM logs something; test program updates something; NM sees the new environment; NM makes next decision" can work, vs. the reality that NM has many threads whose timing is independent of one another and of the test program's stub updates. For example, shutdown-because-state=down "shouldn't happen" according to the tests, but of course it does sometimes -- the test program doesn't always have time to see "shutdown success" and update the sinfo stub before nodemanager invokes it again, and when nodemanager wins this race, it correctly decides to reattempt shutdown. It seems like there's an endless supply of similar test-killing races here.