Bug #12061
closed
- Target version changed from 2017-08-02 sprint to 2017-08-16 sprint
- Assigned To set to Peter Amstutz
- Target version changed from 2017-08-16 sprint to 2017-08-30 Sprint
- Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint
- Target version changed from 2017-09-13 Sprint to Arvados Future Sprints
- Assigned To deleted (Peter Amstutz)
- Description updated (diff)
- Target version changed from Arvados Future Sprints to 2018-03-28 Sprint
- Description updated (diff)
- Related to Bug #9984: [Testing] test_shutdown_declined_when_idle_and_job_queued flaky added
- Assigned To set to Peter Amstutz
- Assigned To changed from Peter Amstutz to Lucas Di Pentima
- Target version changed from 2018-03-28 Sprint to 2018-04-11 Sprint
- Status changed from New to In Progress
- Target version changed from 2018-04-11 Sprint to 2018-04-25 Sprint
- Status changed from In Progress to New
- Target version changed from 2018-04-25 Sprint to 2018-05-09 Sprint
- Status changed from New to In Progress
This is what I've observed:
- The latest Sunday reports don't show a clear pattern in nodemanager's test failures. Integration tests weren't the main kind of failing test lately, so I focused on the unit tests.
- I ran nodemanager's unit tests locally and got mixed failure modes: sometimes they run fine 30 times in a row, sometimes they fail after a few runs.
- The piece of code that fails most often seems to be the busywait() call in test_daemon.py.
- At first I tried different sleep values to see if that made a difference, but it didn't.
- Among these failures, the ones that fail the most seem to be those that pass paired_monitor_count() to busywait(). paired_monitor_count() calls monitored_arvados_nodes(False), which ignores ActorDeadError exceptions, so I thought that could be hiding some error, but running the tests without that try/except code didn't seem to make a difference.
- My suspicion is that monitored_arvados_nodes(False) is not working reliably: of the 41 busywait() calls in test_daemon.py, only 12 use it indirectly via paired_monitor_count(), but most of the failures come from tests that call busywait() with it as an argument.
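For context, a busywait-style polling helper of the kind these tests rely on can be sketched roughly as follows. The names and defaults here are assumptions for illustration; the real helper in the nodemanager test suite may differ:

```python
import time

def busywait(condition, timeout=10, interval=0.1):
    """Poll condition() until it is true or timeout seconds elapse.

    Returns the final truth value so the caller can assert on it.
    Flakiness arises when condition() (e.g. a paired_monitor_count
    check) depends on actor threads whose timing is independent of
    the test's own state updates.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return bool(condition())
```

A test would then use something like busywait(lambda: daemon.paired_monitor_count() == 2), which only tells you that the condition never became true within the window, not why.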
- Target version changed from 2018-05-09 Sprint to 2018-05-23 Sprint
- Target version changed from 2018-05-23 Sprint to 2018-06-06 Sprint
- Assigned To changed from Lucas Di Pentima to Tom Clegg
The integration tests have never been reliable, are deathly slow, and don't report failures usefully. I propose we admit that it was a mistake to insert them into our test pipeline, and skip them.
The regular tests seem reliable after this fix:
Well, nearly reliable. ~1/300 failure rate:
======================================================================
FAIL: test_nonfatal_error (tests.test_failure.ActorUnhandledExceptionTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/tom/arvados/services/nodemanager/tests/test_failure.py", line 55, in test_nonfatal_error
self.assertEqual(1, status.tracker.get('actor_exceptions'))
AssertionError: 1 != 0
I expect this is a separate bug.
12061-nm-integration-tests @ 32568bbde8ad549452ccb57f10bb54672fda6fa6
With a few timeout adjustments, the integration tests complete in about 35 seconds (when they don't fail randomly). The biggest slowdowns were
- The 1-minute shutdown window closure. A value of 0 is unworkable because nodes become eligible for shutdown immediately, before they can be marked busy. Updated the config to allow fractional window sizes like 0.05 minutes.
- A needlessly long 5-second poll interval in the tests. Fractions are allowed there now too; the tests use 0.5 seconds.
- An occasional long delay between SIGTERM and actual exit. The tests now use SIGKILL instead.
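The fractional-value change above comes down to parsing the timing options as floats instead of integers. A minimal sketch (the section and option names are illustrative assumptions, not necessarily the real nodemanager config keys):

```python
import configparser

# Hypothetical ini-style fragment; key names are assumptions.
cfg = configparser.ConfigParser()
cfg.read_string("""
[Daemon]
poll_time = 0.5

[Cloud]
shutdown_windows = 0.05, 999999
""")

# getfloat() (rather than getint()) accepts fractional values,
# so 0.05-minute windows and 0.5-second polls become expressible.
poll_time = cfg.getfloat('Daemon', 'poll_time')
shutdown_windows = [float(w) for w
                    in cfg.get('Cloud', 'shutdown_windows').split(',')]
```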
They're still unreliable and fragile, though. I've disabled them in run-tests; you can still run them with --only services/nodemanager_integration.
Suggested requirements for making this suite useful:
- Use the Python test framework, so they can be invoked the usual way, and --test-suite=... can be used to select individual tests, etc.
- Name the "action" handlers according to what they do (e.g., a func that changes all nodes to idle state should not be called node_busy()).
- When starting a nodemanager process, use the nodemanager version in the current test dir, not the last version that got installed in $PATH.
- Fix underlying racy behavior. The basic problem seems to be an assumption that "NM logs something; test program updates something; NM sees the new environment; NM makes next decision" can work, vs. the reality that NM has many threads whose timing is independent of one another and of the test program's stub updates. For example, shutdown-because-state=down "shouldn't happen" according to the tests, but of course it does sometimes -- the test program doesn't always have time to see "shutdown success" and update the sinfo stub before nodemanager invokes it again, and when nodemanager wins this race, it correctly decides to reattempt shutdown. It seems like there's an endless supply of similar test-killing races here.
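One way to sidestep that class of race (a sketch of an approach, not code from the branch) is to script the stub's responses in advance: each poll consumes the next state deterministically, and a poll that arrives "too early" after the script ends just sees the final state again rather than a state the test hasn't finished publishing:

```python
import queue

class ScriptedStub:
    """Hypothetical stand-in for a sinfo-style stub.

    The test enqueues the whole response sequence up front, so the
    stub never races with the poller: each poll consumes the next
    scripted state, and once the script runs out, further polls
    harmlessly repeat the last state.
    """

    def __init__(self, *states):
        self._responses = queue.Queue()  # Queue is thread-safe
        for state in states:
            self._responses.put(state)
        self._last = states[-1] if states else None

    def respond(self):
        # Called from the poller thread on every poll cycle.
        try:
            self._last = self._responses.get_nowait()
        except queue.Empty:
            pass  # script exhausted: keep returning the final state
        return self._last
```

With something like this, a nodemanager that "wins the race" and polls one extra time after a shutdown would already see the node gone, instead of a stale state that triggers a spurious second shutdown attempt.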
Just... wow! The speed improvement is great! Very clever approach, this LGTM! Thanks so much!
- Status changed from In Progress to Resolved