Project

General

Profile

Actions

Bug #12061

closed

[Node manager] Fix flaky integration tests

Added by Tom Clegg over 7 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Tests
Target version:
Story points:
-
Release:
Release relationship:
Auto

Description

Fix flaky nodemanager tests. Some examples below:

test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/515/consoleText
test_single_node_gce failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/517/consoleText
test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/521/consoleText
test_probe_quota failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/512/consoleText
test_single_node_aws failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/522/consoleText
test_single_node_aws failed - https://ci.curoverse.com/job/run-tests-services-nodemanager/526/consoleText

Also this from #9984:

https://ci.curoverse.com/job/run-tests-remainder/102/consoleFull

FAIL: test_shutdown_declined_when_idle_and_job_queued (tests.test_daemon.NodeManagerDaemonActorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/1/jenkins/workspace/run-tests-remainder/services/nodemanager/tests/test_daemon.py", line 499, in test_shutdown_declined_when_idle_and_job_queued
self.assertFalse(self.node_shutdown.start.called)
AssertionError: True is not false


Subtasks 1 (0 open1 closed)

Task #12065: Review 12061-nm-integration-testsClosedTom Clegg08/02/2017Actions

Related issues 1 (0 open1 closed)

Related to Arvados - Bug #9984: [Testing] test_shutdown_declined_when_idle_and_job_queued flakyClosedActions
Actions #1

Updated by Peter Amstutz over 7 years ago

  • Target version changed from 2017-08-02 sprint to 2017-08-16 sprint
Actions #2

Updated by Peter Amstutz over 7 years ago

  • Assigned To set to Peter Amstutz
Actions #3

Updated by Tom Morris over 7 years ago

  • Target version changed from 2017-08-16 sprint to 2017-08-30 Sprint
Actions #4

Updated by Peter Amstutz over 7 years ago

  • Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint
Actions #5

Updated by Peter Amstutz over 7 years ago

  • Target version changed from 2017-09-13 Sprint to Arvados Future Sprints
Actions #6

Updated by Tom Morris about 7 years ago

  • Assigned To deleted (Peter Amstutz)
Actions #7

Updated by Tom Morris almost 7 years ago

  • Description updated (diff)
  • Target version changed from Arvados Future Sprints to 2018-03-28 Sprint
Actions #8

Updated by Tom Morris almost 7 years ago

  • Description updated (diff)
Actions #9

Updated by Tom Morris almost 7 years ago

  • Related to Bug #9984: [Testing] test_shutdown_declined_when_idle_and_job_queued flaky added
Actions #10

Updated by Peter Amstutz almost 7 years ago

  • Assigned To set to Peter Amstutz
Actions #11

Updated by Peter Amstutz almost 7 years ago

  • Assigned To changed from Peter Amstutz to Lucas Di Pentima
Actions #12

Updated by Lucas Di Pentima over 6 years ago

  • Target version changed from 2018-03-28 Sprint to 2018-04-11 Sprint
Actions #13

Updated by Lucas Di Pentima over 6 years ago

  • Status changed from New to In Progress
Actions #14

Updated by Lucas Di Pentima over 6 years ago

  • Target version changed from 2018-04-11 Sprint to 2018-04-25 Sprint
Actions #15

Updated by Lucas Di Pentima over 6 years ago

  • Status changed from In Progress to New
Actions #16

Updated by Tom Morris over 6 years ago

  • Target version changed from 2018-04-25 Sprint to 2018-05-09 Sprint
Actions #17

Updated by Lucas Di Pentima over 6 years ago

  • Status changed from New to In Progress
Actions #18

Updated by Lucas Di Pentima over 6 years ago

This is what I've observed:

  • Lastest sunday reports don’t show a clear pattern in nodemanager's test failures. Integration tests weren't the main failing kind of tests lately so I focused on unit tests.
  • I ran nodemanager’s unit tests locally and got mixed failure modes, sometimes they run ok 30 times in a row, sometimes they fail after a few runs.
  • The most observed piece of code that fails seems to be the busywait() call on test_daemon.py
    • At first, tried different sleep values to see if that made a difference, but didn’t.
    • From these failures, it seems to me that the ones that fail the most are those that pass paired_monitor_count() to busywait()
    • paired_monitor_count() calls monitored_arvados_nodes(False) which ignores ActorDeadError exceptions, so I thought that could be hiding some error but when I ran the tests without that try:except: code, it didn’t seem to make a difference.
  • My suspicion is that monitored_arvados_nodes(False) is not working reliably as of the 41 busywait() calls on test_daemon.py, only 12 use it indirectly via paired_monitor_count(), but most of the failures come from tests that call busywait() with it as an argument.
Actions #19

Updated by Lucas Di Pentima over 6 years ago

  • Target version changed from 2018-05-09 Sprint to 2018-05-23 Sprint
Actions #20

Updated by Lucas Di Pentima over 6 years ago

  • Target version changed from 2018-05-23 Sprint to 2018-06-06 Sprint
Actions #21

Updated by Lucas Di Pentima over 6 years ago

  • Assigned To changed from Lucas Di Pentima to Tom Clegg
Actions #22

Updated by Tom Clegg over 6 years ago

The integration tests have never been reliable, are deathly slow, and don't report failures usefully. I propose we admit that it was a mistake to insert them into our test pipeline, and skip them.

The regular tests seem reliable after this fix:
Actions #23

Updated by Tom Clegg over 6 years ago

Well, nearly reliable. ~1/300 failure rate:

======================================================================
FAIL: test_nonfatal_error (tests.test_failure.ActorUnhandledExceptionTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tom/arvados/services/nodemanager/tests/test_failure.py", line 55, in test_nonfatal_error
    self.assertEqual(1, status.tracker.get('actor_exceptions'))
AssertionError: 1 != 0

I expect this is a separate bug.

Actions #24

Updated by Lucas Di Pentima over 6 years ago

Fix at d8dfc75ec5c6cead3da0f3907466ce1b89373b69 lgtm, what a tricky hidden bug!

Actions #25

Updated by Tom Clegg over 6 years ago

12061-flaky-tests @ 2f9761bd08293de1847d3dbd0ab16482eacc414d
  • fixes sporadic test_nonfatal_error failures.
Actions #26

Updated by Tom Clegg over 6 years ago

12061-nm-integration-tests @ 32568bbde8ad549452ccb57f10bb54672fda6fa6

With a few timeout adjustments, the integration tests complete in about 35 seconds (when they don't fail randomly). The biggest slowdowns were
  • 1-minute shutdown window closure. 0 is unworkable because nodes become eligible for shutdown immediately before they can be marked busy. Updated config to allow fractional window sizes like 0.05 minutes.
  • Needlessly long 5-second poll interval in tests. Likewise, fractions are allowed now, tests use 0.5 seconds.
  • Occasional long delay between SIGTERM and actual exit. Now uses SIGKILL instead.

They're still unreliable and fragile, though. I've disabled them in run-tests; you can still run them with --only services/nodemanager_integration.

Suggested requirements for making this suite useful:
  • Use the Python test framework, so they can be invoked the usual way, and --test-suite=... can be used to select individual tests, etc.
  • Name the "action" handlers according to what they do (e.g., a func that changes all nodes to idle state should not be called node_busy()).
  • When starting a nodemanager process, use the nodemanager version in the current test dir, not the last version that got installed in $PATH.
  • Fix underlying racy behavior. The basic problem seems to be an assumption that "NM logs something; test program updates something; NM sees the new environment; NM makes next decision" can work, vs. the reality that NM has many threads whose timing is independent of one another and of the test program's stub updates. For example, shutdown-because-state=down "shouldn't happen" according to the tests, but of course it does sometimes -- the test program doesn't always have time to see "shutdown success" and update the sinfo stub before nodemanager invokes it again, and when nodemanager wins this race, it correctly decides to reattempt shutdown. It seems like there's an endless supply of similar test-killing races here.
Actions #27

Updated by Lucas Di Pentima over 6 years ago

Just.. wow! the speed improvement is great! Very clever approach, this LGTM! thanks so much!

Actions #28

Updated by Tom Clegg over 6 years ago

  • Status changed from In Progress to Resolved
Actions #29

Updated by Tom Morris over 6 years ago

  • Release set to 13
Actions

Also available in: Atom PDF