Project

General

Profile

Actions

Bug #18670

closed

flaky test suite.TestSubmit in lib/lsf

Added by Tom Clegg about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Story points:
-
Release relationship:
Auto

Subtasks 1 (0 open1 closed)

Task #18673: Review 18670-flaky-lsf-testResolvedTom Clegg01/25/2022Actions
Actions #1

Updated by Tom Clegg about 2 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Tom Clegg about 2 years ago

The "bkill" stub randomly fails 10% of the time, so the 5-second retry delay was occasionally leaving the job in the fake lsf queue longer than the test's 20 second timeout.

Changed the 5-second delay to 1/2 of the configured PollInterval, and shortened some other timeouts in the test case to speed things up.

18670-flaky-lsf-test @ c361e51569e28f30bd034ac240b936346224a0d0 -- developer-run-tests: #2889

Actions #3

Updated by Lucas Di Pentima about 2 years ago

It seems that the tests are still flaky, I've ran the lib/lsf tests from main and also from this branch and they failed in the same rate: 30-50% (the new branch tests ran a lot faster, though!)

My test runs were done in interactive mode, 20 tests at a time. Have you tried something like that on your end?

Actions #4

Updated by Tom Clegg about 2 years ago

Huh. Yes, I have been testing with "20 test lib/lsf" so I thought that must have done it. But I tried another set of 20 just now, and two failed. The successes all take 5s or less, and timeout/failure takes 20s, so I'm suspecting a second/different bug.

Actions #5

Updated by Tom Clegg about 2 years ago

18670-flaky-lsf-test @ 1789aa86c580495f0a722289cec41c4e31872e26 -- developer-run-tests: #2892

This was happening:
  • RunContainer returns an error without finalizing container (e.g., "bsub" fails)
  • start()'s "tracker" goroutine unlocks the container, then deletes its entry in the trackers map
  • Meanwhile (after unlocking, but before deleting the tracker entry):
    • checkListForUpdates() processes a queue update with State=Queued, closes the tracker, and deletes its entry in "trackers"
    • checkListForUpdates() processes a queue update with State=Queued, locks the container, and starts a new tracker
    • the new tracker detects that the container cannot be run, and updates state to Cancelled
  • the old tracker's goroutine therefore mistakenly deletes the new tracker, not the old one
  • the new tracker's channel never receives any updates, and never closes
  • the new tracker's runContainer() waits for an update with state=Cancelled before calling "bkill", which never happens
  • the new tracker's LSF job stays in the LSF queue, which is (correctly) flagged by the test case
The fix:
  • tracker func in start() takes over listening to the "updates" channel after RunContainer() returns -- keeps trying to requeue/cancel the container (depending on RunContainer result) until checkListForUpdates() closes the channel
  • checkListForUpdates() is solely responsible for deleting tracker entries when they are seen to be requeued/cancelled in a queue update (mutex is already in place so it doesn't race with itself)
Actions #6

Updated by Lucas Di Pentima about 2 years ago

This LGTM, tested locally many many times, and didn't got any failure. Thanks!

Actions #7

Updated by Tom Clegg about 2 years ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|c1e7f148bf3340300ae2f41d1ba7588cdfbb3b42.

Actions #8

Updated by Peter Amstutz about 2 years ago

  • Release set to 46
Actions

Also available in: Atom PDF