Bug #18670

flaky test suite.TestSubmit in lib/lsf

Added by Tom Clegg 4 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 01/25/2022
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Story points: -
Release relationship: Auto

Subtasks

Task #18673: Review 18670-flaky-lsf-test (Resolved, Tom Clegg)

History

#1 Updated by Tom Clegg 4 months ago

  • Status changed from New to In Progress

#2 Updated by Tom Clegg 4 months ago

The "bkill" stub randomly fails 10% of the time, so the 5-second retry delay was occasionally leaving the job in the fake LSF queue longer than the test's 20-second timeout.

Changed the fixed 5-second delay to half of the configured PollInterval, and shortened some other timeouts in the test case to speed things up.

18670-flaky-lsf-test @ c361e51569e28f30bd034ac240b936346224a0d0 -- developer-run-tests: #2889

#3 Updated by Lucas Di Pentima 4 months ago

It seems that the tests are still flaky. I ran the lib/lsf tests from main and also from this branch, and they failed at the same rate: 30-50% (the new branch's tests ran a lot faster, though!)

My test runs were done in interactive mode, 20 tests at a time. Have you tried something like that on your end?

#4 Updated by Tom Clegg 4 months ago

Huh. Yes, I have been testing with "20 test lib/lsf", so I thought that must have done it. But I tried another set of 20 just now, and two failed. The successes all take 5s or less, while a timeout/failure takes 20s, so I suspect a second, different bug.

#5 Updated by Tom Clegg 4 months ago

18670-flaky-lsf-test @ 1789aa86c580495f0a722289cec41c4e31872e26 -- developer-run-tests: #2892

This was happening:
  • RunContainer returns an error without finalizing container (e.g., "bsub" fails)
  • start()'s "tracker" goroutine unlocks the container, then deletes its entry in the trackers map
  • Meanwhile (after unlocking, but before deleting the tracker entry):
    • checkListForUpdates() processes a queue update with State=Queued, closes the tracker, and deletes its entry in "trackers"
    • checkListForUpdates() processes a queue update with State=Queued, locks the container, and starts a new tracker
    • the new tracker detects that the container cannot be run, and updates state to Cancelled
  • the old tracker's goroutine therefore mistakenly deletes the new tracker, not the old one
  • the new tracker's channel never receives any updates, and never closes
  • the new tracker's runContainer() waits for an update with state=Cancelled before calling "bkill", which never happens
  • the new tracker's LSF job stays in the LSF queue, which is (correctly) flagged by the test case
The fix:
  • tracker func in start() takes over listening to the "updates" channel after RunContainer() returns -- keeps trying to requeue/cancel the container (depending on RunContainer result) until checkListForUpdates() closes the channel
  • checkListForUpdates() is solely responsible for deleting tracker entries when they are seen to be requeued/cancelled in a queue update (mutex is already in place so it doesn't race with itself)
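The core hazard above is a delete-by-key race on the trackers map: the old tracker's goroutine deletes whatever entry currently sits under its UUID, which by then may be a newly started tracker. A minimal sketch of that hazard, under assumed names (`tracker`, `Dispatcher`, and the method names here are hypothetical, and this conditional-delete guard is one general way to avoid the race, alongside the actual fix of making checkListForUpdates() the sole deleter):

```go
package main

import (
	"fmt"
	"sync"
)

// tracker stands in for a per-container tracker with an updates channel.
type tracker struct{ updates chan string }

// Dispatcher holds one tracker per container UUID.
type Dispatcher struct {
	mtx      sync.Mutex
	trackers map[string]*tracker
}

// Buggy pattern: delete whatever entry is under the UUID -- possibly a
// *new* tracker that was registered after this goroutine's tracker was
// already removed and replaced.
func (d *Dispatcher) deleteUnconditionally(uuid string) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	delete(d.trackers, uuid)
}

// Safer pattern: delete the entry only if it is still the tracker this
// caller owns, so a replacement tracker is never orphaned.
func (d *Dispatcher) deleteIfCurrent(uuid string, tr *tracker) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	if d.trackers[uuid] == tr {
		delete(d.trackers, uuid)
	}
}

func main() {
	d := &Dispatcher{trackers: map[string]*tracker{}}
	old := &tracker{updates: make(chan string, 1)}
	d.trackers["zzzzz"] = old

	// The queue-update loop replaces the tracker while the old
	// goroutine is still winding down...
	newer := &tracker{updates: make(chan string, 1)}
	d.trackers["zzzzz"] = newer

	// ...so an unconditional delete by the old goroutine would remove
	// the new tracker; the conditional delete leaves it in place.
	d.deleteIfCurrent("zzzzz", old)
	fmt.Println(d.trackers["zzzzz"] == newer) // true
}
```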

#6 Updated by Lucas Di Pentima 4 months ago

This LGTM. Tested locally many, many times and didn't get any failures. Thanks!

#7 Updated by Tom Clegg 4 months ago

  • Status changed from In Progress to Resolved

Applied in changeset arvados-private:commit:arvados|c1e7f148bf3340300ae2f41d1ba7588cdfbb3b42.

#8 Updated by Peter Amstutz 2 months ago

  • Release set to 46
