Bug #18670 (Closed): flaky test suite.TestSubmit in lib/lsf
Updated by Tom Clegg about 3 years ago
The "bkill" stub randomly fails 10% of the time, so the 5-second retry delay was occasionally leaving the job in the fake LSF queue longer than the test's 20-second timeout.
Changed the 5-second delay to 1/2 of the configured PollInterval, and shortened some other timeouts in the test case to speed things up.
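To make the timing change concrete, here is a minimal standalone Go sketch (not the actual lib/lsf code; fakeBkill, cancelWithRetry, and their signatures are invented for illustration): the pause between "bkill" retries is derived from the configured PollInterval, so a test that sets a short PollInterval also gets short retry delays.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// fakeBkill imitates the test's bkill stub: it fails at random roughly 10%
// of the time.
func fakeBkill(jobID string) error {
	if rand.Float64() < 0.1 {
		return fmt.Errorf("fake bkill %s: induced failure", jobID)
	}
	return nil
}

// cancelWithRetry keeps calling bkill until it succeeds, pausing for half of
// the configured poll interval between attempts instead of a fixed 5 seconds.
func cancelWithRetry(jobID string, pollInterval time.Duration, bkill func(string) error) {
	retryDelay := pollInterval / 2
	for bkill(jobID) != nil {
		time.Sleep(retryDelay)
	}
}

func main() {
	// With a poll interval of 100ms, a failed bkill is retried after 50ms
	// rather than 5s, keeping the worst case far inside a 20-second test
	// timeout.
	cancelWithRetry("job123", 100*time.Millisecond, fakeBkill)
	fmt.Println("job cancelled")
}
```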
18670-flaky-lsf-test @ c361e51569e28f30bd034ac240b936346224a0d0 -- developer-run-tests: #2889
Updated by Lucas Di Pentima about 3 years ago
It seems that the tests are still flaky. I've run the lib/lsf tests from main and also from this branch, and they failed at about the same rate: 30-50% (the new branch's tests ran a lot faster, though!)
My test runs were done in interactive mode, 20 tests at a time. Have you tried something like that on your end?
Updated by Tom Clegg about 3 years ago
Huh. Yes, I have been testing with "20 test lib/lsf" so I thought that must have done it. But I tried another set of 20 just now, and two failed. The successes all take 5s or less, and timeout/failure takes 20s, so I'm suspecting a second/different bug.
Updated by Tom Clegg about 3 years ago
18670-flaky-lsf-test @ 1789aa86c580495f0a722289cec41c4e31872e26 -- developer-run-tests: #2892
This was happening (a condensed sketch follows the list):
- RunContainer returns an error without finalizing the container (e.g., "bsub" fails)
- start()'s "tracker" goroutine unlocks the container, then deletes its entry in the trackers map
- Meanwhile (after unlocking, but before deleting the tracker entry):
  - checkListForUpdates() processes a queue update with State=Queued, closes the tracker, and deletes its entry in "trackers"
  - checkListForUpdates() processes a queue update with State=Queued, locks the container, and starts a new tracker
  - the new tracker detects that the container cannot be run, and updates the container state to Cancelled
- the old tracker's goroutine therefore mistakenly deletes the new tracker, not the old one
- the new tracker's channel never receives any updates, and never closes
- the new tracker's runContainer() waits for an update with state=Cancelled before calling "bkill", which never happens
- the new tracker's LSF job stays in the LSF queue, which is (correctly) flagged by the test case
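A condensed, hypothetical Go sketch of that sequence (the tracker and dispatcher types, the string channel, and the wiring are invented for illustration; only the names start, checkListForUpdates, RunContainer, trackers, and bkill come from the notes above). The crux is that the old goroutine's deferred cleanup deletes by container UUID alone, so it can remove whichever tracker currently occupies that map slot:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Simplified stand-ins: only the map/UUID bookkeeping relevant to the race
// is modeled here.
type tracker struct {
	updates chan string // runContainer waits here for a "Cancelled" update
}

type dispatcher struct {
	mtx      sync.Mutex
	trackers map[string]*tracker
}

// start models the buggy cleanup path: after RunContainer fails and the
// container is unlocked, the goroutine deletes its trackers entry by UUID.
func (disp *dispatcher) start(uuid string) {
	tkr := &tracker{updates: make(chan string)}
	disp.mtx.Lock()
	disp.trackers[uuid] = tkr
	disp.mtx.Unlock()
	go func() {
		// ... RunContainer(uuid) returns an error; the container is unlocked ...
		// Window: checkListForUpdates can run here, drop this tracker, and
		// install a replacement under the same UUID.
		disp.mtx.Lock()
		delete(disp.trackers, uuid) // may remove the *new* tracker, not tkr
		disp.mtx.Unlock()
	}()
}

// checkListForUpdates models (collapsed into one step) the two queue updates
// described above: it closes and removes the old tracker, then starts a
// fresh one for the same container UUID.
func (disp *dispatcher) checkListForUpdates(uuid string) {
	disp.mtx.Lock()
	defer disp.mtx.Unlock()
	if old, ok := disp.trackers[uuid]; ok {
		close(old.updates)
		delete(disp.trackers, uuid)
	}
	disp.trackers[uuid] = &tracker{updates: make(chan string)}
}

func main() {
	disp := &dispatcher{trackers: map[string]*tracker{}}
	disp.start("container1")
	disp.checkListForUpdates("container1")
	time.Sleep(10 * time.Millisecond) // give the old goroutine time to run its delete
	disp.mtx.Lock()
	// Often 0: the old goroutine deleted the tracker installed by
	// checkListForUpdates, leaving a tracker whose channel is never fed or
	// closed -- so its bkill never runs.
	fmt.Println("trackers still registered:", len(disp.trackers))
	disp.mtx.Unlock()
}
```

Whether the delayed delete lands on the old or the new tracker depends on goroutine scheduling, which is consistent with a failure that only shows up on some runs.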
The fix (sketched after the list):
- the tracker func in start() takes over listening to the "updates" channel after RunContainer() returns -- it keeps trying to requeue/cancel the container (depending on the RunContainer result) until checkListForUpdates() closes the channel
- checkListForUpdates() is solely responsible for deleting tracker entries, which it does when a queue update shows them requeued/cancelled (the mutex is already in place, so it doesn't race with itself)
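A matching sketch of the fixed ownership rules (same caveats: the types and signatures are invented; only the division of responsibility between start()'s goroutine and checkListForUpdates() follows the notes above). The goroutine only drains its own channel, and map entries are deleted in exactly one place, under the mutex:

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins again; only the ownership rules are modeled.
type tracker struct {
	updates chan string // container states seen in queue updates
}

type dispatcher struct {
	mtx      sync.Mutex
	trackers map[string]*tracker
}

// start's goroutine never touches the trackers map. After RunContainer
// returns, it takes over the updates channel and keeps retrying the
// requeue/cancel action until the channel is closed.
func (disp *dispatcher) start(uuid string, tkr *tracker, cancel func()) {
	go func() {
		// ... RunContainer(uuid) has returned (with or without an error) ...
		for range tkr.updates {
			cancel() // or requeue, depending on RunContainer's result
		}
		// Channel closed by checkListForUpdates: nothing left to clean up here.
	}()
}

// checkListForUpdates is the only place tracker entries are deleted. When a
// queue update shows the container requeued or cancelled, it closes the
// channel and removes the entry, all under the same mutex, so a stale
// goroutine can no longer delete a tracker it doesn't own.
func (disp *dispatcher) checkListForUpdates(uuid, state string) {
	disp.mtx.Lock()
	defer disp.mtx.Unlock()
	tkr, ok := disp.trackers[uuid]
	if !ok {
		return
	}
	switch state {
	case "Queued", "Cancelled":
		close(tkr.updates)
		delete(disp.trackers, uuid)
	default:
		// Non-terminal states would be forwarded on tkr.updates here
		// (omitted in this sketch).
	}
}

func main() {
	disp := &dispatcher{trackers: map[string]*tracker{}}
	tkr := &tracker{updates: make(chan string)}
	disp.trackers["container1"] = tkr
	disp.start("container1", tkr, func() { fmt.Println("bkill container1") })
	disp.checkListForUpdates("container1", "Cancelled") // closes the channel, deletes the entry
}
```

With all deletions funneled through checkListForUpdates(), the worst a stale goroutine can do is retry its cancel/requeue a few extra times until its channel is closed, instead of silently orphaning a newer tracker.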
Updated by Lucas Di Pentima about 3 years ago
This LGTM, tested locally many, many times, and didn't get any failure. Thanks!
Updated by Tom Clegg about 3 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados-private:commit:arvados|c1e7f148bf3340300ae2f41d1ba7588cdfbb3b42.