flaky test suite.TestSubmit in lib/lsf
The "bkill" stub randomly fails 10% of the time, so the 5-second retry delay was occasionally leaving the job in the fake LSF queue longer than the test's 20-second timeout.
Changed the 5-second delay to 1/2 of the configured PollInterval, and shortened some other timeouts in the test case to speed things up.
#3 Updated by Lucas Di Pentima 4 months ago
It seems that the tests are still flaky. I've run the lib/lsf tests from main and also from this branch, and they failed at the same rate: 30-50% (the new branch tests ran a lot faster, though!)
My test runs were done in interactive mode, 20 tests at a time. Have you tried something like that on your end?
- RunContainer returns an error without finalizing container (e.g., "bsub" fails)
- start()'s "tracker" goroutine unlocks the container, then deletes its entry in the trackers map
- Meanwhile (after unlocking, but before deleting the tracker entry):
  - checkListForUpdates() processes a queue update with State=Queued, closes the tracker, and deletes its entry in "trackers"
  - checkListForUpdates() processes a queue update with State=Queued, locks the container, and starts a new tracker
- the new tracker detects that the container cannot be run, and updates state to Cancelled
- the old tracker's goroutine therefore mistakenly deletes the new tracker, not the old one
- the new tracker's channel never receives any updates, and never closes
- the new tracker's runContainer() waits for an update with state=Cancelled before calling "bkill", which never happens
- the new tracker's LSF job stays in the LSF queue, which is (correctly) flagged by the test case
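The core of the sequence above can be replayed deterministically in a few lines. This is only a sketch under simplifying assumptions: the tracker type, the "container-1" key and simulateRace are hypothetical, and the steps run in a fixed order in one goroutine instead of racing.

```go
package main

import "fmt"

// tracker is a hypothetical stand-in for the per-container tracker.
type tracker struct {
	id      int
	updates chan string
}

// simulateRace replays the interleaving step by step and reports whether
// the new tracker is still registered at the end.
func simulateRace() bool {
	const uuid = "container-1" // hypothetical container UUID
	trackers := map[string]*tracker{}

	// start() registers the old tracker; RunContainer then fails and the
	// old goroutine unlocks the container, but has not yet deleted its entry.
	oldTracker := &tracker{id: 1, updates: make(chan string, 1)}
	trackers[uuid] = oldTracker

	// Meanwhile, the queue poller sees State=Queued: it closes the old
	// tracker and deletes its entry...
	close(oldTracker.updates)
	delete(trackers, uuid)

	// ...then locks the container again and starts a new tracker under
	// the same key.
	newTracker := &tracker{id: 2, updates: make(chan string, 1)}
	trackers[uuid] = newTracker

	// The old goroutine resumes and deletes "its" entry by key, which
	// now belongs to the new tracker.
	delete(trackers, uuid)

	_, ok := trackers[uuid]
	return ok
}

func main() {
	fmt.Println("new tracker still registered:", simulateRace()) // prints "new tracker still registered: false"
}
```

Because the map is keyed by container and not by tracker instance, the late delete cannot tell the two trackers apart, and the new tracker is left with no entry through which updates could reach it.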
- tracker func in start() takes over listening to the "updates" channel after RunContainer() returns -- keeps trying to requeue/cancel the container (depending on RunContainer result) until checkListForUpdates() closes the channel
- checkListForUpdates() is solely responsible for deleting tracker entries when they are seen to be requeued/cancelled in a queue update (mutex is already in place so it doesn't race with itself)
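A rough sketch of this ownership rule, assuming a much-simplified dispatcher: the types, the state strings and the demo function are illustrative and are not the actual lib/lsf code. The tracker goroutine only drains its channel, and the map is modified for removal in one place only.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// dispatcher is a hypothetical, stripped-down model of the real one.
type dispatcher struct {
	mtx      sync.Mutex
	trackers map[string]chan string
}

// start registers a tracker and returns a channel that is closed when the
// tracker goroutine exits. The goroutine never deletes from d.trackers.
func (d *dispatcher) start(uuid string, run func() error) <-chan struct{} {
	updates := make(chan string, 1)
	done := make(chan struct{})
	d.mtx.Lock()
	d.trackers[uuid] = updates
	d.mtx.Unlock()
	go func() {
		defer close(done)
		_ = run() // real code would requeue or cancel depending on the result
		// Keep listening until the poller closes the channel.
		for range updates {
		}
	}()
	return done
}

// checkListForUpdates is the only place where tracker entries are deleted.
func (d *dispatcher) checkListForUpdates(uuid, state string) {
	d.mtx.Lock()
	defer d.mtx.Unlock()
	ch, ok := d.trackers[uuid]
	if !ok {
		return
	}
	if state == "Queued" || state == "Cancelled" {
		close(ch)
		delete(d.trackers, uuid)
		return
	}
	select {
	case ch <- state:
	default: // drop the update rather than block while holding the mutex
	}
}

// demo replays the earlier scenario and returns how many trackers remain.
func demo() int {
	d := &dispatcher{trackers: map[string]chan string{}}
	failing := func() error { return errors.New("bsub failed") }
	oldDone := d.start("container-1", failing)
	d.checkListForUpdates("container-1", "Queued") // poller removes the old tracker
	d.start("container-1", failing)                // new tracker, same key
	<-oldDone                                      // old goroutine exits without touching the map
	d.mtx.Lock()
	defer d.mtx.Unlock()
	return len(d.trackers)
}

func main() {
	fmt.Println("trackers registered:", demo()) // prints "trackers registered: 1"
}
```

In this sketch the new tracker survives the old goroutine's exit, since only the poller, under the mutex, ever closes a channel or removes an entry. The new tracker's goroutine is left running at the end of demo; a later Queued or Cancelled update would close it.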