Arvados - Bug #18670: flaky test suite.TestSubmit in lib/lsf https://dev.arvados.org/issues/18670?journal_id=100207 2022-01-24T20:07:48Z Tom Clegg tom@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Bug #18670: flaky test suite.TestSubmit in lib/lsf https://dev.arvados.org/issues/18670?journal_id=100225 2022-01-25T04:50:52Z Tom Clegg tom@curii.com
<ul></ul><p>The "bkill" stub randomly fails 10% of the time, so the 5-second retry delay occasionally left the job in the fake LSF queue longer than the test's 20-second timeout.</p>
<p>Changed the 5-second delay to half of the configured PollInterval, and shortened some other timeouts in the test case to speed things up.</p>
<p>18670-flaky-lsf-test @ <a class="changeset" title="18670: Fix unreliable test. Arvados-DCO-1.1-Signed-off-by: Tom Clegg &lt;tom@curii.com&gt;" href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/c361e51569e28f30bd034ac240b936346224a0d0">c361e51569e28f30bd034ac240b936346224a0d0</a> -- <a class="external" href="https://ci.arvados.org/view/Developer/job/developer-run-tests/2889/">developer-run-tests: #2889 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&amp;build=2889" alt="" /></a></p> Arvados - Bug #18670: flaky test suite.TestSubmit in lib/lsf https://dev.arvados.org/issues/18670?journal_id=100246 2022-01-25T16:39:52Z Lucas Di Pentima lucas.dipentima@curii.com
<ul></ul><p>It seems the tests are still flaky. I ran the <code>lib/lsf</code> tests from <code>main</code> and from this branch, and they failed at the same rate: 30-50% (the tests on the new branch ran a lot faster, though!)</p>
<p>My test runs were done in interactive mode, 20 tests at a time. Have you tried something like that on your end?</p> Arvados - Bug #18670: flaky test suite.TestSubmit in lib/lsf https://dev.arvados.org/issues/18670?journal_id=100252 2022-01-25T17:57:27Z Tom Clegg tom@curii.com
<ul></ul><p>Huh. Yes, I have been testing with "20 test lib/lsf", so I thought that must have done it. But I tried another set of 20 just now, and two failed. The successes all take 5s or less, and a timeout/failure takes 20s, so I suspect a second, different bug.</p> Arvados - Bug #18670: flaky test suite.TestSubmit in lib/lsf https://dev.arvados.org/issues/18670?journal_id=100320 2022-01-26T18:33:43Z Tom Clegg tom@curii.com
<ul></ul><p>18670-flaky-lsf-test @ <a class="changeset" title="18670: Fix abandoned job tracker during race. Arvados-DCO-1.1-Signed-off-by: Tom Clegg &lt;tom@curi..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/1789aa86c580495f0a722289cec41c4e31872e26">1789aa86c580495f0a722289cec41c4e31872e26</a> -- <a class="external" href="https://ci.arvados.org/view/Developer/job/developer-run-tests/2892/">developer-run-tests: #2892 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&amp;build=2892" alt="" /></a></p>
This was happening:
<ul>
<li>RunContainer returns an error without finalizing container (e.g., "bsub" fails)</li>
<li>start()'s "tracker" goroutine unlocks the container, then deletes its entry in the trackers map</li>
<li>Meanwhile (after unlocking, but before deleting the tracker entry):
<ul>
<li>checkListForUpdates() processes a queue update with State=Queued, closes the tracker, and deletes its entry in "trackers" </li>
<li>checkListForUpdates() processes a queue update with State=Queued, locks the container, and starts a new tracker</li>
<li>the new tracker detects that the container cannot be run, and updates state to Cancelled</li>
</ul>
</li>
<li>the old tracker's goroutine therefore mistakenly deletes the <em>new</em> tracker, not the old one</li>
<li>the new tracker's channel never receives any updates, and never closes</li>
<li>the new tracker's runContainer() waits for an update with state=Cancelled before calling "bkill", which never happens</li>
<li>the new tracker's LSF job stays in the LSF queue, which is (correctly) flagged by the test case</li>
</ul>
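<p>The interleaving above can be replayed deterministically in a small sketch. This is a simplified model, not the dispatcher code: the <code>tracker</code> type, the <code>replay</code> function, and the <code>"container-1"</code> key are hypothetical, and the steps run sequentially in one goroutine so the outcome does not depend on scheduling. It shows that a cleanup step which deletes by key removes whatever tracker is currently registered, and that leaving deletion to the poller alone keeps the new tracker registered.</p>

```go
package main

import "fmt"

// tracker is a hypothetical stand-in for per-container tracking state.
type tracker struct{ name string }

// replay walks through the problematic interleaving step by step and
// reports whether the new tracker is still registered at the end.
// If oldGoroutineDeletes is true, the old tracker's goroutine removes
// the map entry by key after the poller has already replaced it.
func replay(oldGoroutineDeletes bool) bool {
	const uuid = "container-1" // hypothetical container key
	trackers := map[string]*tracker{}

	// The old tracker is registered; RunContainer then fails and the
	// old goroutine unlocks the container (not modeled here).
	old := &tracker{name: "old"}
	trackers[uuid] = old

	// Poller sees State=Queued: closes the old tracker, deletes its entry.
	delete(trackers, uuid)

	// Poller sees State=Queued again: locks and starts a new tracker.
	fresh := &tracker{name: "new"}
	trackers[uuid] = fresh

	// The old goroutine resumes its cleanup.
	if oldGoroutineDeletes {
		delete(trackers, uuid) // removes the NEW tracker's entry
	}

	return trackers[uuid] == fresh
}

func main() {
	fmt.Println("old goroutine deletes by key:", replay(true))
	fmt.Println("only the poller deletes:", replay(false))
}
```

<p>Once the new tracker's entry is gone, nothing routes queue updates to it, which matches the symptom of a job left in the LSF queue.</p>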
The fix:
<ul>
<li>tracker func in start() takes over listening to the "updates" channel after RunContainer() returns -- keeps trying to requeue/cancel the container (depending on RunContainer result) until checkListForUpdates() closes the channel</li>
<li>checkListForUpdates() is solely responsible for deleting tracker entries when they are seen to be requeued/cancelled in a queue update (mutex is already in place so it doesn't race with itself)</li>
</ul> Arvados - Bug #18670: flaky test suite.TestSubmit in lib/lsf https://dev.arvados.org/issues/18670?journal_id=100332 2022-01-27T13:58:12Z Lucas Di Pentima lucas.dipentima@curii.com
<ul></ul><p>This LGTM. I tested locally many, many times and didn't get any failures. Thanks!</p> Arvados - Bug #18670: flaky test suite.TestSubmit in lib/lsf https://dev.arvados.org/issues/18670?journal_id=100339 2022-01-27T15:07:15Z Tom Clegg tom@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p>Applied in changeset arvados-private:commit:arvados|c1e7f148bf3340300ae2f41d1ba7588cdfbb3b42.</p> Arvados - Bug #18670: flaky test suite.TestSubmit in lib/lsf https://dev.arvados.org/issues/18670?journal_id=102136 2022-03-24T19:28:39Z Peter Amstutz peter.amstutz@curii.com
<ul><li><strong>Release</strong> set to <i>46</i></li></ul>