Arvados - Bug #13991: crunch-dispatch-slurm does not warn when slurm MaxJobCount reached
https://dev.arvados.org/issues/13991?journal_id=65563 2018-08-09T15:44:46Z Tom Morris tfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> set to <i>To Be Groomed</i></li></ul>
https://dev.arvados.org/issues/13991?journal_id=65570 2018-08-09T19:46:32Z Tom Clegg tom@curii.com
<p>Tested on 9tee4 with MaxJobCount=2.</p>
<pre>
$ sbatch -N1 <(printf '#!/bin/sh\nsleep 8000\n')
sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying.
^C
</pre>
<p>It would be more convenient to get a non-zero exit code, so we don't have to scrape stderr while sbatch is still running.</p>
<p>There's an <code>sbatch --immediate</code> option that fails if the allocation can't be granted, but that's not what we want either. We want to fail only if the job can't be <em>queued</em>.</p>
<p>So it seems the solution is for crunch-dispatch-slurm to monitor stderr while sbatch is running, and if that message appears:</p>
<ul>
<li>Log a suggestion to increase MaxJobCount</li>
<li>Avoid starting more sbatch processes until this one exits (incidentally, we only recently stopped serializing <em>all</em> sbatch invocations).</li>
</ul>
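<p>The approach above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual crunch-dispatch-slurm code: <code>SubmitGate</code>, <code>watch_stderr</code> and <code>submit</code> are hypothetical names, and the only string taken from Slurm is the message quoted in the sbatch output above. It assumes that matching a substring of that message is good enough to detect the condition.</p>

```python
import logging
import subprocess
import threading

# Substring of the message sbatch printed in the test above.
QUEUE_FULL_MSG = "Slurm temporarily unable to accept job"

logger = logging.getLogger("sbatch-monitor")


class SubmitGate:
    """Hypothetical helper: pauses new submissions while an sbatch is stuck retrying."""

    def __init__(self):
        self._clear = threading.Event()
        self._clear.set()

    def hold(self):
        self._clear.clear()

    def release(self):
        self._clear.set()

    def wait(self, timeout=None):
        return self._clear.wait(timeout)

    def held(self):
        return not self._clear.is_set()


def watch_stderr(lines, gate):
    """Scan sbatch stderr lines as they arrive.

    Returns True if the "unable to accept job" message was seen
    (in which case the gate is now held), False otherwise.
    """
    saw_queue_full = False
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if QUEUE_FULL_MSG in line:
            logger.warning(
                "sbatch cannot queue the job; consider increasing MaxJobCount: %s", line
            )
            gate.hold()
            saw_queue_full = True
        else:
            # Anything else on stderr is logged too, to help diagnose similar problems.
            logger.warning("unexpected sbatch stderr: %s", line)
    return saw_queue_full


def submit(sbatch_args, gate):
    """Run sbatch, watching stderr while it is still running."""
    gate.wait()  # do not start a new sbatch while another one is stuck
    proc = subprocess.Popen(
        ["sbatch", *sbatch_args],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    result = {}
    watcher = threading.Thread(
        target=lambda: result.update(held=watch_stderr(proc.stderr, gate))
    )
    watcher.start()
    stdout = proc.stdout.read()
    watcher.join()
    exit_code = proc.wait()
    if result.get("held"):
        gate.release()  # this sbatch has exited, so submissions may resume
    return exit_code, stdout
```

<p>In this sketch the gate is a single shared flag, so if two sbatch processes that were already running both hit the limit, the first one to exit reopens the gate. A real dispatcher would probably need to track that per process.</p>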
<p>We should also log any other unexpected messages from sbatch, to make other similar problems easier to diagnose.</p>
https://dev.arvados.org/issues/13991?journal_id=94240 2021-07-06T21:10:06Z Peter Amstutz peter.amstutz@curii.com
<ul><li><strong>Target version</strong> deleted (<del><i>To Be Groomed</i></del>)</li></ul>
https://dev.arvados.org/issues/13991?journal_id=111764 2023-02-14T22:22:12Z Peter Amstutz peter.amstutz@curii.com
<ul><li><strong>Release</strong> set to <i>60</i></li></ul>
https://dev.arvados.org/issues/13991?journal_id=123145 2024-03-01T21:11:19Z Peter Amstutz peter.amstutz@curii.com
<ul><li><strong>Target version</strong> set to <i>Future</i></li></ul>