Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=979422021-10-26T18:05:08ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=979432021-10-26T18:06:11ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Category</strong> set to <i>Crunch</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/97943/diff?detail_id=94588">diff</a>)</li><li><strong>Subject</strong> changed from <i>Feedback when container can't be scheduled</i> to <i>Feedback when container can't be scheduled on LSF</i></li></ul> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=979442021-10-26T18:06:19ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>New</i></li><li><strong>Category</strong> changed from <i>Crunch</i> to <i>0</i></li></ul> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=979452021-10-26T18:06:31ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Release</strong> set to <i>45</i></li><li><strong>Category</strong> deleted (<del><i>0</i></del>)</li></ul> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=980012021-10-27T14:49:36ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Target version</strong> changed from <i>2021-11-10 sprint</i> to <i>2021-11-24 sprint</i></li></ul> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=983632021-11-09T20:25:50ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Tracker</strong> changed from <i>Bug</i> to <i>Feature</i></li></ul> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=984332021-11-10T16:22:11ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Assigned To</strong> set to <i>Tom Clegg</i></li></ul> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=984972021-11-12T16:05:37ZTom Cleggtom@curii.com
<ul></ul><p>Hm.</p>
<p>Here is <code>bjobs -UF</code> ("unformatted") for a job that will run when another job finishes:</p>
<pre>
Job &lt;22701&gt;, Job Name &lt;aaaaa-aaaaa-aaaaaaaaaab&gt;, User &lt;tom&gt;, Project &lt;default&gt;, Status &lt;PEND&gt;, Queue &lt;normal&gt;, Command &lt;sleep 120&gt;
Fri Nov 12 15:54:26: Submitted from host &lt;9tee4.arvadosapi.com&gt;, CWD &lt;$HOME&gt;, Requested Resources &lt;rusage[mem=8000.00] span[hosts=1]&gt;;
PENDING REASONS:
Job requirements for reserving resource (mem) not satisfied: 2 hosts;
SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     -    -     -   -   -   -   -   -    -    -
loadStop   -     -    -     -   -   -   -   -   -    -    -
RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg] rusage[mem=8000.00] span[hosts=1]
Effective: -
</pre>
<p>Here is a job that will never run because there are no nodes with this much memory:</p>
<pre>
Job &lt;22735&gt;, Job Name &lt;aaaaa-aaaaa-aaaaaaaaaag&gt;, User &lt;tom&gt;, Project &lt;default&gt;, Status &lt;PEND&gt;, Queue &lt;normal&gt;, Command &lt;sleep 120&gt;
Fri Nov 12 15:57:44: Submitted from host &lt;9tee4.arvadosapi.com&gt;, CWD &lt;$HOME&gt;, Requested Resources &lt;rusage[mem=8000000.00] span[hosts=1]&gt;;
PENDING REASONS:
Job requirements for reserving resource (mem) not satisfied: 2 hosts;
SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     -    -     -   -   -   -   -   -    -    -
loadStop   -     -    -     -   -   -   -   -   -    -    -
RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg] rusage[mem=8000000.00] span[hosts=1]
Effective: -
</pre> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=984982021-11-12T16:14:10ZTom Cleggtom@curii.com
<ul></ul><p>If we add <code>-R 'select[mem>8000000M]'</code> to the bsub arguments, <code>bjobs -UF</code> says "New job is waiting for scheduling" for a while, then:</p>
<pre>
Job &lt;22930&gt;, Job Name &lt;aaaaa-aaaaa-aaaaaaaaaah&gt;, User &lt;tom&gt;, Project &lt;default&gt;, Status &lt;PEND&gt;, Queue &lt;normal&gt;, Command &lt;sleep 120&gt;
Fri Nov 12 16:11:05: Submitted from host &lt;9tee4.arvadosapi.com&gt;, CWD &lt;$HOME&gt;, Requested Resources &lt; select[mem&gt;8000000.00] rusage[mem=8000000.00] span[hosts=1]&gt;;
PENDING REASONS:
There are no suitable hosts for the job;
SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched  -     -    -     -   -   -   -   -   -    -    -
loadStop   -     -    -     -   -   -   -   -   -    -    -
RESOURCE REQUIREMENT DETAILS:
Combined: select[(mem&gt;8000000.00) &amp;&amp; (type == local)] order[r15s:pg] rusage[mem=8000000.00] span[hosts=1]
Effective: -
</pre> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=987692021-11-18T16:30:41ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=987822021-11-18T21:50:18ZTom Cleggtom@curii.com
<ul></ul><p>18298-lsf-no-suitable-hosts @ <a class="changeset" title="18298: Use bjobs select[] args, cancel on &quot;no suitable host&quot;. Arvados-DCO-1.1-Signed-off-by: Tom..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f6e8d7c2cada1570bac3e98f0712ad8651b8d9fd">f6e8d7c2cada1570bac3e98f0712ad8651b8d9fd</a> -- <a class="external" href="https://ci.arvados.org/view/Developer/job/developer-run-tests/2810/">developer-run-tests: #2810 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&amp;build=2810" alt="" /></a></p>
<p>If LSF reports the job status is PEND and the reason contains the magic string "There are no suitable hosts for the job", cancel the container and copy the reason text into <code>runtime_status["errors"]</code>.</p> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=988362021-11-19T21:34:25ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
<p>18298-lsf-no-suitable-hosts @ <a class="changeset" title="18298: Use bjobs select[] args, cancel on &quot;no suitable host&quot;. Arvados-DCO-1.1-Signed-off-by: Tom..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f6e8d7c2cada1570bac3e98f0712ad8651b8d9fd">f6e8d7c2cada1570bac3e98f0712ad8651b8d9fd</a> -- <a class="external" href="https://ci.arvados.org/view/Developer/job/developer-run-tests/2810/">developer-run-tests: #2810 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&amp;build=2810" alt="" /></a></p>
<p>If LSF reports the job status is PEND and the reason contains the magic string "There are no suitable hosts for the job", cancel the container and copy the reason text into <code>runtime_status["errors"]</code>.</p>
</blockquote>
<ul>
<li>Getting json structured output out of <code>bjobs</code> is nice.</li>
<li>It seems unfortunate that in the first case it tells you what resource can't be reserved (mem), but not in the second case.</li>
<li>Were you able to test this branch on 9tee4 already, or should we plan to do a quick manual test of it after it is merged and auto-deployed to 9tee4?</li>
</ul> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=988422021-11-19T22:28:58ZTom Cleggtom@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<ul>
<li>It seems unfortunate that in the first case it tells you what resource can't be reserved (mem), but not in the second case.</li>
</ul>
</blockquote>
<p>Yeah, it seems a bit backwards.</p>
<blockquote>
<ul>
<li>Were you able to test this branch on 9tee4 already, or should we plan to do a quick manual test of it after it is merged and auto-deployed to 9tee4?</li>
</ul>
</blockquote>
<p>I have not tested it with real LSF. Doing that after auto-deploy sounds worthwhile, yes.</p> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=988572021-11-22T16:59:03ZTom Cleggtom@curii.com
<ul><li><strong>File</strong> <a href="/attachments/2918">test18298-9tee4.png</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/2918/test18298-9tee4.png">test18298-9tee4.png</a> added</li></ul><p><a href="https://arvadosapi.com/9tee4-xvhdp-fjb1ctlvsbtn5dk">9tee4-xvhdp-fjb1ctlvsbtn5dk</a> / <a href="https://arvadosapi.com/9tee4-dz642-u3hjqy0agtl3jya">9tee4-dz642-u3hjqy0agtl3jya</a></p>
<p><img src="https://dev.arvados.org/attachments/download/2918/test18298-9tee4.png" alt="" /></p> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=988582021-11-22T17:05:51ZTom Cleggtom@curii.com
<ul></ul>Couple of caveats:
<ul>
<li>Thanks to auto-retry after cancel, this has to happen 3 times before giving up on the container request</li>
<li>If the admin has configured an explicit BsubArgumentsList based on an old config.default.yml file, the new arguments won't get passed to bsub and they will get the old "queued forever" behavior. Should we reconsider the previous (pre <a class="issue tracker-1 status-3 priority-4 priority-default closed" title="Bug: [LSF] Add &quot;/host&quot; to rusage strings (Resolved)" href="https://dev.arvados.org/issues/18290">#18290</a>) approach of appending the configured args to the defaults rather than replacing the defaults? Or just add an upgrade note?</li>
</ul> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=988592021-11-22T18:04:22ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
Couple of caveats:
<ul>
<li>Thanks to auto-retry after cancel, this has to happen 3 times before giving up on the container request</li>
</ul>
</blockquote>
<p>This seems inelegant but not actually a problem?</p>
<blockquote>
<ul>
<li>If the admin has configured an explicit BsubArgumentsList based on an old config.default.yml file, the new arguments won't get passed to bsub and they will get the old "queued forever" behavior. Should we reconsider the previous (pre <a class="issue tracker-1 status-3 priority-4 priority-default closed" title="Bug: [LSF] Add &quot;/host&quot; to rusage strings (Resolved)" href="https://dev.arvados.org/issues/18290">#18290</a>) approach of appending the configured args to the defaults rather than replacing the defaults? Or just add an upgrade note?</li>
</ul>
</blockquote>
<p>Upgrade note is fine. I don't know if anyone is actually using LSF yet. We can communicate about the upgrade to customers.</p> Arvados - Feature #18298: Feedback when container can't be scheduled on LSFhttps://dev.arvados.org/issues/18298?journal_id=988622021-11-22T18:46:47ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><blockquote><blockquote>
<ul>
<li>Thanks to auto-retry after cancel, this has to happen 3 times before giving up on the container request</li>
</ul>
</blockquote>
<p>This seems inelegant but not actually a problem?</p>
</blockquote>
<p>Right.</p>
<blockquote>
<p>Upgrade note is fine. I don't know if anyone is actually using LSF yet. We can communicate about the upgrade to customers.</p>
</blockquote>
<p>Added upgrade note &amp; merged.</p>
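<p>For readers skimming the thread, the check discussed above (status is PEND and the pending reason contains "There are no suitable hosts for the job") might be sketched roughly as follows. This is an illustration in Python, not the merged code: the JSON layout (a <code>RECORDS</code> list with <code>JOB_NAME</code>, <code>STAT</code> and <code>PEND_REASON</code> keys), the function names, and the <code>"Cancelled"</code> state value in the update are all assumptions made for the example.</p>

```python
import json

# The "magic string" quoted in the thread.
NO_SUITABLE_HOSTS = "There are no suitable hosts for the job"

# Hypothetical sample of structured bjobs output; the field names are
# assumptions for this sketch, the reason texts are copied from the thread.
SAMPLE = json.dumps({"RECORDS": [
    {"JOB_NAME": "aaaaa-aaaaa-aaaaaaaaaab", "STAT": "PEND",
     "PEND_REASON": "Job requirements for reserving resource (mem) "
                    "not satisfied: 2 hosts;"},
    {"JOB_NAME": "aaaaa-aaaaa-aaaaaaaaaah", "STAT": "PEND",
     "PEND_REASON": "There are no suitable hosts for the job;"},
]})


def unschedulable_jobs(bjobs_json):
    """Return {job_name: reason} for pending jobs that can never be placed."""
    doomed = {}
    for rec in json.loads(bjobs_json).get("RECORDS", []):
        reason = rec.get("PEND_REASON", "")
        # Only a pending job whose reason carries the magic string qualifies;
        # "reserving resource (mem) not satisfied" may just mean "wait".
        if rec.get("STAT") == "PEND" and NO_SUITABLE_HOSTS in reason:
            doomed[rec["JOB_NAME"]] = reason.strip()
    return doomed


def cancel_update(reason):
    """Build a container update that cancels and records the reason text.

    The "state" value is an assumption; the thread only says the reason is
    copied into runtime_status["errors"].
    """
    return {"state": "Cancelled", "runtime_status": {"errors": reason}}


if __name__ == "__main__":
    for name, reason in unschedulable_jobs(SAMPLE).items():
        print(name, cancel_update(reason))
```

<p>Note that the first sample job is left alone: as the earlier <code>bjobs -UF</code> output shows, the "reserving resource (mem)" reason appears both for jobs that will eventually run and for jobs that never will, so only the "no suitable hosts" reason is treated as fatal here.</p>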