Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=310802015-10-07T19:08:37ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Tracker</strong> changed from <i>Bug</i> to <i>Idea</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/31080/diff?detail_id=30515">diff</a>)</li></ul> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=310812015-10-07T19:15:17ZBrett Smithbrett.smith@curii.com
<ul></ul><p>This can't just be Node Manager's job though, right? The system needs to know what Node Manager is willing to do, but any of these problems can also arise on static clusters that aren't even running Node Manager.</p> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=310822015-10-07T19:23:42ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Yes, that's true. I think the right long term solution is for crunch v2 to combine the jobs of crunch-dispatch and node manager into one process, because otherwise neither process has quite enough information to be able to tell the user what's actually going on.</p>
<p>In the short term, there's still benefit in making incremental improvements to node manager for cloud installs.</p> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=348632016-02-02T17:07:15ZTom Cleggtom@curii.com
<ul></ul><p>It seems like Nodemanager should emit a log (with object_uuid == job uuid) and cancel the job.</p>
<p>If we start telling crunch-dispatch whether nodemanager is running, in cases where nodemanager <em>isn't</em> running, crunch-dispatch could emit a log and cancel the job if it's unsatisfiable with the current set of (alive?) slurm nodes.</p>
<p>Short of running nodemanager on static clusters (add a slurm driver?) it seems like we need the logic in both places if we want to fix the bug in both types of install.</p> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=512012017-04-26T20:38:03ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> set to <i>2017-05-24 sprint</i></li></ul> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=515132017-05-09T19:07:28ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Story points</strong> set to <i>1.0</i></li></ul> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=515142017-05-09T19:12:14ZTom Cleggtom@curii.com
<ul></ul><p>For crunch2, when node manager is not in use, sbatch rejects unsatisfiable jobs and the user gets an error -- however, crunch-dispatch-slurm will keep retrying forever. This infinite-retry problem will be mostly addressed by <a class="issue tracker-1 status-3 priority-4 priority-default closed" title="Bug: [Crunch2] Limit number of dispatch attempts per container (Resolved)" href="https://dev.arvados.org/issues/9688">#9688</a>, but ideally crunch-dispatch-slurm should also recognize the "unsatisfiable job" error as a non-retryable error, and tell the API server that it won't be re-attempted (if crunch-dispatch-slurm assumes/knows that it is the only dispatcher, it can indicate this by cancelling the container).</p> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=516502017-05-10T19:44:32ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> changed from <i>2017-05-24 sprint</i> to <i>2017-06-07 sprint</i></li></ul> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=521032017-05-24T18:59:21ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Assigned To</strong> set to <i>Lucas Di Pentima</i></li></ul> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=524272017-06-07T18:21:56ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Target version</strong> changed from <i>2017-06-07 sprint</i> to <i>2017-06-21 sprint</i></li></ul> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=527672017-06-19T19:52:44ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=528492017-06-21T18:25:33ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Target version</strong> changed from <i>2017-06-21 sprint</i> to <i>2017-07-05 sprint</i></li></ul> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=531852017-07-03T20:14:33ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Updates @ <a class="changeset" title="7475: Cancel jobs that cannot be satisfied instead of endlessly retry to run it. Add a log entry ..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/3dad67f271492790f63e72ffcbba432cf8e00fa5">3dad67f27</a><br />Test run: <a class="external" href="https://ci.curoverse.com/job/developer-run-tests/376/">https://ci.curoverse.com/job/developer-run-tests/376/</a></p>
<p>Modified <code>ServerCalculator.servers_for_queue()</code> so that it also returns a <code>dict</code> with information about unsatisfiable jobs that should be cancelled by its caller.<br />Updated some tests that started failing because of this change.<br />New tests pending.</p> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=531912017-07-05T14:31:34ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>New updates at <a class="changeset" title="7475: Catch exceptions when trying to cancel an unsatisfiable job, logging an error message in ca..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f77d08dd57a1021525717c8669296eb3e463c5f7">f77d08dd5</a><br />Test run: <a class="external" href="https://ci.curoverse.com/job/developer-run-tests/377/">https://ci.curoverse.com/job/developer-run-tests/377/</a></p>
<ul>
<li>Enhanced error checking when trying to emit a log and cancel an unsatisfiable job.</li>
<li>Added test cases.</li>
</ul> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=531982017-07-05T15:56:27ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>7475-nodemgr-unsatisfiable-job-comms @ <a class="changeset" title="7475: Catch exceptions when trying to cancel an unsatisfiable job, logging an error message in ca..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f77d08dd57a1021525717c8669296eb3e463c5f7">f77d08dd57a1021525717c8669296eb3e463c5f7</a></p>
<ul>
<li>In _got_response, the uuid can be either a job or a container. It needs to look at the type field of the uuid. This is only valid if the uuid is for a job:</li>
</ul>
<pre>
self._client.jobs().cancel(uuid=job_uuid).execute()
</pre>
<p>If the uuid is for a container and <code>self.slurm_queue</code> is true, it should do this:</p>
<pre>
subprocess.check_call(['scancel', '--name='+uuid])
</pre>
<p>This may require a stub to ensure that tests don't try to call the real <code>scancel</code>.</p>
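<p>A minimal sketch of the dispatch described above, not the code in the branch. The uuid infixes (<code>8i9sb</code> for jobs, <code>dz642</code> for containers), the function name <code>cancel_unsatisfiable</code>, and the injected <code>cancel_job</code> and <code>run</code> callables are assumptions made for illustration; injecting <code>run</code> is one way to keep tests from calling the real <code>scancel</code>.</p>

```python
import re
import subprocess

# Assumed uuid layouts: <5-char cluster prefix>-<type infix>-<15-char id>.
JOB_UUID = re.compile(r'^[a-z0-9]{5}-8i9sb-[a-z0-9]{15}$')
CONTAINER_UUID = re.compile(r'^[a-z0-9]{5}-dz642-[a-z0-9]{15}$')


def cancel_unsatisfiable(uuid, cancel_job, run=subprocess.check_call,
                         slurm_queue=True):
    """Cancel a job or container, choosing the method from the uuid type.

    cancel_job: callable taking a job uuid (hypothetical wrapper around
        the API client call).
    run: callable taking an argv list; defaults to subprocess.check_call
        and can be replaced by a stub in tests.
    Returns 'job' or 'container' to say which path was taken.
    """
    if JOB_UUID.match(uuid):
        cancel_job(uuid)
        return 'job'
    if CONTAINER_UUID.match(uuid):
        if not slurm_queue:
            raise ValueError('cannot cancel container without slurm: ' + uuid)
        run(['scancel', '--name=' + uuid])
        return 'container'
    raise ValueError('unknown uuid type: ' + uuid)
```

<p>In node manager, <code>cancel_job</code> could wrap the call shown earlier, for example <code>lambda u: self._client.jobs().cancel(uuid=u).execute()</code>.</p>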
<p>I'd like to see an integration test, if it isn't too much work. Upon seeing the log message about an unsatisfiable job/container, it should check that (a) the expected log message was added and (b) the job was cancelled/scancel was called.</p> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=532192017-07-05T18:43:13ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Target version</strong> changed from <i>2017-07-05 sprint</i> to <i>2017-07-19 sprint</i></li></ul> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=532962017-07-06T21:14:19ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Updates at <a class="changeset" title="7475: Check for job unsatisfiable type (job/container) and cancel it using the proper method. Upd..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f507162f3974797b741a0f740b407daefceab0b6">f507162f3</a><br />Test run: <a class="external" href="https://ci.curoverse.com/job/developer-run-tests/378/">https://ci.curoverse.com/job/developer-run-tests/378/</a></p>
<p>Added support for unsatisfiable containers. Updated unit test to cover both cases.<br />Pending: integration test.</p> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=533672017-07-11T14:21:27ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Writing an integration test:</p>
<p>Start by copying "test_single_node_azure".</p>
<p>The format of the test case is (steps, checks, driver, jobs, cloud).</p>
<p>For the first step, instead of <code>set_squeue</code> you'll need a new function like <code>set_queue_unsatisfiable</code>. This should do something like echo '99|100|100|%s|%s' (this would be a job that requests 99 cores).</p>
<p>This function should use <code>update_script</code> to create a stub for <code>scancel</code>. The stub script should do something to record that it was called, like writing a file.</p>
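<p>As a rough sketch of such a stub, assuming a POSIX shell is available: the helper below writes a fake <code>scancel</code> that records its arguments in a marker file. The name <code>write_scancel_stub</code> and its parameters are hypothetical; the real test would use <code>update_script</code> instead of writing the file directly.</p>

```python
import os
import stat


def write_scancel_stub(bin_dir, marker_path):
    """Write a fake `scancel` into bin_dir and return its path.

    When run, the stub writes its arguments (one per line) to
    marker_path, so a test can later confirm that it was called.
    Assumes marker_path contains no single quotes.
    """
    stub_path = os.path.join(bin_dir, 'scancel')
    script = ("#!/bin/sh\n"
              "printf '%s\\n' \"$@\" > '" + marker_path + "'\n")
    with open(stub_path, 'w') as f:
        f.write(script)
    # Make the stub executable for its owner.
    mode = os.stat(stub_path).st_mode
    os.chmod(stub_path, mode | stat.S_IXUSR)
    return stub_path
```

<p>Putting <code>bin_dir</code> at the front of <code>PATH</code> for the process under test makes it pick up the stub instead of the real command.</p>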
<p>The next line should have a regex to match the error message that node manager emits when the job can't be satisfied.<br />The step should call a function that checks the API server's logs table to confirm that the right log message was added.<br />It should also check for the presence of the file that indicates scancel was called. The function should return 0 for success and 1 for failure.</p>
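<p>The check could look something like the sketch below. It assumes the log texts have already been fetched from the API server (the lookup itself is left out), and the names <code>check_unsatisfiable_cancelled</code>, <code>marker_path</code>, <code>log_texts</code> and <code>expected_fragment</code> are hypothetical.</p>

```python
import os


def check_unsatisfiable_cancelled(marker_path, log_texts, expected_fragment):
    """Return 0 if both conditions hold, 1 otherwise.

    marker_path: file that the scancel stub creates when it is called.
    log_texts: iterable of log message strings for the job/container,
        assumed to have been read from the API server's logs table.
    expected_fragment: text that the unsatisfiable-job log must contain.
    """
    # (b) the stub must have recorded a call to scancel
    if not os.path.isfile(marker_path):
        return 1
    # (a) the expected log message must have been added
    if not any(expected_fragment in text for text in log_texts):
        return 1
    return 0
```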
<p>That's it. You don't need any other steps. For checks (a match counts as a failure), you might want to have "Cloud node is now paired ..." as a negative check.</p> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=533732017-07-11T17:10:56ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Updates at <a class="changeset" title="7475: Added integration test that checks for scancel to be called and a log entry added to an uns..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/7d4a10bcc197e909a7a9d5aeb4ba18c91a218976">7d4a10bcc</a></p>
<p>Added integration test following the above instructions.</p> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=535702017-07-19T18:46:54ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Target version</strong> changed from <i>2017-07-19 sprint</i> to <i>2017-08-02 sprint</i></li></ul> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=537542017-07-27T15:29:30ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><pre>
Start test_hit_quota
test_hit_quota passed
Start test_multiple_nodes
Traceback (most recent call last):
  File "tests/integration_test.py", line 441, in <module>
    main()
  File "tests/integration_test.py", line 431, in main
    code += run_test(t, *tests[t])
  File "tests/integration_test.py", line 244, in run_test
    shutil.rmtree(os.path.dirname(unsatisfiable_job_scancelled))
  File "/usr/lib/python2.7/shutil.py", line 239, in rmtree
    onerror(os.listdir, path, sys.exc_info())
  File "/usr/lib/python2.7/shutil.py", line 237, in rmtree
    names = os.listdir(path)
OSError: [Errno 2] No such file or directory: '/tmp/tmp59u2RS'
</pre>
<p>I think you want <code>global unsatisfiable_job_scancelled</code> and then create the tempdir in <code>run_test()</code></p> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=537962017-07-29T15:05:30ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Sorry, I thought I tested it before pushing.</p>
<p>Updated at <a class="changeset" title="7475: Fixed integration test Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <lucas@curoverse.com>" href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/3e46aaf6469db111d549a9a5058f3ee4926e0200">3e46aaf64</a><br />Test run: <a class="external" href="https://ci.curoverse.com/job/developer-run-tests/406/">https://ci.curoverse.com/job/developer-run-tests/406/</a></p> Arvados - Idea #7475: [Node manager] Better communication when job is unsatisfiablehttps://dev.arvados.org/issues/7475?journal_id=538002017-07-31T14:55:05ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>Applied in changeset arvados|commit:c0e203e7f3e9e40736eac63cbe440d5e46e379c0.</p>