Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=611832018-03-21T19:59:20ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Here's the relevant sbatch feature:</p>
<p>-t, --time=&lt;time&gt;<br /> Set a limit on the total run time of the job allocation. If the requested time limit exceeds the partition's time limit, the job will be left in a PENDING state (possibly indefinitely). The default time limit is the partition's default time limit. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL. The interval between signals is specified by the Slurm configuration parameter KillWait. The OverTimeLimit configuration parameter may permit the job to run longer than scheduled. Time resolution is one minute and second values are rounded up to the next minute.</p>
<p>A time limit of zero requests that no time limit be imposed. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".</p> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=611852018-03-21T20:01:18ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Also:</p>
<p>--time-min=&lt;time&gt;<br /> Set a minimum time limit on the job allocation. If specified, the job may have its --time limit lowered to a value no lower than --time-min if doing so permits the job to begin execution earlier than otherwise possible. The job's time limit will not be changed after the job is allocated resources. This is performed by a backfill scheduling algorithm to allocate resources otherwise reserved for higher priority jobs. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".</p> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=622982018-05-02T15:28:35ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Subject</strong> changed from <i>Allow user to specify resource limits for submitted jobs</i> to <i>Allow user to specify time limit for submitted jobs</i></li><li><strong>Description</strong> updated (<a title="View differences" href="/journals/62298/diff?detail_id=59337">diff</a>)</li></ul> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=623032018-05-02T16:14:41ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Story points</strong> set to <i>3.0</i></li></ul> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=623322018-05-02T17:06:51ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> changed from <i>To Be Groomed</i> to <i>Arvados Future Sprints</i></li></ul> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=636422018-06-20T16:02:17ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> changed from <i>Arvados Future Sprints</i> to <i>2018-07-03 Sprint</i></li></ul> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=636442018-06-20T16:03:47ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Assigned To</strong> set to <i>Lucas Di Pentima</i></li></ul> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=636462018-06-20T16:06:33ZTom Cleggtom@curii.com
<ul></ul><p>Might be better to implement this in sdk/go/dispatch / crunch-dispatch-slurm rather than asking slurm to do it. It sounds like slurm will send SIGKILL to crunch-run after KillWait, which can prevent logs and (partial) outputs from being written to Keep.</p>
<p>Also suggest a more specific name than "time_limit", which (as a scheduling parameter) sounds like it could also mean queue time or queue+run time. "max_run_time"?</p> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=637182018-06-21T19:59:52ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
<p>Might be better to implement this in sdk/go/dispatch / crunch-dispatch-slurm rather than asking slurm to do it. It sounds like slurm will send SIGKILL to crunch-run after KillWait, which can prevent logs and (partial) outputs from being written to Keep.</p>
</blockquote>
<p>I think we want both. I agree we should prefer a graceful shutdown. However, slurm uses the time limit for its backfill scheduler. I don't think we really care on cloud, but it is relevant for HPC. Maybe the slurm time limit should have some extra headroom.</p>
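<p>The headroom idea can be sketched as follows. This is a minimal illustration only, not code from crunch-dispatch-slurm: the function names and the 300-second default headroom are assumptions. It relies on two points from the sbatch text quoted earlier in the thread: a plain number of minutes is an accepted <code>--time</code> format, and Slurm's time resolution is one minute.</p>

```python
import math


def slurm_time_limit_minutes(max_run_time_s, headroom_s=300):
    """Return a whole-minute value suitable for sbatch --time.

    max_run_time_s: container run-time limit in seconds (0 means no limit).
    headroom_s: hypothetical allowance for setup and for writing logs/outputs.
    """
    if max_run_time_s <= 0:
        # No run-time limit requested, so there is nothing to pass to Slurm.
        return 0
    # Slurm's resolution is one minute, so round up rather than truncate.
    return math.ceil((max_run_time_s + headroom_s) / 60)


def sbatch_time_args(max_run_time_s, headroom_s=300):
    """Build extra sbatch arguments; an empty list leaves the partition default."""
    minutes = slurm_time_limit_minutes(max_run_time_s, headroom_s)
    return ["--time={}".format(minutes)] if minutes else []
```

<p>With these assumptions, a one-hour limit plus the default headroom becomes <code>--time=65</code>, while a container without a limit gets no <code>--time</code> argument at all.</p>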
<blockquote>
<p>Also suggest a more specific name than "time_limit", which (as a scheduling parameter) sounds like it could also mean queue time or queue+run time. "max_run_time"?</p>
</blockquote> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=637722018-06-26T14:00:51ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=637732018-06-26T15:16:45ZTom Cleggtom@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<p>I think we want both. I agree we should prefer a graceful shutdown. However, slurm uses the time limit for its backfill scheduler. I don't think we really care on cloud, but it is relevant for HPC. Maybe the slurm time limit should have some extra headroom.</p>
</blockquote>
<p>I see -- if we give slurm a time limit, it can make better scheduling decisions. But what would be the appropriate amount of time to allow for writing logs/outputs? It might be better to offer the "abandon the job completely, even if that means abandoning logs/outputs of a successful run" behavior with a separate knob. It seems like that's the only kind of limit slurm could use for scheduling purposes.</p>
<p>The objective isn't mentioned explicitly here but I think it's to reduce the cost of user containers that sometimes deadlock, or have pathologically low resource usage (e.g., arv-mount cache thrashing).</p>
<p>IMO we should implement this in a way that isn't slurm-specific at all, and avoid introducing other side effects like killing crunch-run while it's wrapping up.</p> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=637862018-06-27T13:21:05ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>(WIP) updates at <a class="changeset" title="13219: Checks for expired run time and cancel container if needed. Arvados-DCO-1.1-Signed-off-by..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/57fd9fa6bf0ee3062d7d38aceb7e97543791d241">57fd9fa6b</a> - branch <code>13219-jobs-time-limit</code></p>
<p>Tom: Do you think the updates to <code>dispatch.go</code> in this commit are the correct approach? I want to check with you before I start writing tests.</p> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=637882018-06-27T13:52:27ZTom Cleggtom@curii.com
<ul></ul>Now that I see the logging awkwardness and the missing "start time" information, I'm thinking this would be simpler to implement in crunch-run:
<ul>
<li>easy to log the "max runtime exceeded" message live + to the permanent log</li>
<li>crunch-run already knows the container start time, so we don't have to start tracking that separately</li>
<li>WaitFinish() can just make a time.NewTimer() and add a section to the select block, similar to the "arv-mount exited" case.</li>
</ul>
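<p>The "timer next to the container-exit case" idea from the list above can be illustrated roughly like this. crunch-run itself is written in Go and would use <code>time.NewTimer()</code> inside a <code>select</code>; the Python version below is only an analogue, and <code>wait_finish</code>, <code>container_done</code> and the return strings are hypothetical names chosen for the sketch.</p>

```python
import threading


def wait_finish(container_done, max_run_time_s=0):
    """Block until the container exits or the run-time limit passes.

    container_done: threading.Event set by whatever watches the container.
    max_run_time_s: limit in seconds; 0 (the default) means no limit.
    Returns "finished" or "max_run_time exceeded".
    """
    timeout = max_run_time_s if max_run_time_s > 0 else None
    # Event.wait returns True if the event was set and False on timeout,
    # which plays the role of the two select cases in the Go version.
    if container_done.wait(timeout):
        return "finished"
    return "max_run_time exceeded"
```

<p>On the timeout branch the caller would log the message and stop the container, so the normal wrap-up path that saves logs and outputs still runs.</p>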
<p>Other comment: scheduling_parameters should continue to be empty by default, in both container and container_request.</p> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=638912018-07-02T17:07:26ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Updates at <a class="changeset" title="13219: Adds TimeLimit support on arvados-cwl-runner Arvados-DCO-1.1-Signed-off-by: Lucas Di Pent..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/1f9519fba9a34f2a596c683ed6395b2e291935b7">1f9519fba</a><br />Test run: <a class="external" href="https://ci.curoverse.com/job/developer-run-tests/780/">https://ci.curoverse.com/job/developer-run-tests/780/</a></p>
<ul>
<li>Removed unnecessary default scheduling_parameter on the API server</li>
<li>Moved the timeout code from the <code>dispatch</code> library to <code>crunch-run</code></li>
<li>Added CWL <code>TimeLimit</code> support on <code>arvados-cwl-runner</code></li>
</ul> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=638932018-07-02T18:03:32ZTom Cleggtom@curii.com
<ul></ul><p>Can the cwl test assertion be more focused? Diffing the whole submission seems like a bad habit that makes it hard to diagnose failing tests. Looking at test_initial_work_dir() I'm guessing we can do something like this:</p>
<pre><code class="python syntaxhl"><span class="n">_</span><span class="p">,</span> <span class="n">kwargs</span> <span class="o">=</span> <span class="n">runner</span><span class="p">.</span><span class="n">api</span><span class="p">.</span><span class="n">container_requests</span><span class="p">().</span><span class="n">create</span><span class="p">.</span><span class="n">call_args</span>
<span class="bp">self</span><span class="p">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="mi">42</span><span class="p">,</span> <span class="n">kwargs</span><span class="p">[</span><span class="s">'body'</span><span class="p">][</span><span class="s">'scheduling_parameters'</span><span class="p">].</span><span class="n">get</span><span class="p">(</span><span class="s">'max_run_time'</span><span class="p">))</span>
</code></pre>
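<p>For readers unfamiliar with <code>call_args</code>, here is a self-contained sketch of the same assertion pattern. The <code>runner</code> below is a bare <code>MagicMock</code> standing in for the real test fixture, and the submitted body is made up for the example; only the final two lines mirror the suggestion above.</p>

```python
from unittest import mock

# Stand-in for the runner under test; a real test would use the
# arvados-cwl-runner test fixtures instead of this bare MagicMock.
runner = mock.MagicMock()

# Simulate the submission that the code under test would make.
runner.api.container_requests().create(
    body={"scheduling_parameters": {"max_run_time": 42}})

# call_args unpacks as an (args, kwargs) pair for the most recent call.
_, kwargs = runner.api.container_requests().create.call_args
max_run_time = kwargs["body"]["scheduling_parameters"].get("max_run_time")
assert max_run_time == 42
```

<p>Checking one field this way means a failure points directly at <code>max_run_time</code> instead of at a diff of the whole submission.</p>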
<p>The rest LGTM, thanks</p> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=638962018-07-02T18:45:47ZLucas Di Pentimalucas.dipentima@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p>Applied in changeset <a class="changeset" title="Merge branch '13219-jobs-time-limit' Closes #13219 Arvados-DCO-1.1-Signed-off-by: Lucas Di Penti..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/816764a283c2cbf2d41b4582113065922b99bd52">arvados|816764a283c2cbf2d41b4582113065922b99bd52</a>.</p> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=639852018-07-03T19:42:22ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Update at <a class="changeset" title="13219: Adds TimeLimit support on Arvados CWL schema. Arvados-DCO-1.1-Signed-off-by: Lucas Di Pen..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/380e4da5aab5d24d0e90ea27880974c232538fbf">380e4da5a</a> - branch <code>13219-arv-cwl-schema-fix</code><br />Test run: <a class="external" href="https://ci.curoverse.com/job/developer-run-tests/783/">https://ci.curoverse.com/job/developer-run-tests/783/</a></p>
<p>Arvados CWL schema updated -- now <code>arvados-cwl-runner</code> accepts the TimeLimit parameter. Example:</p>
<pre>
class: CommandLineTool
cwlVersion: v1.0
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"
inputs: []
outputs: []
requirements:
  cwltool:TimeLimit:
    timelimit: 5
baseCommand: [sleep, "30"]
</pre> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=640272018-07-05T17:04:06ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Lucas Di Pentima wrote:</p>
<blockquote>
<p>Update at <a class="changeset" title="13219: Adds TimeLimit support on Arvados CWL schema. Arvados-DCO-1.1-Signed-off-by: Lucas Di Pen..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/380e4da5aab5d24d0e90ea27880974c232538fbf">380e4da5a</a> - branch <code>13219-arv-cwl-schema-fix</code><br />Test run: <a class="external" href="https://ci.curoverse.com/job/developer-run-tests/783/">https://ci.curoverse.com/job/developer-run-tests/783/</a></p>
<p>Arvados CWL schema updated -- now <code>arvados-cwl-runner</code> accepts the TimeLimit parameter. Example:</p>
<p>[...]</p>
</blockquote>
<p>This needs to be documented in <a class="external" href="http://doc.arvados.org/user/cwl/cwl-extensions.html">http://doc.arvados.org/user/cwl/cwl-extensions.html</a></p> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=640282018-07-05T17:07:27ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
<p>Peter Amstutz wrote:</p>
<blockquote>
<p>I think we want both. I agree we should prefer a graceful shutdown. However, slurm uses the time limit for its backfill scheduler. I don't think we really care on cloud, but it is relevant for HPC. Maybe the slurm time limit should have some extra headroom.</p>
</blockquote>
<p>I see -- if we give slurm a time limit, it can make better scheduling decisions. But what would be the appropriate amount of time to allow for writing logs/outputs? It might be better to offer the "abandon the job completely, even if that means abandoning logs/outputs of a successful run" behavior with a separate knob. It seems like that's the only kind of limit slurm could use for scheduling purposes.</p>
<p>The objective isn't mentioned explicitly here but I think it's to reduce the cost of user containers that sometimes deadlock, or have pathologically low resource usage (e.g., arv-mount cache thrashing).</p>
<p>IMO we should implement this in a way that isn't slurm-specific at all, and avoid introducing other side effects like killing crunch-run while it's wrapping up.</p>
</blockquote>
<p>I agree that doing it in crunch-run so that it only applies to the runtime of the actual job (and not the setup/teardown overhead) is the right way to do it, but I still think crunch-dispatch-slurm should also be setting a SLURM time limit for scheduling, with some configurable amount of head room. Perhaps we should reach out to our HPC users and see what they think?</p> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=640292018-07-05T17:33:33ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Update at <a class="changeset" title="13219: Adds documentation Arvados-DCO-1.1-Signed-off-by: Lucas Di Pentima <ldipentima@veritasgen..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/9b6abcd0448567146b471ad02162d33fd4b1d5a8">9b6abcd04</a></p>
<p>Adds documentation to CWL extension user's guide page.</p> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=640332018-07-05T17:46:49ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-6 status-1 priority-4 priority-default" href="/issues/13760">Idea #13760</a>: Provide more information to SLURM to make scheduling decisions on HPC</i> added</li></ul> Arvados - Idea #13219: Allow user to specify time limit for submitted jobshttps://dev.arvados.org/issues/13219?journal_id=647252018-07-23T18:41:45ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Release</strong> set to <i>13</i></li></ul>