https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422018-03-28T15:52:41ZArvadosArvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=613702018-03-28T15:52:41ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> set to <i>2018-04-11 Sprint</i></li></ul><p>Perhaps investigate introducing concurrency.</p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=613802018-03-28T16:21:09ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Assigned To</strong> set to <i>Peter Amstutz</i></li></ul> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=615012018-04-02T20:40:56ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>This create request has "duration":350.31,"view":0.34,"db":116.21.</p>
<p>This seems a little excessive. Increasing the number of submissions in flight might jam up the database without a significant impact on throughput.</p>
<p>Based on logs alone I can't account for all that latency.</p>
<pre>
{
"method": "POST",
"path": "/arvados/v1/container_requests",
"format": "json",
"controller": "Arvados::V1::ContainerRequestsController",
"action": "create",
"status": 200,
"duration": 350.31,
"view": 0.34,
"db": 116.21,
"request_id": "req-hc9pcexvfftojkmv4j70",
"client_ipaddr": "172.17.0.1",
"client_auth": "<a href="https://arvadosapi.com/2tlax-gj3su-sq60vfd0xzi13wh">2tlax-gj3su-sq60vfd0xzi13wh</a>",
"params": {
"scheduling_parameters": {},
"runtime_constraints": {
"vcpus": 1,
"ram": 1073741824
},
"use_existing": true,
"secret_mounts": {},
"mounts": {
"/tmp": {
"kind": "tmp",
"capacity": 1073741824
},
"/var/spool/cwl": {
"kind": "tmp",
"capacity": 1073741824
}
},
"owner_uuid": null,
"properties": {},
"output_ttl": 0,
"environment": {
"HOME": "/var/spool/cwl",
"TMPDIR": "/tmp"
},
"name": "step2_4",
"priority": 500,
"state": "Committed",
"command": [
"/bin/sh",
"-c",
"echo starting && sleep 194 && echo the number of the day is 4"
],
"container_image": "arvados/jobs",
"output_path": "/var/spool/cwl",
"cwd": "/var/spool/cwl",
"alt": "json",
"container_request": {
"owner_uuid": null,
"name": "step2_4",
"properties": {},
"state": "Committed",
"mounts": {
"/tmp": {
"kind": "tmp",
"capacity": 1073741824
},
"/var/spool/cwl": {
"kind": "tmp",
"capacity": 1073741824
}
},
"runtime_constraints": {
"vcpus": 1,
"ram": 1073741824
},
"container_image": "arvados/jobs",
"environment": {
"HOME": "/var/spool/cwl",
"TMPDIR": "/tmp"
},
"cwd": "/var/spool/cwl",
"command": [
"/bin/sh",
"-c",
"echo starting && sleep 194 && echo the number of the day is 4"
],
"output_path": "/var/spool/cwl",
"priority": 500,
"use_existing": true,
"scheduling_parameters": {},
"output_ttl": 0
}
},
"@timestamp": "2018-04-02T20:26:45.095963550Z",
"@version": "1",
"message": "[200] POST /arvados/v1/container_requests (Arvados::V1::ContainerRequestsController#create)"
}
</pre> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=615942018-04-05T16:16:42ZTom Cleggtom@curii.com
<ul></ul>While I also like the idea of making the "create" call faster, I think we should start by adding concurrency to arvados-cwl-runner.
<ul>
<li>Adding concurrency to arvados-cwl-runner is necessary to support large workflows efficiently, no matter how fast we make the "create container request" operation. (When API server is working well, request turnaround time is limited by network latency and single-threaded Ruby code.)</li>
<li>Limiting concurrency to 1 in each arvados-cwl-runner process is not an effective way to avoid overloading an API server with more requests than it can handle. This can only be addressed on the API side.</li>
<li>a-c-r must behave well when API server is processing requests slowly -- whether that's because this a-c-r process is overloading API, or something else is overloading API.</li>
</ul>
<p>Ideally API server would tell clients where to cap concurrent connections, so it could be site-configurable and/or dynamic in response to load, but a command-line option (defaulting to 1? 4? ncpus?) would be a good start.</p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=615952018-04-05T16:29:52ZTom Cleggtom@curii.com
<ul></ul><p>Even now, the timing figures in note-3 suggest we spend 2x as much time in Ruby as we do waiting for the DB, which seems to mean the database won't even be busy with one query at a time until we are doing at least 3 concurrent API requests. PostgreSQL is good at concurrent queries, so I expect even that example (dev?) site can do way more than 3 efficiently.</p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=615982018-04-05T17:42:39ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>So the main reason I started looking into it was that on my development install, with a-c-r submitting a 100-way scatter of container requests with no parallelization, with "top" I was seeing four maxed out ruby processes.</p>
<p>But that's probably not representative of production, and clearly we want parallel queries to overcome other sources of latency. But I thought it was notable that transactions involving container requests take 3x-6x as long as transactions involving collections (for example).</p>
<p>I'll get started on parallel requests.</p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=616072018-04-05T19:31:36ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Notes to self:</p>
<ul>
<li>Need to use ThreadSafeApiCache</li>
<li>Spin off ArvadosContainer.run() into its own thread (or queue or threadpool)</li>
<li>Need to keep a "pending" counter of requests in flight, decrement when container uuid is entered into process table (add methods to ArvCwlRunner instead of changing processes directly)</li>
<li>SecretStore is not threadsafe but it doesn't matter, secrets are only entered during initialization, after that it is read only</li>
<li>CollectionFsAccess should be threadsafe via CollectionCache</li>
<li>Need to take workflow evaluation lock (ArvCwlRunner.cond) around output_callback</li>
<li>arv_docker_get_image needs to be support concurrent invocation without stepping on itself</li>
</ul> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=616302018-04-06T17:39:46ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>13108-cwl-parallel-submit @ <a class="changeset" title="13108: Fix tests. Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <pamstutz@veritasgenetics.com>" href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/091c2bc58ff1a8d1c43abf0e334837d8872b914d">091c2bc58ff1a8d1c43abf0e334837d8872b914d</a></p>
<p><a class="external" href="https://ci.curoverse.com/job/developer-run-tests/681/">https://ci.curoverse.com/job/developer-run-tests/681/</a></p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=616612018-04-09T17:02:57ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=616722018-04-09T19:40:50ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>Here're some comments/questions:</p>
<ul>
<li>File <code>sdk/python/arvados/commands/keepdocker.py</code>:
<ul>
<li>Lines 425-429: For the sake of having less lines of code, could we say something like <code>parent_project_uuid = args.project_uuid or api.users().current().execute(…)</code> ?</li>
</ul>
</li>
<li>File <code>sdk/cwl/arvados_cwl/__init__.py</code>:
<ul>
<li>Lines 168, 169: I think we could save 1 indentation level by putting all those conditionals on a single <code>if</code> statement</li>
<li>Lines 181, 182: Should those be inside the <code>with:</code> block? If not, why is it necessary to use a lock to just get a value from a dict?</li>
<li>Line 389: Why passing <code>thread_count</code> through cwltool/arv_executor instead of just using it at <code>ArvCwlRunner</code> instantiation?</li>
<li>Line 554: Small typo ‘pendingjobs’ on log message</li>
</ul>
</li>
<li>Would it be convenient to write some tests for <code>TaskQueue</code>? It’s written as it would be useful to other modules as needed, so I think it would be nice to guarantee that’s working as expected.</li>
<li>File <code>sdk/cwl/arvados_cwl/task_queue.py</code>:
<ul>
<li>Line 45: Should <code>task()</code> be inside a <code>try: except:</code> block?</li>
<li>Line 51: I think that <code>task_queue.empty()</code> check is unnecessary, <code>get()</code> will still raise a <code>Queue.Empty</code> exception when it should.</li>
</ul></li>
</ul> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=617472018-04-10T18:28:40ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Lucas Di Pentima wrote:</p>
<blockquote>
<p>Here're some comments/questions:</p>
<ul>
<li>File <code>sdk/python/arvados/commands/keepdocker.py</code>:
<ul>
<li>Lines 425-429: For the sake of having less lines of code, could we say something like <code>parent_project_uuid = args.project_uuid or api.users().current().execute(…)</code> ?</li>
</ul></li>
</ul>
</blockquote>
<p>Done.</p>
<blockquote>
<ul>
<li>File <code>sdk/cwl/arvados_cwl/__init__.py</code>:
<ul>
<li>Lines 168, 169: I think we could save 1 indentation level by putting all those conditionals on a single <code>if</code> statement</li>
</ul></li>
</ul>
</blockquote>
<p>Done.</p>
<blockquote>
<ul>
<li>Lines 181, 182: Should those be inside the <code>with:</code> block? If not, why is it necessary to use a lock to just get a value from a dict?</li>
</ul>
</blockquote>
<p>The other lines don't need to be in the with: block. You're right it isn't technically necessary to get a lock to get a value from a dict, however updating the <code>processes</code> dict can occur together in the same critical section as other operations so it is better to be on the safe side.</p>
<blockquote>
<ul>
<li>Line 389: Why passing <code>thread_count</code> through cwltool/arv_executor instead of just using it at <code>ArvCwlRunner</code> instantiation?</li>
</ul>
</blockquote>
<p>You're right, that's better. Done.</p>
<blockquote>
<ul>
<li>Line 554: Small typo ‘pendingjobs’ on log message</li>
</ul>
</blockquote>
<p>Fixed.</p>
<blockquote>
<ul>
<li>Would it be convenient to write some tests for <code>TaskQueue</code>? It’s written as it would be useful to other modules as needed, so I think it would be nice to guarantee that’s working as expected.</li>
</ul>
</blockquote>
<p>I added a couple of simple tests.</p>
<blockquote>
<ul>
<li>File <code>sdk/cwl/arvados_cwl/task_queue.py</code>:
<ul>
<li>Line 45: Should <code>task()</code> be inside a <code>try: except:</code> block?</li>
</ul></li>
</ul>
</blockquote>
<p>No. When task() is called synchronously, if there is an unhandled exception, it should go to the caller.</p>
<blockquote>
<ul>
<li>Line 51: I think that <code>task_queue.empty()</code> check is unnecessary, <code>get()</code> will still raise a <code>Queue.Empty</code> exception when it should.</li>
</ul>
</blockquote>
<p>No, when <code>Queue.get()</code> is called without block=False or a timeout, it won't raise an Empty() exception. However, it also seems that calling get() will raise the Empty exception in any case that it can't get the queue lock, not just when the queue is actually empty, so getting the Empty exception doesn't guarantee that the queue is really empty. So we still want to explicitly check if it is empty.</p>
<p>However it is also possible that something could sweep in a grab the last item between checking empty() calling get() so I added a timeout.</p>
<p>13108-cwl-parallel-submit @ <a class="changeset" title="13108: Add test for taskqueue Also tighten up code in a few places. Arvados-DCO-1.1-Signed-off-..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/138fef8ee97f3cbd335434ad6acd26771fd0b762">138fef8ee97f3cbd335434ad6acd26771fd0b762</a></p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=617522018-04-10T19:35:54ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>This LGTM. Thanks!</p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=617542018-04-10T19:46:15ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li><li><strong>% Done</strong> changed from <i>50</i> to <i>100</i></li></ul><p>Applied in changeset <a class="changeset" title="Merge branch '13108-cwl-parallel-submit' closes #13108 Arvados-DCO-1.1-Signed-off-by: Peter Amst..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/20846db140dbba8a688718716e1e0f99ccfb3b51">arvados|20846db140dbba8a688718716e1e0f99ccfb3b51</a>.</p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=617932018-04-11T15:24:43ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>Resolved</i> to <i>Feedback</i></li><li><strong>Target version</strong> changed from <i>2018-04-11 Sprint</i> to <i>2018-04-25 Sprint</i></li></ul> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=618552018-04-13T02:31:09ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>13108-acr-threading-fixes @ <a class="changeset" title="13108: crunch_script uses safeapi Also take workflow lock on update_pipeline_component to preven..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/19da21ab8e56154d7db15c2643524cb8348a7a8a">19da21ab8e56154d7db15c2643524cb8348a7a8a</a></p>
<ul>
<li>crunch_script.py initializes ArvCwlRunner with ThreadSafeApiCache </li>
<li>Fix "Runner.done" method to use <code>arvrunner.process_done</code> (which has correct locking)</li>
<li>ArvadosJob.update_pipeline_component takes lock to avoid update races</li>
<li>Don't try to install signal handler on background when invoking arv-put via arv-keepdocker</li>
<li>Do install signal handler in main thread to convert SIGTERM into KeyboardInterrupt</li>
</ul>
<p><a class="external" href="https://ci.curoverse.com/job/developer-run-tests/685/">https://ci.curoverse.com/job/developer-run-tests/685/</a></p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=618612018-04-13T13:32:20ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>In Progress</i></li></ul> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=618632018-04-13T14:05:45ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>This LGTM, the only detail that may or may not be an issue is that <code>arv-put</code> logs a stack trace when receiving signals and it assumes that <code>KeyboardInterrupt</code> is "translated" to <code>SystemExit</code> (<code>put.py</code> lines 576-588)</p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=618652018-04-13T16:05:17ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Lucas Di Pentima wrote:</p>
<blockquote>
<p>This LGTM, the only detail that may or may not be an issue is that <code>arv-put</code> logs a stack trace when receiving signals and it assumes that <code>KeyboardInterrupt</code> is "translated" to <code>SystemExit</code> (<code>put.py</code> lines 576-588)</p>
</blockquote>
<p>I refactored things a bit, so arv-put and a-c-r use the same code to install the signal handler.</p>
<p>13108-acr-threading-fixes @ <a class="changeset" title="13108: Fix import Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <pamstutz@veritasgenetics.com>" href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/8e3adbcf390deaffed7f2449056959252e1a49f4">8e3adbcf390deaffed7f2449056959252e1a49f4</a></p>
<p><a class="external" href="https://ci.curoverse.com/job/developer-run-tests/686/">https://ci.curoverse.com/job/developer-run-tests/686/</a></p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=618672018-04-13T17:04:56ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>This LGTM, thanks!</p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=618702018-04-13T20:23:49ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>20-way scatter</p>
<table>
<tr>
<td>disable reuse</td>
<td>synchronous submit</td>
<td>8s</td>
</tr>
<tr>
<td>disable reuse</td>
<td>asynchronous submit</td>
<td>3s</td>
</tr>
<tr>
<td>enable reuse</td>
<td>synchronous submit</td>
<td>4s</td>
</tr>
<tr>
<td>enable reuse</td>
<td>asynchronous submit</td>
<td>3s</td>
</tr>
</table>
<p>100-way scatter</p>
<table>
<tr>
<td>enable reuse</td>
<td>synchronous submit</td>
<td>16s</td>
</tr>
<tr>
<td>enable reuse</td>
<td>asynchronous submit</td>
<td>13s</td>
</tr>
</table>
<p>note: about 6s of that is the workflow engine constructing each scatter job. This isn't parallelized, only the submission to Arvados.</p> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=618802018-04-13T23:09:42ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-2 priority-4 priority-default parent" href="/issues/13351">Bug #13351</a>: Benchmark container_request creation and see if there are opportunities for optimization</i> added</li></ul> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=620782018-04-25T14:53:55ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul> Arvados - Bug #13108: arvados-cwl-runner dispatches container requests very slowlyhttps://dev.arvados.org/issues/13108?journal_id=647502018-07-23T18:52:29ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Release</strong> set to <i>13</i></li></ul>