https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422017-01-04T18:32:19ZArvadosArvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=468212017-01-04T18:32:19ZWard Vandewegeward@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/46821/diff?detail_id=45083">diff</a>)</li></ul> Arvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=468222017-01-04T18:34:51ZWard Vandewegeward@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/46822/diff?detail_id=45084">diff</a>)</li></ul> Arvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=468232017-01-04T18:35:19ZWard Vandewegeward@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/46823/diff?detail_id=45085">diff</a>)</li></ul> Arvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=468242017-01-04T18:43:04ZWard Vandewegeward@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/46824/diff?detail_id=45086">diff</a>)</li></ul> Arvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=469102017-01-04T23:12:32ZTom Cleggtom@curii.com
<ul></ul><p>10808-file-cache-ownership @ <a class="changeset" title="10808: Avoid using the disk cache if a different user owns it (e.g., running a rake task or crunc..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/d7b27f798a0298f5508842c5f7f03b8fccafa3ab">d7b27f798a0298f5508842c5f7f03b8fccafa3ab</a></p> Arvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=469112017-01-04T23:12:42ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=469122017-01-05T02:06:17ZWard Vandewegeward@curii.com
<ul></ul><p>Tom Clegg wrote:</p>
<blockquote>
<p>10808-file-cache-ownership @ <a class="changeset" title="10808: Avoid using the disk cache if a different user owns it (e.g., running a rake task or crunc..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/d7b27f798a0298f5508842c5f7f03b8fccafa3ab">d7b27f798a0298f5508842c5f7f03b8fccafa3ab</a></p>
</blockquote>
<p>Background: Tom found</p>
<pre>
#&lt;Errno::EACCES: Permission denied @ unlink_internal - /var/www/arvados-api/current/tmp/cache/94F/070/SweepTrashedCollections&gt;
</pre>
<p>in the API server logs. SweepTrashedCollections is new; it runs whenever Collection.where() is executed.</p>
<p>Clearly, something running as root created the cached copy on c97qk by executing Collection.where(). First we suspected the rake call during package installation, but our postinst script is already very careful to fix the cache dir ownership.</p>
<p>Then we looked at crunch-dispatch, which does still run as root (bleh).</p>
<p>Tom came up with the patch in this branch to avoid getting into the state where cache dir ownership is screwed up.</p>
<p>I have applied this patch on c97qk and confirmed that the if block is executed for crunch-dispatch (which currently runs as root, :/), but is not triggered for the Rails processes running under Passenger as the www-data user (on Debian-based systems).</p>
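<p>For readers who have not opened the changeset, here is a minimal sketch of the kind of ownership check described above. It is not the code from the branch: the helper name <code>pick_cache_store</code> and its return values are assumptions made for illustration. It relies only on Ruby's standard <code>File.owned?</code>, which is true when the path exists and is owned by the effective user ID of the current process.</p>

```ruby
require 'tmpdir'

# Hypothetical helper (not the actual patch): choose a cache backend based on
# who owns the on-disk cache directory.
#
# File.owned? is true only if the path exists AND is owned by the effective
# uid of this process. A root-run process looking at a www-data-owned cache
# dir therefore falls back to an in-memory cache and leaves the dir untouched.
def pick_cache_store(cache_path)
  if File.owned?(cache_path)
    [:file_store, cache_path.to_s]
  else
    why = if File.exist?(cache_path)
            "owner (uid=#{File.stat(cache_path).uid}) is not me (uid=#{Process.euid})"
          else
            'does not exist'
          end
    warn "Defaulting to memory cache, because #{cache_path} #{why}"
    [:memory_store]
  end
end
```

<p>In a Rails <code>application.rb</code> the result would presumably be assigned to <code>config.cache_store</code>. With a check of this shape, a process that does not own the cache directory never writes files that another user later fails to unlink.</p>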
<p>I removed the offending root-owned cache directory and restarted crunch-dispatch-jobs and crunch-dispatch-pipelines.</p>
<p>After a few minutes, a new SweepTrashedCollections file appeared, this time owned by www-data, which means it must have been written by the API server, not crunch-dispatch:</p>
<pre>
c97qk:/etc/service# v /var/www/arvados-api/current/tmp/cache/94F
total 4
drwxr-sr-x 3 www-data www-data 16 Jan 5 01:56 ./
drwxrwsr-x 23 www-data www-data 4096 Jan 5 01:56 ../
drwxr-sr-x 2 www-data www-data 36 Jan 5 01:56 070/
c97qk:/etc/service# v /var/www/arvados-api/current/tmp/cache/94F/070/
total 4
drwxr-sr-x 2 www-data www-data 36 Jan 5 01:56 ./
drwxr-sr-x 3 www-data www-data 16 Jan 5 01:56 ../
-rw-r--r-- 1 www-data www-data 111 Jan 5 01:56 SweepTrashedCollections
</pre>
<p>In other words, this looks good to me and I would like to see the patch merged. Thanks!</p> Arvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=469132017-01-05T02:16:41ZWard Vandewegeward@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/46913/diff?detail_id=45183">diff</a>)</li><li><strong>Assigned To</strong> set to <i>Ward Vandewege</i></li><li><strong>Target version</strong> set to <i>2017-01-18 sprint</i></li></ul> Arvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=469182017-01-05T15:22:52ZTom Cleggtom@curii.com
<ul></ul>More detail on the "cancel button doesn't work" problem:
<ul>
<li>Even after the cache dir is fixed, clicking the "Cancel" button doesn't cancel the job.</li>
<li>No error message or other user feedback.</li>
<li>Chrome debugger shows <pre>
Request URL:https://workbench.c97qk.arvadosapi.com/jobs/c97qk-8i9sb-bj9c3ojdng85osz/cancel
Request Method:POST
Status Code:422 Unprocessable Entity
HTTP/1.1 422 Unprocessable Entity
Server: nginx/1.8.0
Date: Thu, 05 Jan 2017 15:19:48 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Status: 422 Unprocessable Entity
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Strict-Transport-Security: max-age=31536000
Pragma: no-cache
X-XSS-Protection: 1; mode=block
X-Request-Id: dff5a12c-afa9-424e-a2e6-62ccbae1b1ef
X-Runtime: 0.302502
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
Expires: Fri, 01 Jan 1990 00:00:00 GMT
X-Powered-By: Phusion Passenger 5.0.16
{"success":false,"errors":["#\u003cArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError\u003e"]}
</pre></li>
</ul>
API log:
<ul>
<li><pre>
User <a href="https://arvadosapi.com/c97qk-tpzed-7o2gfjd1p10dq9r">c97qk-tpzed-7o2gfjd1p10dq9r</a> tried to change protected job attributes on locked Job <a href="https://arvadosapi.com/c97qk-8i9sb-bj9c3ojdng85osz">c97qk-8i9sb-bj9c3ojdng85osz</a>
#&lt;ArvadosModel::PermissionDeniedError: ArvadosModel::PermissionDeniedError&gt;
/data-sdc/var-www/arvados-api/current/app/models/arvados_model.rb:370:in `ensure_permission_to_save'
...
Error 1483629588+6489a86c: 403
{"method":"POST","path":"/arvados/v1/jobs/c97qk-8i9sb-bj9c3ojdng85osz/cancel","format":"*/*","controller":"arvados/v1/jobs","action":"cancel","status":403,"duration":11.66,"view":0.14,"db":3.06,..."@timestamp":"2017-01-05T15:19:48Z"...
</pre></li>
</ul> Arvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=469212017-01-05T15:57:31ZTom Cleggtom@curii.com
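<p>The log above shows the cancel request being rejected because the job is locked by another UUID. One way to let such a request through is to exempt a change of state to Cancelled from the lock check. The sketch below illustrates that idea only. It is not the Arvados implementation: <code>LockedJob</code>, <code>may_update_locked_job?</code> and the shape of the <code>changes</code> hash are all hypothetical, and ordinary write-permission checks are out of scope.</p>

```ruby
# Hypothetical, simplified stand-in for a job record (not the Arvados model).
LockedJob = Struct.new(:state, :priority, :is_locked_by_uuid)

# Returns true if user_uuid may apply `changes` (attribute => new value)
# to `job`, considering only the "locked by uuid" protection.
def may_update_locked_job?(job, user_uuid, changes)
  # An unlocked job has no lock-based restriction.
  return true if job.is_locked_by_uuid.nil?
  # The lock holder may change protected attributes.
  return true if job.is_locked_by_uuid == user_uuid
  # Exemption: anyone else may make exactly one change, state -> Cancelled.
  changes == { state: 'Cancelled' }
end

job = LockedJob.new('Running', 0, 'dispatcher-uuid')
may_update_locked_job?(job, 'someone-else', { state: 'Cancelled' })
```

<p>Keeping the exemption to a single exact change means a request that bundles a cancel with other attribute changes is still refused.</p>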
<ul></ul><p>10808-admin-cancel-job @ <a class="changeset" title="10808: Exempt &quot;change state to Cancelled&quot; from &quot;locked by uuid&quot; protection." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/4a00ceae73e8d76affe6b646832c525355e7897c">4a00ceae73e8d76affe6b646832c525355e7897c</a> <a class="external" href="https://ci.curoverse.com/job/developer-run-tests/127/">https://ci.curoverse.com/job/developer-run-tests/127/</a></p> Arvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=469302017-01-05T17:04:54ZRadhika Chippadaradhika@curoverse.com
<ul></ul><p>Just a nit: it might be easier to follow the intent of the code block if the comment “If we don't own the cache dir …” were placed above the default_cache_path declaration in application.rb.</p>
<p>LGTM</p> Arvados - Bug #10808: [Crunch] stuck job and pipeline instances on c97qkhttps://dev.arvados.org/issues/10808?journal_id=472692017-01-18T02:07:11ZWard Vandewegeward@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul>