https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422014-09-30T21:12:20ZArvadosArvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=157142014-09-30T21:12:20ZWard Vandewegeward@curii.com
<ul><li><strong>Subject</strong> changed from <i>[Crunch] job spawns tasks but never runs them</i> to <i>[Crunch] job spawns tasks but never runs them, gets stuck waiting for first </i></li></ul> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=157162014-09-30T21:13:10ZWard Vandewegeward@curii.com
<ul><li><strong>Subject</strong> changed from <i>[Crunch] job spawns tasks but never runs them, gets stuck waiting for first </i> to <i>[Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.</i></li></ul> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=157172014-09-30T21:16:48ZWard Vandewegeward@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/15717/diff?detail_id=14485">diff</a>)</li></ul> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=157182014-09-30T21:22:54ZAbram Connellyabram.connelly@gmail.com
<ul></ul><p>Though it may be unrelated, since the files got dumped into the base directory instead of being split up into their own sub-directories, the blocks associated to individual filenames got interleaved from the collation at the end.</p> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=157802014-10-01T21:03:45ZWard Vandewegeward@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/15780/diff?detail_id=14535">diff</a>)</li></ul> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=157812014-10-01T21:11:51ZAbram Connellyabram.connelly@gmail.com
<ul></ul><p>Another clue: I gave 'run-command' the 'PersonToTileLibraryShim' script to run but it didn't exist. The actual script is 'PersonToTileLibraryShim.sh' (in job <a href="https://arvadosapi.com/9tee4-8i9sb-50r7da1j0ks4db1">9tee4-8i9sb-50r7da1j0ks4db1</a>)</p> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=157832014-10-01T22:27:28ZAbram Connellyabram.connelly@gmail.com
<ul></ul><p>On investigation, it looks like line 127 in 'run-command' is causing the issue:</p>
<pre>
122 try:
123 if "task.foreach" in jobp:
124 if arvados.current_task()['sequence'] == 0:
125 var = jobp["task.foreach"]
126 items = get_items(jobp, jobp[var])
127 logging.info("parallelizing on %s with items %s" % (var, items))
128 if items is not None:
129 for i in items:
130 params = copy.copy(jobp)
131 params[var] = i
132 arvados.api().job_tasks().create(body={
133 'job_uuid': arvados.current_job()['uuid'],
</pre>
<p>That is, the 'logging.info("parallelizing on %s with items %s" % (var, items))' line.</p>
<p>Where the 'items' variable is a list of 863 length with each entry of about 60-80 characters. Taking it out causes the job to fail (as expected). Putting it back in and the job hangs. When that 'logging.info(...)' call is in place, execution in run-command does not proceed past that point.</p> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=158062014-10-02T15:22:33ZAbram Connellyabram.connelly@gmail.com
<ul></ul><p>Updating to the latest 'run-command' and 'crunchutil' in the crunch scripts appears to sovle the issue. The pipeline/job is still failing but for 'valid' reasons. The job is not hanging anymore.</p>
<p>Using 'run-command' and 'crunchutil' from git hash 5d006b95fdcbb48afffef8272b4d2071acb5221b.</p>
<p>Here is a diff from the versio of 'run-command' and 'subst.py' I was using to the new version:<br /><pre>
$ diff run-command $HOME/prog/arvados/crunch_scripts/run-command
4c4,9
< logging.basicConfig(level=logging.INFO, format="run-command: %(message)s")
---
>
> logger = logging.getLogger('run-command')
> log_handler = logging.StreamHandler()
> log_handler.setFormatter(logging.Formatter("run-command: %(message)s"))
> logger.addHandler(log_handler)
> logger.setLevel(logging.INFO)
127c132
< logging.info("parallelizing on %s with items %s" % (var, items))
---
> logger.info("parallelizing on %s with items %s" % (var, items))
168,169c173,178
< logging.info("{}{}{}".format(' '.join(cmd), (" < " + stdinname) if stdinname is not None else "", (" > " + stdoutname) if stdoutname is not None else ""))
<
---
> logger.info("{}{}{}".format(' '.join(cmd), (" < " + stdinname) if stdinname is not None else "", (" > " + stdoutname) if stdoutname is not None else ""))
> except subst.SubstitutionError as e:
> logger.error(str(e))
> logger.error("task parameters were:")
> logger.error(pprint.pformat(taskp))
> sys.exit(1)
171,173c180,182
< logging.exception("caught exception")
< logging.error("task parameters was:")
< logging.error(pprint.pformat(taskp))
---
> logger.exception("caught exception")
> logger.error("task parameters were:")
> logger.error(pprint.pformat(taskp))
189c198
< logging.critical("terminating on signal %s" % sig.sig)
---
> logger.critical("terminating on signal %s" % sig.sig)
192c201
< logging.info("completed with exit code %i (%s)" % (rcode, "success" if rcode == 0 else "failed"))
---
> logger.info("completed with exit code %i (%s)" % (rcode, "success" if rcode == 0 else "failed"))
195c204
< logging.exception("caught exception")
---
> logger.exception("caught exception")
205c214
< logging.info("the following output files will be saved to keep:")
---
> logger.info("the following output files will be saved to keep:")
209c218
< logging.info("start writing output to keep")
---
> logger.info("start writing output to keep")
219c228
< outcollection = robust_put.upload(outdir)
---
> outcollection = robust_put.upload(outdir, logger)
$
$
$ diff crunchutil/subst.py $HOME/prog/arvados/crunch_scripts/crunchutil/subst.py
3a4,6
> class SubstitutionError(Exception):
> pass
>
31c34
< raise Exception("Substitution error, mismatched parentheses {}".format(c))
---
> raise SubstitutionError("Substitution error, mismatched parentheses {}".format(c))
49c52
< raise Exception("$(glob): No match on '%s'" % v)
---
> raise SubstitutionError("$(glob): No match on '%s'" % v)
60d62
< #print("c is", c)
62,70c64,75
< if m is not None:
< v = do_substitution(p, c[m[0]+2 : m[1]])
< var = True
< for sub in subs:
< if v.startswith(sub):
< r = subs[sub](v[len(sub):])
< var = False
< break
< if var:
---
> if m is None:
> return c
>
> v = do_substitution(p, c[m[0]+2 : m[1]])
> var = True
> for sub in subs:
> if v.startswith(sub):
> r = subs[sub](v[len(sub):])
> var = False
> break
> if var:
> if v in p:
71a77,82
> else:
> raise SubstitutionError("Unknown variable or function '%s' while performing substitution on '%s'" % (v, c))
> if r is None:
> raise SubstitutionError("Substitution for '%s' is null while performing substitution on '%s'" % (v, c))
> if not isinstance(r, basestring):
> raise SubstitutionError("Substitution for '%s' must be a string while performing substitution on '%s'" % (v, c))
73,75c84
< c = c[:m[0]] + r + c[m[1]+1:]
< else:
< return c
---
> c = c[:m[0]] + r + c[m[1]+1:]
</pre></p> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=158672014-10-03T17:38:33ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Story points</strong> set to <i>1.0</i></li></ul> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=159072014-10-03T18:41:23ZWard Vandewegeward@curii.com
<ul><li><strong>Target version</strong> changed from <i>Bug Triage</i> to <i>2014-10-08 sprint</i></li></ul> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=160532014-10-07T14:10:45ZTom Cleggtom@curii.com
<ul></ul><p>There's still an unresolved mystery in <code><a href="https://arvadosapi.com/9tee4-8i9sb-50r7da1j0ks4db1">9tee4-8i9sb-50r7da1j0ks4db1</a></code>: after doing approximately nothing for ~4 minutes, crunchstat says "user -1" and then goes silent for 10 minutes until the job is cancelled.</p>
<pre>
2014-10-01_19:15:46 <a href="https://arvadosapi.com/9tee4-8i9sb-50r7da1j0ks4db1">9tee4-8i9sb-50r7da1j0ks4db1</a> 11998 0 stderr crunchstat: cpuacct.stat user 0
2014-10-01_19:15:46 <a href="https://arvadosapi.com/9tee4-8i9sb-50r7da1j0ks4db1">9tee4-8i9sb-50r7da1j0ks4db1</a> 11998 0 stderr crunchstat: cpuacct.stat sys 0
2014-10-01_19:15:56 <a href="https://arvadosapi.com/9tee4-8i9sb-50r7da1j0ks4db1">9tee4-8i9sb-50r7da1j0ks4db1</a> 11998 0 stderr crunchstat: cpuacct.stat user -1
2014-10-01_19:15:56 <a href="https://arvadosapi.com/9tee4-8i9sb-50r7da1j0ks4db1">9tee4-8i9sb-50r7da1j0ks4db1</a> 11998 0 stderr crunchstat: cpuacct.stat sys 0
2014-10-01_21:25:10 <a href="https://arvadosapi.com/9tee4-8i9sb-50r7da1j0ks4db1">9tee4-8i9sb-50r7da1j0ks4db1</a> 11998 Job cancelled at 2014-10-01T21:25:10Z by user <a href="https://arvadosapi.com/9tee4-tpzed-2l9nsfm4mgltv4c">9tee4-tpzed-2l9nsfm4mgltv4c</a>
2014-10-01_21:25:10 <a href="https://arvadosapi.com/9tee4-8i9sb-50r7da1j0ks4db1">9tee4-8i9sb-50r7da1j0ks4db1</a> 11998 wait for last 1 children to finish
2014-10-01_21:25:10 <a href="https://arvadosapi.com/9tee4-8i9sb-50r7da1j0ks4db1">9tee4-8i9sb-50r7da1j0ks4db1</a> 11998 0 sending 2x signal 2 to pid 12429
</pre>
<p>Reading <code>crunchstat.go</code> it seems that if cpuacct/stat starts saying "user 0" -- or, more likely, an error occurs during os.Open, iouitl.ReadAll, or fmt.Sscanf -- crunch-stat will print a negative CPU usage and then recede into "don't output anything" mode until cpuacct/stat starts reporting non-zero numbers in the expected format.</p>
<p>IIRC it was reported that the docker container was not running during this 10-minute period. This would be consistent with the "cpuacct/stat cannot be opened+read+parsed" behavior. This suggests crunchstat failed to notice that its child had exited.</p>
<p>Should we put the <code>cmd.Wait()</code> block before the 2× <code><-finish_chan</code>?</p> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=161142014-10-07T20:47:09ZTom Cleggtom@curii.com
<ul></ul><p>4044-crunchstat-wait @ <a class="changeset" title="4044: Clean up channel and pipe usage. * Remove code for using one goroutine to copy stdout and ..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/292856a6f1d26dcfd7a652a9a83cbe4a375588d4">292856a</a></p>
<p>This branch fixes/simplifies a bunch of pipe/channel stuff in crunchstat. I haven't found a way to reproduce the hang, but I did fix a few bugs that will <em>at least</em> prevent crunchstat from going silent (and reporting negative CPU usage) due to unreadable cgroup files. I also removed some unnecessary attempts to synchronize stdout/stderr channels that could conceivably contribute to a deadlock condition.</p>
<p>Here's a quick example of testing crunchstat using the "ubuntu" docker image, in the absence of a proper test suite:<br /><pre>
GOPATH="/tmp/tmp.O0vXh97Iqx" go get git.curoverse.com/arvados.git/services/crunchstat
rm /tmp/cid
echo "I am stdin" | \
/tmp/tmp.O0vXh97Iqx/bin/crunchstat -cgroup-root=/sys/fs/cgroup -cgroup-parent=docker -cgroup-cid=/tmp/cid -poll=1000 \
docker run --rm=true --cidfile=/tmp/cid --attach=stdin --attach=stdout --attach=stderr -i ubuntu \
stdbuf --output=0 --error=0 --input=0 \
sh -c 'wc >&2; sleep 3; echo ok'; echo DONE
</pre></p> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=161152014-10-07T20:47:20ZTom Cleggtom@curii.com
<ul><li><strong>Category</strong> set to <i>Crunch</i></li><li><strong>Assigned To</strong> set to <i>Tom Clegg</i></li></ul> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=161332014-10-08T13:19:15ZTom Cleggtom@curii.com
<ul></ul><p>Still looking for consistent results sending data into <code>docker run</code> on stdin:</p>
<pre>
tom.shell.9tee4:~# docker -v
Docker version 1.1.2, build d84a070
tom.shell.9tee4:~# echo foo | docker run -a stdin -a stdout ubuntu wc
0 0 0
tom.shell.9tee4:~# echo foo | docker run -i -a stdout ubuntu wc
[hang, ^C ignored, ^P ignored -> docker kill]
tom.shell.9tee4:~# echo foo | docker run -i -a stdout ubuntu wc
[hang, ^C ignored, ^P ignored -> docker kill]
tom.shell.9tee4:~# echo foo | docker run -i -a stdout ubuntu wc
[hang, ^C ignored, ^P ignored -> docker kill]
tom.shell.9tee4:~# echo foo | docker run -a stdin -a stdout ubuntu wc
0 0 0
</pre>
<p>Aha(?):</p>
<pre>
tom.shell.9tee4:~# echo foo | docker run -i -a stdin -a stdout ubuntu wc
1 1 4
</pre>
It seems that:
<table>
<tr>
<td><code>docker run</code> flags</td>
<td>result</td>
</tr>
<tr>
<td><code>-i</code></td>
<td>hang</td>
</tr>
<tr>
<td><code>-a stdin</code></td>
<td>stdin is empty in container</td>
</tr>
<tr>
<td><code>-i -a stdin</code></td>
<td>stdin works</td>
</tr>
</table> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=161422014-10-08T14:39:26ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Looking at <a class="changeset" title="4044: Clean up channel and pipe usage. * Remove code for using one goroutine to copy stdout and ..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/292856a6f1d26dcfd7a652a9a83cbe4a375588d4">292856a</a></p>
<ul>
<li>I suspect that we're going to get stderr and stdout stepping on each other in the logs, at least until we do <a class="issue tracker-1 status-5 priority-4 priority-default closed parent" title="Bug: [Crunch] Task stdout appearing in job logs should be labeled "stdout" instead of "stderr" (Closed)" href="https://dev.arvados.org/issues/4028">#4028</a> so that the "synchronize lines" logic can happen downstream. I guess we will find out if this is actually a problem or not.</li>
<li>If there's an Open() error reading any of the cgroup stat files, it uses "continue" to go back to the top of the loop and won't do any of the subsequent stats.</li>
</ul> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=161432014-10-08T14:55:11ZTom Cleggtom@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<p>Looking at <a class="changeset" title="4044: Clean up channel and pipe usage. * Remove code for using one goroutine to copy stdout and ..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/292856a6f1d26dcfd7a652a9a83cbe4a375588d4">292856a</a></p>
<ul>
<li>I suspect that we're going to get stderr and stdout stepping on each other in the logs, at least until we do <a class="issue tracker-1 status-5 priority-4 priority-default closed parent" title="Bug: [Crunch] Task stdout appearing in job logs should be labeled "stdout" instead of "stderr" (Closed)" href="https://dev.arvados.org/issues/4028">#4028</a> so that the "synchronize lines" logic can happen downstream. I guess we will find out if this is actually a problem or not.</li>
</ul>
</blockquote>
<p>Indeed, there's a risk that log lines will get mangled by crunch-job when tasks write to stdout. But crunchstat is the wrong place to tackle that issue. (For one thing, even if crunchstat is careful to write whole lines to stderr and stdout, there's still no guarantee that crunch-job won't mangle them.)</p>
<blockquote>
<ul>
<li>If there's an Open() error reading any of the cgroup stat files, it uses "continue" to go back to the top of the loop and won't do any of the subsequent stats.</li>
</ul>
</blockquote>
<p>Yes. This probably means the cgroup no longer exists (or something else is terribly wrong) and there's no point writing N error messages per time interval. Added comments @ <a class="changeset" title="4044: Add comments to "continue" statements." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/5ed4ae4302e70504580dafea4fd299757107b04f">5ed4ae4</a></p> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=162232014-10-08T19:05:31ZTom Cleggtom@curii.com
<ul><li><strong>Target version</strong> changed from <i>2014-10-08 sprint</i> to <i>2014-10-29 sprint</i></li></ul> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=162282014-10-08T19:10:40ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul> Arvados - Bug #4044: [Crunch] job spawns tasks but never runs them, gets stuck waiting for first task to end, even though it has already ended.https://dev.arvados.org/issues/4044?journal_id=172312014-10-30T17:11:16ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul>