https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422016-11-22T21:25:38ZArvadosArvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=456152016-11-22T21:25:38ZTom Morristfmorris@veritasgenetics.com
<ul><li><strong>Target version</strong> set to <i>2016-12-14 sprint</i></li></ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=457332016-11-23T20:10:07ZTom Cleggtom@curii.com
<ul><li><strong>Assigned To</strong> set to <i>Tom Clegg</i></li></ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=458072016-11-28T20:15:45ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>Hi Josh, we have some questions:</p>
<ul>
<li>Is there an arv-mount process still running, or did it crash?</li>
<li>Is there any evidence that arv-mount was out-of-memory killed?</li>
<li>Are there any logs indicating what happened to arv-mount?</li>
<li>By "running" do you mean crunch-job is still running and expecting the job to complete, or that crunch-job finished but the docker container did not exit?</li>
<li>What processes are running inside the Docker containers that are "not doing anything"? Can you use strace to find out what they are doing / where they are stuck?</li>
<li>Is crunchstat still running?</li>
</ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=458102016-11-28T20:33:58ZJoshua Randalljr17@sanger.ac.uk
<ul></ul><p>This was one of two phenotypes that was causing lockups across our cluster (the other being 10586). For this one, the arv-mount process was no longer running.</p>
<p>I didn't see any evidence of arv-mount being OOM killed, and there is nothing at all in kern.log* from the OOM killer.</p>
<p>I didn't find any logs regarding what happened to arv-mount but I didn't look "everywhere" for it. Unfortunately any of the other places I would have looked (crunch-dispatch-jobs log or syslog) have been rotated out at this point.</p>
<p>The docker container was still running, my crunch script was still running, GATK was still running, crunchstat was still running, and crunch-job was still running and forwarding crunchstat output all the way back to the job log. It took me a while to realise that jobs were stuck because crunchstat output was still coming through in the job logs, and only on closer inspection did I find that there was no actual job output interspersed with them. From what I could tell, the only thing that was supposed to be running but wasn't was arv-mount (and as a result, the I/O on /keep from inside the container was blocking).</p>
<p>I killed all these containers last week, but we did a lot of strace investigating before killing them. My crunch script was waiting for output from GATK. GATK appeared to be stuck waiting on I/O of some kind (but it is hard to tell exactly what because Java). I assume it was waiting on read operations on one of its inputs (all of which were in keep).</p>
<p>Yes, as above, crunchstat was still running and merrily reporting the (lack of) activity.</p> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=463162016-12-14T16:27:47ZTom Cleggtom@curii.com
<ul></ul><p>Sounds like when arv-mount dies unexpectedly we go from</p>
<pre>
├──dockerd───docker-container───docker-container───GATK
│
├──slurmd───slurmstepd───arv-mount───crunchstat───docker-run
</pre>
<p>to</p>
<pre>
├──dockerd───docker-container───docker-container───GATK
│
├──slurmd───slurmstepd───crunchstat───docker-run
</pre>
...and if GATK is waiting forever for FUSE (because FUSE), then
<ul>
<li>docker-run will wait forever for the container to exit</li>
<li>the job will be "running" forever</li>
</ul>
<p>If we assume the worst -- arv-mount gets killed so forcefully that it has no opportunity to clean up -- we could:</p>
1. Insert <em>another</em> process before arv-mount that cleans things up after arv-mount exits.
<ul>
<li><code>fusermount -u -z /keep</code></li>
<li>kill the docker container (we already use <code>docker run --name=$containername</code> so we should be able to pass the same name to the watchdog so it can choose the right one)</li>
</ul>
<p>or</p>
2. Fork a watchdog process from arv-mount at startup (before starting any threads) which can clean up (as above) if the main process dies.
<ul>
<li>Easy enough to communicate the mount point to the child -- but what if the mount fails and the mount point belongs to a <em>different</em> fuse process when the watchdog tries to unmount it? Should the main process sigkill the watchdog if exiting when cleanup isn't appropriate?</li>
<li>How does the watchdog know which docker container to kill? crunch-job could pass an entire cleanup command to arv-mount for the watchdog to use. Seems fragile.</li>
</ul>
<p>or</p>
3. Have crunchstat monitor its parent PID and, if the parent dies, kill crunchstat's child with SIGTERM (which should stop the container, being the default value for <code>docker run --stop-signal</code>) and exit.
<ul>
<li><code>-signal-on-dead-ppid=15</code> ?</li>
</ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=463222016-12-14T18:07:24ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>If GATK is in an uninterruptible I/O wait, killing the container might not be enough; keep still needs to be unmounted (possibly with <code>umount -f</code>).</p> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=463372016-12-14T20:03:06ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Target version</strong> changed from <i>2016-12-14 sprint</i> to <i>2017-01-04 sprint</i></li><li><strong>Story points</strong> set to <i>0.5</i></li></ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=465102016-12-16T21:19:54ZTom Cleggtom@curii.com
<ul></ul><p>10585-crunchstat-lost-parent @ <a class="changeset" title="10585: Add crunchstat -signal-on-dead-ppid option." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/fc390927833d14b6c439db8ea72d3d52b60a5e6d">fc390927833d14b6c439db8ea72d3d52b60a5e6d</a></p>
<p>Worst test case ever. :(</p>
<blockquote>
<p>If GATK is in an uninterruptible I/O wait, killing the container might not be enough; keep still needs to be unmounted (possibly with <code>umount -f</code>).</p>
</blockquote>
<p>It does seem plausible that docker won't do a good job in some situations, but even so I think this feature makes sense for the situations where <code>docker run --stop-signal</code> does what it claims to do.</p>
Even if the container is still wedged in some fuse state, just ending the <code>docker run</code> process itself should be a big improvement over the reported behavior:
<ul>
<li>The affected job will end (so it can be retried) instead of hanging forever.</li>
<li>Next time we try to run a job on this node, crunch-job's "fusermount -u -z" stuff will potentially help clean things up.</li>
</ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=465202016-12-19T18:48:46ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><ul>
<li>Is there a reason for the default values of <code>signalOnDeadPPID</code> & <code>ppidCheckInterval</code> vars to be defined in different places?</li>
<li>If the default behaviour is to signal the child process with 15, maybe adding a note on the argument documentation string that says something like “Use zero to disable” would be helpful.</li>
<li>In func <code>sendSignalOnDeadPPID()</code>, the monitored PID is passed as an argument but the check interval is read from the global variable. Wouldn’t it be more consistent to give the same treatment to both values?</li>
<li>What happens if the user passes a negative signal number? I’ve searched golang’s docs but couldn’t find a clear answer to this case.</li>
<li>Is it OK to break from the loop leaving the Ticker running? (Line 126)</li>
</ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=465652016-12-22T06:34:45ZTom Cleggtom@curii.com
<ul></ul><p>Lucas Di Pentima wrote:</p>
<blockquote>
<ul>
<li>Is there a reason for the default values of <code>signalOnDeadPPID</code> & <code>ppidCheckInterval</code> vars to be defined in different places?</li>
<li>In func <code>sendSignalOnDeadPPID()</code>, the monitored PID is passed as an argument but the check interval is read from the global variable. Wouldn’t it be more consistent to give the same treatment to both values?</li>
</ul>
</blockquote>
<p>Good points, thanks. Fixed both of these to be consistent.</p>
<blockquote>
<ul>
<li>If the default behaviour is to signal the child process with 15, maybe adding a note on the argument documentation string that says something like “Use zero to disable” would be helpful.</li>
</ul>
</blockquote>
<p>Good call, added.</p>
<blockquote>
<ul>
<li>What happens if the user passes a negative signal number? I’ve searched golang’s docs but couldn’t find a clear answer to this case.</li>
</ul>
</blockquote>
<p>I think we'd get "error: sending signal -4: invalid argument" or something. I've added a check for negative numbers in case someone tries to use -1 to disable. I considered checking for too-large numbers too, but that starts to feel a bit too much like second-guessing the OS's list of signals.</p>
<blockquote>
<ul>
<li>Is it OK to break from the loop leaving the Ticker running? (Line 126)</li>
</ul>
</blockquote>
<p>I don't think that would break anything (tickers don't accumulate unbounded backlogs of ticks etc) but it does seem like good form to clean up anyway so I've updated that.</p>
<p><a class="changeset" title="10585: Clean up defaults and error checks; release ticker when finished." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/cc94954f69ed2d26451bae6610b38de260d2252f">cc94954f69ed2d26451bae6610b38de260d2252f</a></p> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=465682016-12-22T15:11:11ZLucas Di Pentimalucas.dipentima@curii.com
<ul></ul><p>LGTM. Ran <code>services/crunchstat</code> tests locally without issues.</p> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=465712016-12-22T15:23:19ZTom Cleggtom@curii.com
<ul></ul>Now that -signal-on-dead-ppid=15 is the default, we should see an improvement here:
<ul>
<li>If arv-mount dies, crunchstat will signal docker to stop the container.</li>
<li><em>Assuming things aren't too wedged for even SIGTERM to make <code>docker run</code> stop,</em> this will cause slurmstepd to finish the task and return arv-mount's non-zero exit status to crunch-job.</li>
<li>crunch-job will either reattempt the task or give up, as with other failures.</li>
</ul>
TODO:
<ul>
<li>In crunch2, crunch-run could do the analogous thing, "notice if arv-mount dies, and kill the container." (In crunch2, crunch-run starts arv-mount, instead of the other way around.) (this is now <a class="issue tracker-1 status-3 priority-4 priority-default closed parent" title="Bug: [Crunch2] crunch-run: stop the container and fail if arv-mount dies before the container finishes (Resolved)" href="https://dev.arvados.org/issues/10777">#10777</a>)</li>
</ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=466372016-12-28T15:17:32ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=468452017-01-04T20:06:54ZTom Cleggtom@curii.com
<ul><li><strong>Target version</strong> changed from <i>2017-01-04 sprint</i> to <i>2017-01-18 sprint</i></li><li><strong>Story points</strong> changed from <i>0.5</i> to <i>0.0</i></li></ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=473742017-01-18T18:53:05ZTom Cleggtom@curii.com
<ul><li><strong>Target version</strong> changed from <i>2017-01-18 sprint</i> to <i>2017-02-01 sprint</i></li></ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=479042017-02-01T20:08:06ZTom Cleggtom@curii.com
<ul><li><strong>Target version</strong> changed from <i>2017-02-01 sprint</i> to <i>2017-02-15 sprint</i></li></ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=483692017-02-15T20:01:33ZTom Cleggtom@curii.com
<ul><li><strong>Target version</strong> changed from <i>2017-02-15 sprint</i> to <i>2017-03-01 sprint</i></li></ul> Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dieshttps://dev.arvados.org/issues/10585?journal_id=488652017-03-01T20:22:08ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul>