Arvados - Bug #10585: crunch doesn't end jobs when their arv-mount dies</h1>
<article> <p>2016-11-22T21:25:38Z</p> <ul><li><strong>Target version</strong> set to <i>2016-12-14 sprint</i></li></ul> </article>
<article> <p>2016-11-23T20:10:07Z</p> <ul><li><strong>Assigned To</strong> set to <i>Tom Clegg</i></li></ul> </article>
<article> <p>2016-11-28T20:15:45Z</p> <p>Hi Josh, we have some questions:</p> <ul> <li>Is there an arv-mount process still running, or did it crash?</li> <li>Is there any evidence that arv-mount was out-of-memory killed?</li> <li>Are there any logs indicating what happened to arv-mount?</li> <li>By "running" do you mean crunch-job is still running and expecting the job to complete, or that crunch-job finished but the docker container did not exit?</li> <li>What processes are running inside the Docker containers that are "not doing anything"? Can you use strace to find out what they are doing / where they are stuck?</li> <li>Is crunchstat still running?</li> </ul> </article>
<article> <p>2016-11-28T20:33:58Z</p> <p>This was one of two phenotypes that were causing lockups across our cluster (the other being 10586). For this one, the arv-mount process was no longer running.</p> <p>I didn't see any evidence of arv-mount being OOM killed, and there is nothing at all in kern.log* from the OOM killer.</p> <p>I didn't find any logs regarding what happened to arv-mount, but I didn't look "everywhere" for it. 
Unfortunately, the other places I would have looked (crunch-dispatch-jobs log or syslog) have been rotated out at this point.</p> <p>The docker container was still running, my crunch script was still running, GATK was still running, crunchstat was still running, and crunch-job was still running and forwarding crunchstat output all the way back to the job log. It took me a while to realise that jobs were stuck, because crunchstat output was still coming through in the job logs, and only on closer inspection did I find that there was no actual job output interspersed with it. From what I could tell, the only thing that was supposed to be running but wasn't was arv-mount (and as a result, the I/O on /keep from inside the container was blocking).</p> <p>I killed all these containers last week, but we did a lot of strace investigating before killing them. My crunch script was waiting for output from GATK. GATK appeared to be stuck waiting on I/O of some kind (but it is hard to tell exactly what because Java). I assume it was waiting on read operations on one of its inputs (all of which were in keep).</p> <p>Yes, as above, crunchstat was still running and merrily reporting the (lack of) activity.</p> </article>
<article> <p>2016-12-14T16:27:47Z</p> <p>Sounds like when arv-mount dies unexpectedly we go from</p>
<pre>
├──dockerd───docker-container───docker-container───GATK
│
├──slurmd───slurmstepd───arv-mount───crunchstat───docker-run
</pre>
<p>to</p>
<pre>
├──dockerd───docker-container───docker-container───GATK
│
├──slurmd───slurmstepd───crunchstat───docker-run
</pre>
<p>...and if GATK is waiting forever for FUSE (because FUSE), then</p> <ul> <li>docker-run will wait forever for the container to exit</li> <li>the job will be "running" forever</li> </ul> <p>If we assume the worst -- arv-mount gets killed so forcefully that it has no opportunity to clean up -- we could:</p> 1. 
Insert <em>another</em> process before arv-mount that cleans things up after arv-mount exits. <ul> <li><code>fusermount -u -z /keep</code></li> <li>kill the docker container (we already use <code>docker run --name=$containername</code>, so we should be able to pass the same name to the watchdog so it can choose the right one)</li> </ul> <p>or</p> 2. Fork a watchdog process from arv-mount at startup (before starting any threads) which can clean up (as above) if the main process dies. <ul> <li>Easy enough to communicate the mount point to the child -- but what if the mount fails and the mount point belongs to a <em>different</em> FUSE process when the watchdog tries to unmount it? Should the main process SIGKILL the watchdog if it exits when cleanup isn't appropriate?</li> <li>How does the watchdog know which docker container to kill? crunch-job could pass an entire cleanup command to arv-mount for the watchdog to use. Seems fragile.</li> </ul> <p>or</p> 3. Have crunchstat monitor its parent PID and, if the parent dies, kill crunchstat's child with SIGTERM (which should stop the container, being the default value for <code>docker run --stop-signal</code>) and exit. 
<ul> <li><code>-signal-on-dead-ppid=15</code> ?</li> </ul> </article>
<article> <p>2016-12-14T18:07:24Z</p> <p>If GATK is in an uninterruptible I/O wait, killing the container might not be enough, keep still needs to be unmounted (possibly with umount -f)</p> </article>
<article> <p>2016-12-14T20:03:06Z</p> <ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li><li><strong>Target version</strong> changed from <i>2016-12-14 sprint</i> to <i>2017-01-04 sprint</i></li><li><strong>Story points</strong> set to <i>0.5</i></li></ul> </article>
<article> <p>2016-12-16T21:19:54Z</p> <p>10585-crunchstat-lost-parent @ <a class="changeset" title="10585: Add crunchstat -signal-on-dead-ppid option." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/fc390927833d14b6c439db8ea72d3d52b60a5e6d">fc390927833d14b6c439db8ea72d3d52b60a5e6d</a></p> <p>Worst test case ever. 
:(</p> <blockquote> <p>If GATK is in an uninterruptible I/O wait, killing the container might not be enough, keep still needs to be unmounted (possibly with umount -f)</p> </blockquote> <p>It does seem plausible that docker won't do a good job in some situations, but even so I think this feature makes sense for the situations where <code>docker run --stop-signal</code> does what it claims to do.</p> <p>Even if the container is still wedged in some fuse state, just ending the <code>docker run</code> process itself should be a big improvement over the reported behavior:</p> <ul> <li>The affected job will end (so it can be retried) instead of hanging forever.</li> <li>Next time we try to run a job on this node, crunch-job's "fusermount -u -z" stuff will potentially help clean things up.</li> </ul> </article>
<article> <p>2016-12-19T18:48:46Z</p> <ul> <li>Is there a reason for the default values of <code>signalOnDeadPPID</code> &amp; <code>ppidCheckInterval</code> vars to be defined in different places?</li> <li>If the default behaviour is to signal the child process with 15, maybe adding a note on the argument documentation string that says something like “Use zero to disable” would be helpful.</li> <li>On func <code>sendSignalOnDeadPPID()</code>, the monitored PID is passed by argument but the check interval is read from the global variable; wouldn’t it be more consistent to give the same treatment to both values?</li> <li>What happens if the user passes a negative signal number? I’ve searched golang’s docs but couldn’t find a clear answer to this case.</li> <li>Is it OK to break from the loop leaving the Ticker running? 
(Line 126)</li> </ul> </article>
<article> <p>2016-12-22T06:34:45Z</p> <p>Lucas Di Pentima wrote:</p> <blockquote> <ul> <li>Is there a reason for the default values of <code>signalOnDeadPPID</code> &amp; <code>ppidCheckInterval</code> vars to be defined in different places?</li> <li>On func <code>sendSignalOnDeadPPID()</code>, the monitored PID is passed by argument but the check interval is read from the global variable; wouldn’t it be more consistent to give the same treatment to both values?</li> </ul> </blockquote> <p>Good points, thanks. Fixed both of these to be consistent.</p> <blockquote> <ul> <li>If the default behaviour is to signal the child process with 15, maybe adding a note on the argument documentation string that says something like “Use zero to disable” would be helpful.</li> </ul> </blockquote> <p>Good call, added.</p> <blockquote> <ul> <li>What happens if the user passes a negative signal number? I’ve searched golang’s docs but couldn’t find a clear answer to this case.</li> </ul> </blockquote> <p>I think we'd get "error: sending signal -4: invalid argument" or something. I've added a check for negative numbers in case someone tries to use -1 to disable. I considered checking for too-large numbers too, but that starts to feel a bit too much like second-guessing the OS's list of signals.</p> <blockquote> <ul> <li>Is it OK to break from the loop leaving the Ticker running? (Line 126)</li> </ul> </blockquote> <p>I don't think that would break anything (tickers don't accumulate unbounded backlogs of ticks, etc.) but it does seem like good form to clean up anyway, so I've updated that.</p> <p><a class="changeset" title="10585: Clean up defaults and error checks; release ticker when finished." 
href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/cc94954f69ed2d26451bae6610b38de260d2252f">cc94954f69ed2d26451bae6610b38de260d2252f</a></p> </article>
<article> <p>2016-12-22T15:11:11Z</p> <p>LGTM. Ran <code>services/crunchstat</code> tests locally without issues.</p> </article>
<article> <p>2016-12-22T15:23:19Z</p> <p>Now that <code>-signal-on-dead-ppid=15</code> is the default, we should see an improvement here:</p> <ul> <li>If arv-mount dies, crunchstat will signal docker to stop the container.</li> <li><em>Assuming things aren't too wedged for even SIGTERM to make <code>docker run</code> stop,</em> this will cause slurmstepd to finish the task, and return arv-mount's non-zero exit status to crunch-job</li> <li>crunch-job will either reattempt the task or give up, as with other failures.</li> </ul> <p>TODO:</p> <ul> <li>In crunch2, crunch-run could do the analogous thing, "notice if arv-mount dies, and kill the container." (In crunch2, crunch-run starts arv-mount, instead of the other way around.) 
(this is now <a class="issue tracker-1 status-3 priority-4 priority-default closed parent" title="Bug: [Crunch2] crunch-run: stop the container and fail if arv-mount dies before the container finishes (Resolved)" href="https://dev.arvados.org/issues/10777">#10777</a>)</li> </ul> </article>
<article> <p>2016-12-28T15:17:32Z</p> <ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Feedback</i></li></ul> </article>
<article> <p>2017-01-04T20:06:54Z</p> <ul><li><strong>Target version</strong> changed from <i>2017-01-04 sprint</i> to <i>2017-01-18 sprint</i></li><li><strong>Story points</strong> changed from <i>0.5</i> to <i>0.0</i></li></ul> </article>
<article> <p>2017-01-18T18:53:05Z</p> <ul><li><strong>Target version</strong> changed from <i>2017-01-18 sprint</i> to <i>2017-02-01 sprint</i></li></ul> </article>
<article> <p>2017-02-01T20:08:06Z</p> <ul><li><strong>Target version</strong> changed from <i>2017-02-01 sprint</i> to <i>2017-02-15 sprint</i></li></ul> </article>
<article> <p>2017-02-15T20:01:33Z</p> <ul><li><strong>Target version</strong> changed from <i>2017-02-15 sprint</i> to <i>2017-03-01 sprint</i></li></ul> </article>
<article> <p>2017-03-01T20:22:08Z</p> <ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Resolved</i></li></ul> </article>
</main></body></html>