https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422017-05-05T12:28:33ZArvadosArvados - Bug #11626: [Crunch2] Still not propagating slurm errors to userhttps://dev.arvados.org/issues/11626?journal_id=514222017-05-05T12:28:33ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/51422/diff?detail_id=49587">diff</a>)</li></ul> Arvados - Bug #11626: [Crunch2] Still not propagating slurm errors to userhttps://dev.arvados.org/issues/11626?journal_id=514452017-05-08T13:25:44ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Target version</strong> set to <i>2017-05-10 sprint</i></li></ul> Arvados - Bug #11626: [Crunch2] Still not propagating slurm errors to userhttps://dev.arvados.org/issues/11626?journal_id=514462017-05-08T13:25:50ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Bug #11626: [Crunch2] Still not propagating slurm errors to userhttps://dev.arvados.org/issues/11626?journal_id=514472017-05-08T13:26:33ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Assigned To</strong> set to <i>Peter Amstutz</i></li></ul> Arvados - Bug #11626: [Crunch2] Still not propagating slurm errors to userhttps://dev.arvados.org/issues/11626?journal_id=514862017-05-08T19:55:51ZTom Cleggtom@curii.com
<ul></ul><p>I'm not sure why the HasUUID() check is added in run() in crunch-dispatch-slurm.go. ctx.Done() fires as soon as HasUUID() returns false, so it seems like it could only make a difference during a race between <-status and <-ctx.Done() -- the rest of the time it just introduces an unnecessary polling delay between dispatch telling crunch-dispatch-slurm to cancel and c-d-s actually running scancel.</p>
<p>Also not sure why we're adding the extra call to tracker.update(c) before tracker.close(). The dispatch library is supposed to reduce the amount of state/priority logic each dispatcher has to implement -- if the idea here is just to log every state change, perhaps it should go in dispatch.go instead?</p>
<p>event_type="dispatch" might be better than "crunch-dispatch-slurm" for filtering.</p>
<p>In TestIntegrationNormal, why is the *bool better than a bool?</p> Arvados - Bug #11626: [Crunch2] Still not propagating slurm errors to userhttps://dev.arvados.org/issues/11626?journal_id=515192017-05-09T20:43:30ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>However I've rebased out the HasUUID changes so we can discuss it separately.</p>
<p>Now at @ <a class="changeset" title="11626: Log sbatch errors where the user can see them. Fix tests so there is no lingering gorouti..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f8675ad473b45387b1286c6b7a41edf36148ebac">f8675ad473b45387b1286c6b7a41edf36148ebac</a></p>
<p>(note-5 refering to <a class="changeset" title="11626: Don't run scancel if the job is already gone from the queue. Fix tests so there is no lin..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/26d7e1809376534b9060534e25399cbd462f38da">26d7e1809376534b9060534e25399cbd462f38da</a>)</p>
<p>Hhe problem I was trying to solve was that the logs are full of this:</p>
<pre>
2017-05-08_14:09:51.64626 2017/05/08 14:09:51 Dispatcher says container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is done: cancel slurm job
2017-05-08_14:09:52.35643 2017/05/08 14:09:52 container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is still in squeue after scancel
2017-05-08_14:09:53.35654 2017/05/08 14:09:53 Dispatcher says container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is done: cancel slurm job
2017-05-08_14:09:54.35664 2017/05/08 14:09:54 container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is still in squeue after scancel
2017-05-08_14:09:55.35680 2017/05/08 14:09:55 Dispatcher says container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is done: cancel slurm job
2017-05-08_14:09:56.35509 2017/05/08 14:09:56 container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is still in squeue after scancel
2017-05-08_14:09:57.35538 2017/05/08 14:09:57 Dispatcher says container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is done: cancel slurm job
2017-05-08_14:09:58.35661 2017/05/08 14:09:58 container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is still in squeue after scancel
2017-05-08_14:09:59.35681 2017/05/08 14:09:59 Dispatcher says container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is done: cancel slurm job
2017-05-08_14:10:00.35669 2017/05/08 14:10:00 container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is still in squeue after scancel
2017-05-08_14:10:01.35804 2017/05/08 14:10:01 Dispatcher says container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is done: cancel slurm job
2017-05-08_14:10:02.35522 2017/05/08 14:10:02 container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is still in squeue after scancel
2017-05-08_14:10:03.35664 2017/05/08 14:10:03 Dispatcher says container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is done: cancel slurm job
2017-05-08_14:10:04.35882 2017/05/08 14:10:04 container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is still in squeue after scancel
2017-05-08_14:10:05.35906 2017/05/08 14:10:05 Dispatcher says container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is done: cancel slurm job
2017-05-08_14:10:06.36192 2017/05/08 14:10:06 container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is still in squeue after scancel
2017-05-08_14:10:07.36220 2017/05/08 14:10:07 Dispatcher says container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is done: cancel slurm job
2017-05-08_14:10:08.36188 2017/05/08 14:10:08 container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is still in squeue after scancel
2017-05-08_14:10:09.36211 2017/05/08 14:10:09 Dispatcher says container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is done: cancel slurm job
2017-05-08_14:10:10.35595 2017/05/08 14:10:10 container <a href="https://arvadosapi.com/qr1hi-dz642-tjezupwylh67k2b">qr1hi-dz642-tjezupwylh67k2b</a> is still in squeue after scancel
</pre>
<p>This is because once the channel is closed, it just keeps hitting scancel() with no delay in between. So I thought it could at least wait for the next HasUUID() broadcast.</p> Arvados - Bug #11626: [Crunch2] Still not propagating slurm errors to userhttps://dev.arvados.org/issues/11626?journal_id=515882017-05-10T19:04:20ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Target version</strong> changed from <i>2017-05-10 sprint</i> to <i>2017-05-24 sprint</i></li></ul> Arvados - Bug #11626: [Crunch2] Still not propagating slurm errors to userhttps://dev.arvados.org/issues/11626?journal_id=516642017-05-10T19:56:40ZTom Cleggtom@curii.com
<ul></ul><p>Peter Amstutz wrote:</p>
<blockquote>
<p>Hhe problem I was trying to solve was that the logs are full of this:</p>
<p>[...]</p>
<p>This is because once the channel is closed, it just keeps hitting scancel() with no delay in between. So I thought it could at least wait for the next HasUUID() broadcast.</p>
</blockquote>
<p>It looks to me like there's a 1-second delay. This delay is inserted by scancel(). The reason it keeps hitting scancel() is that the mocked squeue keeps reporting that the slurm job is still running.</p>
<p>I think the correct way to get rid of those messages would be to fix the squeue mock, so it mocks the behavior we're hoping to test against, instead of a slurm failure mode that makes us (correctly) produce a bunch of debug logs.</p> Arvados - Bug #11626: [Crunch2] Still not propagating slurm errors to userhttps://dev.arvados.org/issues/11626?journal_id=516662017-05-10T20:04:41ZTom Cleggtom@curii.com
<ul></ul><p>oops, just noticed</p>
<p>log.Printf(text)</p>
<p>should be</p>
<p>log.Print(text)</p>
<p>Other than that, LGTM @ <a class="changeset" title="11626: Log sbatch errors where the user can see them. Fix tests so there is no lingering gorouti..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f8675ad473b45387b1286c6b7a41edf36148ebac">f8675ad473b45387b1286c6b7a41edf36148ebac</a></p>
<p>(as tempting as it is to ask for an error check on disp.Arv.Create() so we can log "error logging error: ...")</p> Arvados - Bug #11626: [Crunch2] Still not propagating slurm errors to userhttps://dev.arvados.org/issues/11626?journal_id=516742017-05-10T21:21:44ZTom Cleggtom@curii.com
<ul></ul><p>I missed that the example logs above are from a real site, not the test suite. So the mystery is why scancel wasn't effective.</p>
<p>AFAICT crunch-dispatch-slurm was doing the right thing in the circumstances: keep hitting scancel, with a ≥1 second delay between attempts, until the job goes away.</p>
<p>In any case it seems like a separate topic from "propagate slurm errors to user", because at this point the container is already done and the requesting user probably has little interest in (and ability to fix) slurm problems that happen later. So I think if this scancel behavior needs attention, we should <del>open a new issue.</del> discuss further on <a class="issue tracker-1 status-2 priority-4 priority-default" title="Bug: [Crunch2] crunch-dispatch-slurm hits scancel too much (In Progress)" href="https://dev.arvados.org/issues/11669">#11669</a>.</p> Arvados - Bug #11626: [Crunch2] Still not propagating slurm errors to userhttps://dev.arvados.org/issues/11626?journal_id=516822017-05-11T14:25:05ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>Applied in changeset arvados|commit:a309f8c5b4842075d6c83f99a8f2a1e1016976f5.</p>