https://dev.arvados.org/https://dev.arvados.org/favicon.ico?15576888422023-01-19T20:01:24ZArvadosArvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1101592023-01-19T20:01:24ZTom Cleggtom@curii.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-6 status-2 priority-4 priority-default behind-schedule" href="/issues/18179">Idea #18179</a>: Better spot instance support</i> added</li></ul> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1101832023-01-20T15:20:16ZTom Cleggtom@curii.com
<ul><li><strong>Has duplicate</strong> <i><a class="issue tracker-2 status-7 priority-4 priority-default closed" href="/issues/19964">Feature #19964</a>: Check for spot instance interruption notices</i> added</li></ul> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1102592023-01-25T17:04:49ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Blocks</strong> <i><a class="issue tracker-2 status-2 priority-4 priority-default parent" href="/issues/19982">Feature #19982</a>: Ability to know when a container died because of spot instance reclamation and option to resubmit</i> added</li></ul> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1102682023-01-25T17:10:18ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Description</strong> updated (<a title="View differences" href="/journals/110268/diff?detail_id=107030">diff</a>)</li></ul> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1102692023-01-25T18:11:17ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Target version</strong> changed from <i>To be scheduled</i> to <i>2023-02-15 sprint</i></li></ul> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1105942023-02-01T16:49:15ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Target version</strong> changed from <i>2023-02-15 sprint</i> to <i>2023-03-01 sprint</i></li></ul> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1105952023-02-01T16:52:13ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Target version</strong> changed from <i>2023-03-01 sprint</i> to <i>2023-02-15 sprint</i></li><li><strong>Assigned To</strong> set to <i>Tom Clegg</i></li></ul> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1109922023-02-13T21:21:05ZTom Cleggtom@curii.com
<ul></ul><p>Having crunch-run send SIGUSR1 to itself seems like an awkward way to trigger updateLogs, but I was hoping to avoid conflict with <a class="issue tracker-2 status-3 priority-4 priority-default closed parent" title="Feature: crunch-run tracks maximum usage of each crunchstat metric (Resolved)" href="https://dev.arvados.org/issues/19986">#19986</a>. Maybe revisit this part after <a class="issue tracker-2 status-3 priority-4 priority-default closed parent" title="Feature: crunch-run tracks maximum usage of each crunchstat metric (Resolved)" href="https://dev.arvados.org/issues/19986">#19986</a> merges.</p>
<p>Gating this on Driver=="ec2" also seems a little inelegant. Not sure whether it deserves better.</p>
<p>19961-spot-interruption @ <a class="changeset" title="19961: Detect and log EC2 spot interruption notices. Arvados-DCO-1.1-Signed-off-by: Tom Clegg <t..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/7a0626a6ca34771ffd45d2b669a39690acf9a8b0">7a0626a6ca34771ffd45d2b669a39690acf9a8b0</a> -- <a class="external" href="https://ci.arvados.org/job/developer-run-tests/3490/">developer-run-tests: #3490 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&build=3490" alt="" /></a></p> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1109932023-02-13T21:21:12ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>In Progress</i></li></ul> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1124322023-02-15T16:33:49ZTom Cleggtom@curii.com
<ul><li><strong>Target version</strong> changed from <i>2023-02-15 sprint</i> to <i>2023-03-01 sprint</i></li></ul> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1125532023-02-16T22:00:50ZTom Cleggtom@curii.com
<ul></ul><p>19961-spot-interruption @ <a class="changeset" title="19961: Fix races in tests. Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@curii.com>" href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/9fa741f376df64f11b3af4e3fb7e4d30bcba795f">9fa741f376df64f11b3af4e3fb7e4d30bcba795f</a> -- <a class="external" href="https://ci.arvados.org/job/developer-run-tests/3496/">developer-run-tests: #3496 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&build=3496" alt="" /></a></p> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1125742023-02-18T23:33:40ZBrett Smithbrett.smith@curii.com
<ul></ul><p>Story says "should have an easy, documented way determining programmatically that this happened" but there's no documentation in the branch. I think it would suffice to note how we update <code>runtime_status</code> when we detect spot instance termination. I admit I'm ambivalent about where. That section of the containers API reference? A quick note about it in our "Using preemptible instances" admin guide would be a nice affordance too.</p>
<p>I feel uncomfortable with the "give up after 3 failures" retry logic. I can imagine a scenario where a long-running spot instance just happens to have three failures over time from network hiccups or other small gremlins like that and then stops monitoring even with plenty of runtime left. How do you feel about making it "give up after 5 consecutive failures" or similar? I feel like that would still cover the kinds of problems we're actually worried about. I'm open to other approaches too but that's easy to implement, just <code>failures = 0</code> in the success branch.</p>
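<p>As a rough illustration of the "consecutive failures" idea (a minimal sketch, not the branch's actual code): the names <code>monitor</code>, <code>check</code>, and <code>MAX_CONSECUTIVE_FAILURES</code> are hypothetical, and the polling delay is omitted so the logic is easy to follow.</p>

```python
from typing import Callable

# Hypothetical threshold, taken from the "5 consecutive failures" suggestion.
MAX_CONSECUTIVE_FAILURES = 5


def monitor(check: Callable[[], bool], max_checks: int) -> str:
    """Call check() up to max_checks times.

    check() is assumed to return True when an interruption notice was found,
    False when there is none, and to raise OSError when the request failed.
    """
    failures = 0
    for _ in range(max_checks):
        try:
            if check():
                return "notice"
            failures = 0  # any successful check resets the streak
        except OSError:
            failures += 1
            if failures >= MAX_CONSECUTIVE_FAILURES:
                return "gave up"
    return "no notice"
```

<p>With this shape, scattered network hiccups over a long run never add up to giving up; only an unbroken streak of failures does.</p>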
<p>It would be nice to see a test for a case where we give up completely.</p>
<p>I'm not wild about the way we get a fresh API token for every check. My main concern is security: creating all these relatively long-lived tokens opens more avenues for token theft. My secondary concern is that making two HTTP requests for every check doubles our chances of transient failure.</p>
<p>If you agree with some adjustment to the failure count logic, an easy implementation that wouldn't require any locking or anything is: <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html" class="external">the API returns a 401 if your token is invalid for any reason</a>. <code>check</code> could first access the endpoint, and in case of 401, issue a request for a new token, and if that succeeds, <em>still just return the 401 error</em>. In normal operation, it should be transient, and it won't affect the outer loop for long.</p>
<p>If you don't agree this needs addressing, I would accept a version of the current code that cuts the token TTL way down. I don't see any reason it would help for the TTL to be much longer than the deadline we set on the context, and this would at least address the security concern.</p>
<p>Thanks.</p> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1126072023-02-22T14:27:08ZTom Cleggtom@curii.com
<ul></ul><p>Docs: I added a bit in "using preemptible instances" with an example log entry. The containers API page doesn't (yet?) have a list of specific values that can appear in <code>runtime_status</code>. The log entry is probably the one cwl-runner should look for, given that the <code>runtime_status</code> warning can potentially get overwritten by other warnings.</p>
<p>Giving up: Now we give up after 5 consecutive failures, I agree that makes much more sense. There's a test for the "give up" case, and occasional failures in the success test case.</p>
<p>API token: Now we ask for a new token only</p>
<ul>
<li>on the first check,</li>
<li>when the current token is about to expire (assuming the TTL we requested is the real TTL), and</li>
<li>when the previous request returned 401</li>
</ul>
<p>This should give us some logging evidence (but still function) if TTL doesn't work the way we expect.</p>
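<p>The three conditions above could be expressed roughly like this. It is only a sketch under assumptions: <code>need_new_token</code>, its parameters, and the 5-second <code>margin</code> are made up for illustration and are not taken from the branch.</p>

```python
from typing import Optional


def need_new_token(token: Optional[str], expires_at: float, now: float,
                   last_status: Optional[int], margin: float = 5.0) -> bool:
    """Decide whether to request a fresh metadata token before the next check.

    expires_at is computed from the TTL we requested, which is assumed
    (not verified) to be the real TTL. margin is a hypothetical safety window.
    """
    if token is None:
        return True  # first check: we have no token yet
    if now >= expires_at - margin:
        return True  # current token is about to expire
    return last_status == 401  # previous request was rejected
```

<p>If the TTL does not behave as expected, the third condition still recovers, and the unexpected 401 leaves a trace in the logs.</p>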
<p>19961-spot-interruption @ <a class="changeset" title="19961: Mention interrupt handling on admin doc page. Arvados-DCO-1.1-Signed-off-by: Tom Clegg <t..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/475dc10274ca275966aa6eefc25b8932cc4f3957">475dc10274ca275966aa6eefc25b8932cc4f3957</a> -- <a class="external" href="https://ci.arvados.org/view/Developer/job/developer-run-tests/3507/">developer-run-tests: #3507 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&build=3507" alt="" /></a></p> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1126102023-02-22T17:07:05ZBrett Smithbrett.smith@curii.com
<ul></ul><p>Tom Clegg wrote in <a href="#note-13">#note-13</a>:</p>
<blockquote>
<p>Docs: I added a bit in "using preemptible instances" with an example log entry. The containers API page doesn't (yet?) have a list of specific values that can appear in <code>runtime_status</code>. The log entry is probably the one cwl-runner should look for, given that the <code>runtime_status</code> warning can potentially get overwritten by other warnings.</p>
</blockquote>
<p>Well, okay, what I'm about to talk about is clearly an "insufficiently groomed story" problem more than a branch problem. Bearing that in mind: the story says we should provide an "easy, documented way determining programmatically that this happened." AWS gives us a documented JSON endpoint with two fields that announces their intention to take action N at time T. I think that's the level of "easy" we should strive for. In the context of Arvados, I think that means, container records have a field that briefly, predictably documents instance interference from the cloud provider, if Crunch noticed any. The reason we want that is so users can make informed decisions about whether they would like to programmatically retry a container, and how.</p>
<p>A dedicated JSON field is easy because you can check it with curl and jq, no custom tooling needed. By contrast, searching for a line in the container log means you need to also fetch the collection record, parse the manifest, download one or more data blocks from Keep, and grab potentially tens of megabytes of data to search for a single line. Realistically this will probably require its own little script or function, supported by either the Arvados SDK, CLI tools, or maybe arv-mount (and all the support <em>that</em> entails, like a user with FUSE permission).</p>
<p>Is there some way we can make this information easier to query without major ticket scope creep? I realize this touches on general API considerations that we should've addressed first. But if it's doable to just carve out a new runtime status field that's dedicated to this, that would be great. Or something similar to that. But if it's too much too late, I understand.</p> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1126462023-02-23T18:34:11ZPeter Amstutzpeter.amstutz@curii.com
<ul></ul><p>From discussion: add a new key to <code>runtime_status</code> called something like <code>preemption_notice</code> and document it on the <code>containers</code> API page. If this is non-empty, it means the instance got a preemption notice. The exact format of the value does not need to be defined. A notice should also be added to <code>warnings</code> so that it is visible on Workbench.</p> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1126472023-02-23T18:53:41ZPeter Amstutzpeter.amstutz@curii.com
<ul><li><strong>Release</strong> set to <i>57</i></li></ul> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1126562023-02-23T22:02:35ZTom Cleggtom@curii.com
<ul></ul><p>19961-spot-interruption @ <a class="changeset" title="19961: Save separate preemptionNotice key in runtime_status. Arvados-DCO-1.1-Signed-off-by: Tom ..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/ae6fe3864ca6b254dfa3345985568c1cc94358fe">ae6fe3864ca6b254dfa3345985568c1cc94358fe</a> -- <a class="external" href="https://ci.arvados.org/view/Developer/job/developer-run-tests/3511/">developer-run-tests: #3511 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&build=3511" alt="" /></a></p>
<p>Adds <code>preemptionNotice</code> key to <code>runtime_status</code>, with the same content as <code>warningDetail</code>. (Is this useful, or merely repetitive?)</p> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1126952023-02-27T15:00:01ZBrett Smithbrett.smith@curii.com
<ul></ul><p>Tom Clegg wrote in <a href="#note-17">#note-17</a>:</p>
<blockquote>
<p>19961-spot-interruption @ <a class="changeset" title="19961: Save separate preemptionNotice key in runtime_status. Arvados-DCO-1.1-Signed-off-by: Tom ..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/ae6fe3864ca6b254dfa3345985568c1cc94358fe">ae6fe3864ca6b254dfa3345985568c1cc94358fe</a> -- <a class="external" href="https://ci.arvados.org/view/Developer/job/developer-run-tests/3511/">developer-run-tests: #3511 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&build=3511" alt="" /></a></p>
</blockquote>
<p>Thank you for sticking through all the late changes. I especially appreciate that it still felt easy to review even with all that.</p>
<p>I think it would be helpful if the documentation were explicit that checking <code>preemptionNotice</code> is the best way to check whether your instance was preempted. Maybe add a sentence like this to the end of the first paragraph:</p>
<blockquote>
<p>An API client that wants to detect whether or not a container was preempted should check whether <code>runtime_status</code> has a <code>preemptionNotice</code> set.</p>
</blockquote>
<p>The reference says:</p>
<blockquote>
<p>Indication that the preemptible instance where the container is running will be terminated soon.</p>
</blockquote>
<p>I think this has two minor inaccuracies:</p>
<ul>
<li>Since "hibernate" is a possible action, I think this covers more than just "terminated." </li>
<li>Since people can view the container record after the fact, this might be old information, not just "soon."</li>
</ul>
<p>What about:</p>
<blockquote>
<p>Details about any cloud provider scheduled interruption to the spot instance running this container</p>
</blockquote>
<p>In general, I wonder why our documentation generally talks about "spot instances" while our API talks about "preemptible instances." Do you know, is that just sort of a historical accident, or are we getting at some fine distinction here, or what? (I had the thought we chose "preemptible instances" as a provider-agnostic name, but a quick search suggests all three major providers call them "spot instances.")</p>
<p>The recorded message feels a little verbose. It's fine, I won't hold up a merge on this, but as a suggestion, how would you feel about:</p>
<blockquote>
<p>Cloud provider scheduled instance %s at %s</p>
</blockquote>
<p>filled in with the action and RFC3339 timestamp, as now.</p> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1127012023-02-27T19:49:19ZTom Cleggtom@curii.com
<ul></ul><p>Brett Smith wrote in <a href="#note-18">#note-18</a>:</p>
<blockquote><blockquote>
<p>An API client that wants to detect whether or not a container was preempted should check whether <code>runtime_status</code> has a <code>preemptionNotice</code> set.</p>
</blockquote></blockquote>
<p>Good point. How about this? (I added it to the runtime_status section of the containers API page since it seems more like API client advice than install/admin advice.)</p>
<p>"Existence of this key indicates the container likely was (or will soon be) <code>Cancelled</code> due to an instance interruption."</p>
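<p>For an API client, the check might look something like the sketch below. It assumes the container record has already been fetched and decoded into a dict; the helper name <code>preemption_notice</code> and the sample record are hypothetical, and the exact format of the value is deliberately not relied on.</p>

```python
from typing import Optional


def preemption_notice(container: dict) -> Optional[str]:
    """Return the preemptionNotice value from runtime_status, or None if absent.

    Only the existence of the key is treated as meaningful.
    """
    runtime_status = container.get("runtime_status") or {}
    return runtime_status.get("preemptionNotice")


# Hypothetical, trimmed-down container records for illustration only.
interrupted = {"state": "Cancelled",
               "runtime_status": {"preemptionNotice": "example notice text"}}
normal = {"state": "Complete", "runtime_status": {}}

if preemption_notice(interrupted) is not None:
    print("container was likely interrupted; consider resubmitting")
```

<p>A client deciding whether to retry would then branch on whether the key exists rather than parsing log files.</p>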
<blockquote><blockquote>
<p>Details about any cloud provider scheduled interruption to the spot instance running this container</p>
</blockquote></blockquote>
<p>Yes. Changed to that.</p>
<blockquote>
<p>In general, I wonder why our documentation generally talks about "spot instances" while our API talks about "preemptible instances." Do you know, is that just sort of a historical accident, or are we getting at some fine distinction here, or what? (I had the thought we chose "preemptible instances" as a provider-agnostic name, but a quick search suggests all three major providers call them "spot instances.")</p>
</blockquote>
<p>When we started this feature, Google had preemptible instances, so it was the more descriptive/generic term. Google has since introduced spot instances, which (unlike Google preemptible instances) are allowed to run more than 24h and are therefore probably more suitable for Arvados. So maybe "preemptible" is a more descriptive/generic term, and maybe it was a mistake to pay any attention to the fluid namescape of Google. We still don't have a GCP driver.</p>
<blockquote><blockquote>
<p>Cloud provider scheduled instance %s at %s</p>
</blockquote></blockquote>
<p>Yes. Changed to that.</p>
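<p>To make the message format concrete, here is a small sketch of filling it in from an EC2 <code>instance-action</code> document (the two-field JSON with <code>action</code> and <code>time</code>). The function name <code>notice_text</code> and the sample input are illustrative assumptions, not the code in the branch.</p>

```python
import json
from datetime import datetime, timezone


def notice_text(instance_action_json: str) -> str:
    """Build the notice message from an instance-action JSON document."""
    doc = json.loads(instance_action_json)
    # Parse the timestamp to validate it, then re-emit it as RFC 3339 in UTC.
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return "Cloud provider scheduled instance %s at %s" % (
        doc["action"], when.strftime("%Y-%m-%dT%H:%M:%SZ"))


# Hypothetical sample input.
sample = '{"action": "stop", "time": "2023-02-27T19:49:19Z"}'
print(notice_text(sample))
```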
<p>19961-spot-interruption @ <a class="changeset" title="19961: Update preemptionNotice text. Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@curii.com>" href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f99665a0737831ba53b6512fd50f1c25e386a604">f99665a0737831ba53b6512fd50f1c25e386a604</a> -- <a class="external" href="https://ci.arvados.org/job/developer-run-tests/3512/">developer-run-tests: #3512 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&build=3512" alt="" /></a></p> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1127032023-02-27T22:12:27ZBrett Smithbrett.smith@curii.com
<ul></ul><p>Tom Clegg wrote in <a href="#note-19">#note-19</a>:</p>
<blockquote>
<p>Good point. How about this?</p>
</blockquote>
<p>Yeah that makes sense to me.</p>
<blockquote>
<p>19961-spot-interruption @ <a class="changeset" title="19961: Update preemptionNotice text. Arvados-DCO-1.1-Signed-off-by: Tom Clegg <tom@curii.com>" href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/f99665a0737831ba53b6512fd50f1c25e386a604">f99665a0737831ba53b6512fd50f1c25e386a604</a> -- <a class="external" href="https://ci.arvados.org/job/developer-run-tests/3512/">developer-run-tests: #3512 <img src="https://ci.arvados.org/buildStatus/icon?job=developer-run-tests&build=3512" alt="" /></a></p>
</blockquote>
<p>Looks good to me. Thanks again.</p> Arvados - Feature #19961: Detect and log spot instance interruption noticeshttps://dev.arvados.org/issues/19961?journal_id=1127952023-02-28T21:04:17ZTom Cleggtom@curii.com
<ul><li><strong>Status</strong> changed from <i>In Progress</i> to <i>Resolved</i></li></ul><p>Applied in changeset <a class="changeset" title="Merge branch '19961-spot-interruption' closes #19961 Arvados-DCO-1.1-Signed-off-by: Tom Clegg <..." href="https://dev.arvados.org/projects/arvados/repository/arvados/revisions/1071f4f96fcb2084424c4b29dd5915880c650254">arvados|1071f4f96fcb2084424c4b29dd5915880c650254</a>.</p>