Feature #19961

Detect and log spot instance interruption notices

Added by Tom Clegg about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assigned To: Tom Clegg
Category: Crunch
Target version: 2023-03-01 sprint
Story points: 1.0
Release relationship: Auto

Description

When running on a spot instance, crunch-run should check for spot instance interruption notices (AWS recommends checking every 5s; see the sketch after this list) and, when an interruption notice is detected,
  • mention it in the crunch-run.txt log file
  • mention it in the runtime_status container field
  • save the log collection
  • update the container record with the new log pdh
  • continue running (perhaps the container will finish before the instance gets shut down)
  • provide an easy, documented way of determining programmatically that this happened
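
A rough sketch of the kind of polling check described above, using the instance-metadata endpoint AWS documents for spot interruption notices; the function names and logging here are illustrative, not crunch-run's actual code.

```go
// Illustrative polling sketch: query the EC2 instance metadata service
// (IMDSv2) for a spot interruption notice every 5 seconds.  The endpoint and
// JSON fields are the ones AWS documents; everything else is placeholder.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type instanceAction struct {
	Action string    `json:"action"` // "stop", "terminate", or "hibernate"
	Time   time.Time `json:"time"`   // when the cloud provider will act
}

func checkSpotInterruption(client *http.Client, imdsToken string) (*instanceAction, error) {
	req, err := http.NewRequest("GET", "http://169.254.169.254/latest/meta-data/spot/instance-action", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-aws-ec2-metadata-token", imdsToken)
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotFound {
		return nil, nil // no interruption notice pending
	}
	var ia instanceAction
	if err := json.NewDecoder(resp.Body).Decode(&ia); err != nil {
		return nil, err
	}
	return &ia, nil
}

func main() {
	client := &http.Client{Timeout: 3 * time.Second}
	imdsToken := "..." // IMDSv2 session token, obtained separately via PUT /latest/api/token
	for range time.Tick(5 * time.Second) {
		ia, err := checkSpotInterruption(client, imdsToken)
		if err != nil {
			log.Printf("spot interruption check failed: %s", err)
			continue
		}
		if ia != nil {
			log.Printf("spot interruption notice: action %q at %s", ia.Action, ia.Time.Format(time.RFC3339))
			// ...log it, update runtime_status, save the log collection, keep running...
		}
	}
}
```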

Subtasks 1 (0 open, 1 closed)

Task #20051: Review 19961-spot-interruption (Resolved, Tom Clegg, 02/16/2023)

Related issues

Related to Arvados Epics - Idea #18179: Better spot instance support (In Progress, 03/01/2022 to 03/31/2024)
Has duplicate Arvados - Feature #19964: Check for spot instance interruption notices (Duplicate)
Blocks Arvados - Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit (In Progress, Alex Coleman)
#1

Updated by Tom Clegg about 1 year ago

  • Related to Idea #18179: Better spot instance support added
#2

Updated by Tom Clegg about 1 year ago

  • Has duplicate Feature #19964: Check for spot instance interruption notices added
#3

Updated by Peter Amstutz about 1 year ago

  • Blocks Feature #19982: Ability to know when a container died because of spot instance reclamation and option to resubmit added
#4

Updated by Peter Amstutz about 1 year ago

  • Description updated (diff)
#5

Updated by Peter Amstutz about 1 year ago

  • Target version changed from To be scheduled to 2023-02-15 sprint
#6

Updated by Peter Amstutz about 1 year ago

  • Target version changed from 2023-02-15 sprint to 2023-03-01 sprint
#7

Updated by Peter Amstutz about 1 year ago

  • Target version changed from 2023-03-01 sprint to 2023-02-15 sprint
  • Assigned To set to Tom Clegg
#8

Updated by Tom Clegg about 1 year ago

Having crunch-run send SIGUSR1 to itself seems like an awkward way to trigger updateLogs, but I was hoping to avoid conflict with #19986. Maybe revisit this part after #19986 merges.
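
A minimal illustration of the self-signal pattern being described, assuming (as the comment implies) a SIGUSR1 handler already wired up to update the log collection; the names and messages are placeholders, not crunch-run's code.

```go
// Sketch: an existing SIGUSR1 handler updates/saves the logs, so the
// interruption-notice code raises that signal on its own process instead of
// calling updateLogs directly.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigchan := make(chan os.Signal, 1)
	signal.Notify(sigchan, syscall.SIGUSR1)
	go func() {
		for range sigchan {
			log.Print("SIGUSR1: update/save the log collection here")
		}
	}()

	// Elsewhere (e.g. after detecting an interruption notice), trigger the
	// same code path by signalling our own process:
	syscall.Kill(os.Getpid(), syscall.SIGUSR1)

	time.Sleep(100 * time.Millisecond) // give the handler a moment in this toy example
}
```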

Gating this on Driver=="ec2" also seems a little inelegant. Not sure whether it deserves better.

19961-spot-interruption @ 7a0626a6ca34771ffd45d2b669a39690acf9a8b0 -- developer-run-tests: #3490

#9

Updated by Tom Clegg about 1 year ago

  • Status changed from New to In Progress
#10

Updated by Tom Clegg about 1 year ago

  • Target version changed from 2023-02-15 sprint to 2023-03-01 sprint
#12

Updated by Brett Smith about 1 year ago

Story says "should have an easy, documented way determining programmatically that this happened" but there's no documentation in the branch. I think it would suffice to note how we update runtime_status when we detect spot instance termination. I admit I'm ambivalent about where. That section of the containers API reference? A quick note about it in our "Using preemptible instances" admin guide would be a nice affordance too.

I feel uncomfortable with the "give up after 3 failures" retry logic. I can imagine a scenario where a long-running spot instance just happens to have three failures over time from network hiccups or other small gremlins like that and then stops monitoring even with plenty of runtime left. How do you feel about making it "give up after 5 consecutive failures" or similar? I feel like that would still cover the kinds of problems we're actually worried about. I'm open to other approaches too, but that one is easy to implement: just reset failures = 0 in the success branch.
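
A sketch of the suggested counting logic, with a hypothetical check function and interval; this is not the branch's actual code.

```go
// "Give up after N consecutive failures": the counter resets on every
// success, so occasional hiccups over a long run never accumulate.
package main

import (
	"log"
	"time"
)

const maxConsecutiveFailures = 5

func monitor(check func() error) {
	failures := 0
	for range time.Tick(5 * time.Second) {
		if err := check(); err != nil {
			failures++
			log.Printf("interruption check failed (%d consecutive): %s", failures, err)
			if failures >= maxConsecutiveFailures {
				log.Print("giving up on interruption checks")
				return
			}
			continue
		}
		failures = 0 // success: reset the counter
	}
}

func main() {
	monitor(func() error { return nil }) // dummy check that always succeeds; runs until failures accumulate
}
```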

It would be nice to see a test for a case where we give up completely.

I'm not wild about the way we get a fresh API token for every check. My main concern is security: creating all these relatively long-lived tokens opens more avenues for token theft. My secondary concern is that making two HTTP requests for every check doubles our chances of transient failure.

If you agree with some adjustment to the failure count logic, an easy implementation that wouldn't require any locking or anything is: the API returns a 401 if your token is invalid for any reason. check could first access the endpoint, and in case of 401, issue a request for a new token, and if that succeeds, still just return the 401 error. In normal operation, it should be transient, and it won't affect the outer loop for long.

If you don't agree this needs addressing, I would accept a version of the current code that cuts the token TTL way down. I don't see any reason it would help for the TTL to be much longer than the deadline we set on the context, and this would at least address the security concern.

Thanks.

#13

Updated by Tom Clegg about 1 year ago

Docs: I added a bit in "using preemptible instances" with an example log entry. The containers API page doesn't (yet?) have a list of specific values that can appear in runtime_status. The log entry is probably the one cwl-runner should look for, given that the runtime_status warning can potentially get overwritten by other warnings.

Giving up: Now we give up after 5 consecutive failures; I agree that makes much more sense. There's a test for the "give up" case, and occasional failures in the success test case.

API token: Now we ask for a new token only
  • on the first check,
  • when the current token is about to expire (assuming the TTL we requested is the real TTL), and
  • when the previous request returned 401

This should give us some logging evidence (while still functioning) if the TTL doesn't work the way we expect.
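
A sketch of that refresh policy with hypothetical names (not the branch's code): reuse one scoped token, and only request a new one in the three cases listed above.

```go
// Token is refreshed only on the first check, when the current one is near
// the TTL we asked for, or after the previous request came back 401.
package main

import (
	"fmt"
	"net/http"
	"time"
)

type tokenSource struct {
	newToken func(ttl time.Duration) (string, error) // stand-in for the "create scoped token" API call

	token   string
	expires time.Time
	got401  bool
}

// current returns the token to use for the next check, refreshing it only
// when one of the three conditions applies.
func (ts *tokenSource) current(ttl time.Duration) (string, error) {
	if ts.token == "" || ts.got401 || time.Until(ts.expires) < ttl/2 {
		tok, err := ts.newToken(ttl)
		if err != nil {
			return "", err
		}
		ts.token, ts.expires, ts.got401 = tok, time.Now().Add(ttl), false
	}
	return ts.token, nil
}

// sawStatus records the last HTTP status so a 401 forces a refresh next time
// (and, if 401s persist, the repeated refreshes show up in the logs).
func (ts *tokenSource) sawStatus(code int) {
	ts.got401 = code == http.StatusUnauthorized
}

func main() {
	ts := &tokenSource{newToken: func(ttl time.Duration) (string, error) { return "stub-token", nil }}
	tok, err := ts.current(10 * time.Minute)
	fmt.Println(tok, err)
}
```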

19961-spot-interruption @ 475dc10274ca275966aa6eefc25b8932cc4f3957 -- developer-run-tests: #3507

#14

Updated by Brett Smith about 1 year ago

Tom Clegg wrote in #note-13:

Docs: I added a bit in "using preemptible instances" with an example log entry. The containers API page doesn't (yet?) have a list of specific values that can appear in runtime_status. The log entry is probably the one cwl-runner should look for, given that the runtime_status warning can potentially get overwritten by other warnings.

Well, okay, what I'm about to talk about is clearly an "insufficiently groomed story" problem more than a branch problem. Bearing that in mind: the story says we should provide an "easy, documented way determining programmatically that this happened." AWS gives us a documented JSON endpoint with two fields that announces their intention to take action N at time T. I think that's the level of "easy" we should strive for. In the context of Arvados, I think that means, container records have a field that briefly, predictably documents instance interference from the cloud provider, if Crunch noticed any. The reason we want that is so users can make informed decisions about whether they would like to programmatically retry a container, and how.

A dedicated JSON field is easy because you can check it with curl and jq, no custom tooling needed. By contrast, searching for a line in the container log means you need to also fetch the collection record, parse the manifest, download one or more data blocks from Keep, and grab potentially tens of megabytes of data to search for a single line. Realistically this will probably require its own little script or function, supported by either the Arvados SDK, CLI tools, or maybe arv-mount (and all the support that entails, like a user with FUSE permission).
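
For a sense of what such a check looks like, here is a sketch against the containers API, assuming a dedicated runtime_status key (using the preemptionNotice name that is eventually adopted below); the cluster host, token, and container UUID are placeholders.

```go
// Fetch one container record and report whether a preemption notice was
// recorded in runtime_status.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func preemptionNotice(host, token, containerUUID string) (string, error) {
	req, err := http.NewRequest("GET", "https://"+host+"/arvados/v1/containers/"+containerUUID, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var container struct {
		RuntimeStatus map[string]interface{} `json:"runtime_status"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&container); err != nil {
		return "", err
	}
	notice, _ := container.RuntimeStatus["preemptionNotice"].(string)
	return notice, nil // empty string: no preemption notice was recorded
}

func main() {
	// Placeholder cluster host, token, and container UUID.
	notice, err := preemptionNotice("zzzzz.example.com", "xxx-token-xxx", "zzzzz-dz642-xxxxxxxxxxxxxxx")
	if err != nil {
		panic(err)
	}
	if notice != "" {
		fmt.Println("container was preempted:", notice)
	}
}
```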

Is there some way we can make this information easier to query without major ticket scope creep? I realize this touches on general API considerations that we should've addressed first. But if it's doable to just carve out a new runtime status field that's dedicated to this, that would be great. Or something similar to that. But if it's too much too late, I understand.

#15

Updated by Peter Amstutz about 1 year ago

From discussion: add a new key to runtime_status called something like preemption_notice and document it on the containers API page. If this is non-empty, it means the instance got a preemption notice. The exact format of the value does not need to be defined. A notice should also be added to warnings so that it is visible on Workbench.

#16

Updated by Peter Amstutz about 1 year ago

  • Release set to 57
#17

Updated by Tom Clegg about 1 year ago

19961-spot-interruption @ ae6fe3864ca6b254dfa3345985568c1cc94358fe -- developer-run-tests: #3511

Adds preemptionNotice key to runtime_status, with the same content as warningDetail. (Is this useful, or merely repetitive?)

#18

Updated by Brett Smith about 1 year ago

Tom Clegg wrote in #note-17:

19961-spot-interruption @ ae6fe3864ca6b254dfa3345985568c1cc94358fe -- developer-run-tests: #3511

Thank you for sticking through all the late changes. I especially appreciate that it still felt easy to review even with all that.

I think it would be helpful if the documentation was explicit that checking preemptionNotice is the best way to check if your instance was preempted. Maybe add a sentence like this to the end of the first paragraph:

An API client that wants to detect whether or not a container was preempted should check whether runtime_status has a preemptionNotice set.

The reference says:

Indication that the preemptible instance where the container is running will be terminated soon.

I think this has two minor inaccuracies:

  • Since "hibernate" is a possible action, I think this covers more than just "terminated."
  • Since people can view the container record after the fact, this might be old information, not just "soon."

What about:

Details about any cloud provider scheduled interruption to the spot instance running this container

In general, I wonder why our documentation generally talks about "spot instances" while our API talks about "preemptible instances." Do you know, is that just sort of a historical accident, or are we getting at some fine distinction here, or what? (I had the thought we chose "preemptible instances" as a provider-agnostic name, but a quick search suggests all three major providers call them "spot instances.")

The recorded message feels a little verbose. It's fine, I won't hold up a merge on this, but as a suggestion, how would you feel about:

Cloud provider scheduled instance %s at %s

filled in with the action and RFC3339 timestamp, as now.
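
For concreteness, an illustrative rendering of that suggestion with the two placeholders filled in as described; the action word and timestamp here are made up.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	action := "terminate" // or "stop" / "hibernate", from the interruption notice
	when := time.Date(2023, time.February, 21, 15, 4, 5, 0, time.UTC)
	fmt.Printf("Cloud provider scheduled instance %s at %s\n", action, when.Format(time.RFC3339))
	// prints: Cloud provider scheduled instance terminate at 2023-02-21T15:04:05Z
}
```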

#19

Updated by Tom Clegg about 1 year ago

Brett Smith wrote in #note-18:

An API client that wants to detect whether or not a container was preempted should check whether runtime_status has a preemptionNotice set.

Good point. How about this? (I added it to the runtime_status section of the containers API page since it seems more like API client advice than install/admin advice.)

"Existence of this key indicates the container likely was (or will soon be) Cancelled due to an instance interruption."

Details about any cloud provider scheduled interruption to the spot instance running this container

Yes. Changed to that.

In general, I wonder why our documentation generally talks about "spot instances" while our API talks about "preemptible instances." Do you know, is that just sort of a historical accident, or are we getting at some fine distinction here, or what? (I had the thought we chose "preemptible instances" as a provider-agnostic name, but a quick search suggests all three major providers call them "spot instances.")

When we started this feature, Google had preemptible instances, so it was the more descriptive/generic term. Google has since introduced spot instances, which (unlike Google preemptible instances) are allowed to run more than 24h and are therefore probably more suitable for Arvados. So, maybe "preemptible" is a more descriptive/generic term, and maybe it was a mistake to pay any attention to Google's fluid naming. We still don't have a GCP driver, in any case.

Cloud provider scheduled instance %s at %s

Yes. Changed to that.

19961-spot-interruption @ f99665a0737831ba53b6512fd50f1c25e386a604 -- developer-run-tests: #3512

#20

Updated by Brett Smith about 1 year ago

Tom Clegg wrote in #note-19:

Good point. How about this?

Yeah that makes sense to me.

19961-spot-interruption @ f99665a0737831ba53b6512fd50f1c25e386a604 -- developer-run-tests: #3512

Looks good to me. Thanks again.

#21

Updated by Tom Clegg about 1 year ago

  • Status changed from In Progress to Resolved