Bug #22454

open

Spot instance termination not being detected?

Added by Peter Amstutz 6 days ago. Updated about 18 hours ago.

Status: New
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -

Description

We have code in crunch-run that is supposed to detect if a spot instance is going to be reclaimed.

I don't think I have ever seen a spot instance termination notice actually appear on a user cluster, which makes me suspicious that our current implementation doesn't actually work.

We're also seeing this being occasionally reported on failing container runs:

Error checking spot interruptions: Get "http://169.254.169.254/latest/meta-data/spot/instance-action": context deadline exceeded

I suggest that we test this ourselves:

Create a container on a spot instance (cheapest possible) that just sleeps forever. Eventually it will be reclaimed (may take up to 24 hours, though!)

For reference, here's the AWS page I found about it:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
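To make the check concrete, here is a minimal Go sketch of polling that endpoint. It is not the crunch-run code. The names checkSpotInterruption, parseInstanceAction and instanceAction are made up for illustration, the IMDSv2 token step is an assumption about how the instance is configured, and the 2 second timeout is arbitrary. Per the AWS page above, a 404 means no interruption is scheduled, and a 200 carries a small JSON body with "action" and "time".

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"
)

// imdsBase is the instance metadata address that appears in the error above.
const imdsBase = "http://169.254.169.254"

// instanceAction mirrors the JSON body documented for spot/instance-action.
// (Hypothetical type name, not taken from crunch-run.)
type instanceAction struct {
	Action string    `json:"action"`
	Time   time.Time `json:"time"`
}

// parseInstanceAction interprets one response from spot/instance-action.
// A 404 means no interruption is scheduled and yields (nil, nil).
func parseInstanceAction(status int, body []byte) (*instanceAction, error) {
	switch status {
	case http.StatusNotFound:
		return nil, nil
	case http.StatusOK:
		var ia instanceAction
		if err := json.Unmarshal(body, &ia); err != nil {
			return nil, fmt.Errorf("decoding instance-action: %w", err)
		}
		return &ia, nil
	default:
		return nil, fmt.Errorf("unexpected status %d from instance-action", status)
	}
}

// fetch sends one request with the given header and returns status and body.
func fetch(ctx context.Context, c *http.Client, method, url, hdr, val string) (int, []byte, error) {
	req, err := http.NewRequestWithContext(ctx, method, url, nil)
	if err != nil {
		return 0, nil, err
	}
	req.Header.Set(hdr, val)
	resp, err := c.Do(req)
	if err != nil {
		return 0, nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, body, err
}

// checkSpotInterruption gets an IMDSv2 session token (assumption: IMDSv2 is
// in use), then asks whether an interruption is scheduled.
func checkSpotInterruption(ctx context.Context, c *http.Client, base string) (*instanceAction, error) {
	status, token, err := fetch(ctx, c, http.MethodPut, base+"/latest/api/token",
		"X-aws-ec2-metadata-token-ttl-seconds", "60")
	if err != nil {
		return nil, err
	}
	if status != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d from token endpoint", status)
	}
	status, body, err := fetch(ctx, c, http.MethodGet, base+"/latest/meta-data/spot/instance-action",
		"X-aws-ec2-metadata-token", string(token))
	if err != nil {
		return nil, err
	}
	return parseInstanceAction(status, body)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	ia, err := checkSpotInterruption(ctx, http.DefaultClient, imdsBase)
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		fmt.Println("warning: metadata service did not answer in time:", err)
	case err != nil:
		fmt.Println("warning: could not check spot interruptions:", err)
	case ia == nil:
		fmt.Println("no interruption scheduled")
	default:
		fmt.Printf("instance will %s at %s\n", ia.Action, ia.Time.Format(time.RFC3339))
	}
}
```

Run off EC2, this only prints a warning, since nothing answers at that address. Run on the sleeping spot instance, it would show whether a notice ever becomes visible before the instance goes away. Keeping the timeout case separate from other errors makes it easier to tell "no notice" apart from "could not reach the metadata service".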

Relatedly, I suspect that reclaimed instances show up as cancelled containers whose logs only go up to "Creating Docker container", which is the point at which saveLogCollection() gets called. It would be helpful if crunch-run called saveLogCollection() again right before Wait(), so that the saved log tells us whether the container actually started.


Related issues 1 (0 open, 1 closed)

Related to Arvados - Bug #22434: "Error checking spot interruptions:" is actually a warning, text should reflect that (Resolved, Tom Clegg)

#1

Updated by Peter Amstutz 6 days ago

  • Position changed from -933956 to -933944
  • Status changed from New to In Progress
#2

Updated by Peter Amstutz 6 days ago

  • Description updated (diff)
#3

Updated by Peter Amstutz 6 days ago

  • Status changed from In Progress to New
#4

Updated by Peter Amstutz 6 days ago

  • Description updated (diff)
#5

Updated by Tom Clegg 6 days ago

  • Related to Bug #22434: "Error checking spot interruptions:" is actually a warning, text should reflect that added
#6

Updated by Peter Amstutz about 22 hours ago

I intentionally submitted a long-running process to see if it gets terminated:

https://workbench.tordo.arvadosapi.com/processes/tordo-xvhdp-mg3xmorv582uyrf

#7

Updated by Peter Amstutz about 22 hours ago

  • Assigned To set to Peter Amstutz