Bug #22454
Spot instance termination not being detected?
Description
We have code in crunch-run that is supposed to detect if a spot instance is going to be reclaimed.
I don't think I have ever seen a spot instance termination notice actually appear on a user cluster, which makes me suspicious that our current implementation doesn't actually work.
We're also occasionally seeing this error reported on failing container runs:
Error checking spot interruptions: Get "http://169.254.169.254/latest/meta-data/spot/instance-action": context deadline exceeded
I suggest that we test this ourselves:
Create a container on a spot instance (the cheapest possible) that just sleeps forever. Eventually it will be reclaimed, though that may take up to 24 hours.
For reference, here's the AWS page I found about it:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
Relatedly, I suspect that reclaimed instances show up as cancelled containers whose logs stop at "Creating Docker container", which is the point where saveLogCollection() gets called. It would be helpful if crunch-run called saveLogCollection() again right before Wait(), so that we know whether the container actually started.
Updated by Peter Amstutz 6 days ago
- Position changed from -933956 to -933944
- Status changed from New to In Progress
Updated by Tom Clegg 6 days ago
- Related to Bug #22434: "Error checking spot interruptions:" is actually a warning, text should reflect that added
Updated by Peter Amstutz about 22 hours ago
I intentionally submitted a long-running process to see if it gets terminated:
https://workbench.tordo.arvadosapi.com/processes/tordo-xvhdp-mg3xmorv582uyrf