Bug #22454
Spot instance termination not being detected?
Description
We have code in crunch-run that is supposed to detect if a spot instance is going to be reclaimed.
I don't think I have ever seen a spot instance termination notice actually appear on a user cluster, which makes me suspicious that our current implementation doesn't actually work.
We also occasionally see this error reported on failing container runs:
Error checking spot interruptions: Get "http://169.254.169.254/latest/meta-data/spot/instance-action": context deadline exceeded
I suggest that we test this ourselves:
Create a container on a spot instance (cheapest possible) that just sleeps forever. Eventually it will be reclaimed (may take up to 24 hours, though!)
For reference, here's the AWS page I found about it:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
Relatedly, I suspect that reclaimed instances show up as cancelled containers whose logs stop at "Creating Docker container", which is the point where saveLogCollection() gets called. It would help if crunch-run called saveLogCollection() again right before Wait(), so that we know the container actually started.
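To illustrate the suggested extra checkpoint, here is a toy Go model. It is not crunch-run's code: fakeRunner, logf, run, and the "Starting container" message are invented for the sketch, and only the names saveLogCollection() and Wait() come from the description above. Wait() is made to fail the way an abruptly reclaimed instance might, to show which log lines survive.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeRunner is a stand-in for crunch-run's container runner. Everything
// here is hypothetical except the names saveLogCollection and Wait.
type fakeRunner struct {
	saved   []string // log lines already persisted
	pending []string // log lines buffered but not yet persisted
}

// logf buffers a log line without persisting it.
func (r *fakeRunner) logf(msg string) { r.pending = append(r.pending, msg) }

// saveLogCollection persists everything buffered so far.
func (r *fakeRunner) saveLogCollection() {
	r.saved = append(r.saved, r.pending...)
	r.pending = nil
}

// Wait simulates the instance being reclaimed while the container runs:
// output logged here is never persisted.
func (r *fakeRunner) Wait() error {
	r.logf("container output")
	return errors.New("instance reclaimed")
}

// run shows the proposed ordering of checkpoints.
func (r *fakeRunner) run() error {
	r.logf("Creating Docker container")
	r.saveLogCollection() // existing checkpoint

	r.logf("Starting container") // assumed message, for illustration
	r.saveLogCollection()        // proposed checkpoint right before Wait()

	return r.Wait()
}

func main() {
	r := &fakeRunner{}
	err := r.run()
	fmt.Println(r.saved, err)
}
```

With the second checkpoint, the saved log distinguishes "reclaimed after the container started" from "never got past container creation".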
Updated by Peter Amstutz about 1 month ago
- Position changed from -933956 to -933944
- Status changed from New to In Progress
Updated by Peter Amstutz about 1 month ago
- Status changed from In Progress to New
Updated by Tom Clegg 30 days ago
- Related to Bug #22434: "Error checking spot interruptions:" is actually a warning, text should reflect that added
Updated by Peter Amstutz 25 days ago
I intentionally submitted a long-running process to see if it gets terminated:
https://workbench.tordo.arvadosapi.com/processes/tordo-xvhdp-mg3xmorv582uyrf
Updated by Peter Amstutz 12 days ago
- Target version changed from Development 2025-01-29 to Development 2025-02-12
Updated by Peter Amstutz 11 days ago
- Target version changed from Development 2025-02-12 to Development 2025-02-26
Updated by Peter Amstutz 10 days ago
- Target version changed from Development 2025-02-26 to Development 2025-02-12
- Assigned To changed from Peter Amstutz to Tom Clegg