Bug #17775

Updated by Tom Clegg over 2 years ago

When a preemptible node is shut down by the cloud provider while running a workflow, the container is automatically requeued. From the user's perspective, it looks like a container is running and then suddenly disappears and gets replaced with a new one that is in the 'queued' state.

The user should be able to see what happened ("container was stopped and requeued because the cloud node failed/preemptible instance was stopped"). More generally, it would be nice if, for a given CR, the user could see *all* the containers that were created/started/failed (and the relevant timestamps) for that CR in wb/wb2. Right now, when a container fails and is automatically requeued, that is quite hard to see.

The cloud providers send a signal when a preemptible node is going to be shut down (for example, on EC2, crunch-run can "poll instance metadata":https://aws.amazon.com/blogs/compute/best-practices-for-handling-ec2-spot-instance-interruptions/); ideally we'd catch that, log it, and bubble it up to the user.
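
For illustration only, here is a minimal sketch of what polling the EC2 spot interruption notice could look like. It is not crunch-run's code (crunch-run is written in Go); the function names @parse_spot_action@ and @check_spot_interruption@ are hypothetical. It assumes the instance metadata service answers without a session token (IMDSv1 style); an instance that requires IMDSv2 would need a token request first, which is omitted here. The endpoint returns 404 until an interruption is scheduled, then a small JSON document with @action@ and @time@ fields.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

# EC2 instance metadata path that reports a pending spot interruption.
# It returns 404 until an interruption has been scheduled.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_spot_action(status, body):
    """Return (action, time) if the response announces an interruption, else None."""
    if status != 200:
        return None
    doc = json.loads(body)
    # The notice carries a UTC timestamp such as "2017-09-18T08:22:00Z".
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return doc["action"], when


def check_spot_interruption(url=SPOT_ACTION_URL, timeout=2.0):
    """Poll the metadata endpoint once (hypothetical helper, no session token)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_spot_action(resp.status, resp.read().decode())
    except urllib.error.HTTPError as err:
        # 404 is the normal "no interruption scheduled" answer.
        return parse_spot_action(err.code, "")
    except OSError:
        # Not running on EC2, or the metadata service is unreachable.
        return None
```

A caller would run @check_spot_interruption()@ every few seconds and, on a non-None result, write a log line such as "preemptible instance is being stopped by the cloud provider" before the node goes away, so the reason for the requeue is recorded somewhere the user can see it.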
