Bug #17775

open

[a-d-c] the user should be able to see when preemptible nodes get shut down and the running container requeued

Added by Ward Vandewege over 3 years ago. Updated 10 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version:
Story points: -
Release:
Release relationship: Auto

Description

When a preemptible node is shut down by the cloud provider while running a workflow, the container is automatically requeued. From the user's perspective, it looks like a container is running and then suddenly disappears and gets replaced with a new one that is in the 'queued' state.

The user should be able to see what happened ("container was stopped and requeued because the cloud node failed/preemptible instance was stopped"). More generally, it would be nice if for a given CR, the user could see all the containers that were created/started/failed (and the relevant timestamps) for that CR in wb/wb2. Right now, when a container fails and is automatically requeued, that is quite hard to see.
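To make the per-CR history concrete, here is a rough sketch of how such a view could be rendered. The `Attempt` record, its field names, and the `outcome` strings are all hypothetical and are not taken from the Arvados API. It only shows the shape of the information the user would want to see.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    # Hypothetical record of one container created for a container request.
    container_uuid: str
    created_at: str              # assumed: UTC ISO 8601 strings in one format
    started_at: Optional[str]
    finished_at: Optional[str]
    outcome: str                 # free text, e.g. "requeued: instance preempted"


def format_timeline(attempts: List[Attempt]) -> List[str]:
    """Return one human-readable line per attempt, oldest first."""
    lines = []
    # Same-format UTC ISO 8601 timestamps sort chronologically as plain strings.
    ordered = sorted(attempts, key=lambda a: a.created_at)
    for n, a in enumerate(ordered, start=1):
        lines.append(
            f"attempt {n}: {a.container_uuid} created {a.created_at}, "
            f"started {a.started_at or '-'}, finished {a.finished_at or '-'}: "
            f"{a.outcome}"
        )
    return lines
```

With a view like this, a preempted attempt followed by a queued replacement shows up as two consecutive lines rather than as a container that silently vanished.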

The cloud providers send a signal when a preemptible node is going to be shut down. For example, on EC2, crunch-run can poll instance metadata. Ideally we'd catch that and log it, and bubble that up to the user.
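As an illustration of the EC2 case, the sketch below polls the spot `instance-action` metadata path. That path returns 404 until an interruption is scheduled, and then a small JSON document with `action` and `time` fields. The function names and the log wording are assumptions made for this example, not crunch-run code. IMDSv2 session-token handling is left out, so as written it assumes the instance accepts plain IMDSv1 requests.

```python
import json
import urllib.error
import urllib.request

# Link-local address of the EC2 instance metadata service.
IMDS_BASE = "http://169.254.169.254"
SPOT_ACTION_PATH = "/latest/meta-data/spot/instance-action"


def parse_instance_action(status, body):
    """Return (action, time) if an interruption is scheduled, else None.

    EC2 answers 404 until an interruption is scheduled; afterwards the body
    is JSON such as {"action": "terminate", "time": "2017-09-18T08:22:00Z"}.
    """
    if status != 200:
        return None
    try:
        notice = json.loads(body)
    except ValueError:
        return None
    if not isinstance(notice, dict) or "action" not in notice:
        return None
    return notice["action"], notice.get("time")


def fetch_instance_action(base_url=IMDS_BASE, timeout=2.0):
    """Fetch the instance-action document; return (status, body)."""
    url = base_url + SPOT_ACTION_PATH
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, ""   # 404 is the normal "nothing scheduled" answer
    except OSError:
        return 0, ""          # metadata service unreachable (e.g. not on EC2)


def preemption_log_line(status, body):
    """Build a user-facing log message (hypothetical wording), or None."""
    notice = parse_instance_action(status, body)
    if notice is None:
        return None
    action, when = notice
    return (f"cloud provider scheduled instance {action} at "
            f"{when or 'an unknown time'}; container is expected to be requeued")
```

A caller would run `preemption_log_line(*fetch_instance_action())` every few seconds and write any non-`None` result to the container log, which is where the message could then be surfaced to the user.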

Actions #1

Updated by Ward Vandewege over 3 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz over 3 years ago

  • Target version deleted (To Be Groomed)
Actions #3

Updated by Tom Clegg about 3 years ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz almost 2 years ago

  • Release set to 60
Actions #5

Updated by Peter Amstutz 10 months ago

  • Target version set to Future