Bug #20511

closed

High number of "aborted" boot outcomes

Added by Peter Amstutz 12 months ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Crunch
Target version:
Story points: -
Release relationship: Auto

Description

I don't know how to interpret this, but with arvados-dispatch-cloud running a large job (MaxInstances=400), I am seeing a trend of roughly two "aborted" instances for every "successful" instance in the arvados_dispatchcloud_boot_outcomes metric. In other words, the "aborted" line is growing twice as fast as the "successful" line.

I'm wondering if this is related to #20457 and some kind of churn at the top of the queue.

Edit: I'm looking at the code, and it looks like "aborted" might just mean the node was shut down intentionally. Is that right? (This feels like a bad choice of terminology, since "aborted" usually means terminating because of an error condition.)

I'm trying to understand why the numbers are out of balance. Shouldn't there be one shutdown for every successful startup?
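
For anyone wanting to check the same ratio, here is a minimal sketch that reads Prometheus text-format output and computes "aborted" boots per successful boot. It assumes the metric carries an `outcome` label with values such as `aborted` and `success`; the label name, the label values, and the sample numbers are assumptions for illustration, not taken from a real server. Because counters are cumulative, comparing growth rates means subtracting two scrapes first.

```python
import re

# Hypothetical scrape output. The "outcome" label name, its values, and the
# numbers are assumptions for illustration only.
SAMPLE = '''\
# TYPE arvados_dispatchcloud_boot_outcomes counter
arvados_dispatchcloud_boot_outcomes{outcome="aborted"} 212
arvados_dispatchcloud_boot_outcomes{outcome="success"} 106
arvados_dispatchcloud_boot_outcomes{outcome="failure"} 3
'''

LINE_RE = re.compile(
    r'^arvados_dispatchcloud_boot_outcomes\{outcome="([^"]+)"\}\s+([0-9.eE+-]+)\s*$'
)


def boot_outcomes(text):
    """Return {outcome: count} parsed from Prometheus text-format output."""
    counts = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            counts[m.group(1)] = float(m.group(2))
    return counts


def delta(before, after):
    """Per-outcome growth between two scrapes of the cumulative counters."""
    return {k: after.get(k, 0.0) - before.get(k, 0.0) for k in after}


def aborted_per_success(counts):
    """Ratio of aborted to successful boots, or None if there were no successes."""
    success = counts.get("success", 0.0)
    if success == 0:
        return None
    return counts.get("aborted", 0.0) / success


if __name__ == "__main__":
    print(aborted_per_success(boot_outcomes(SAMPLE)))
```

Passing the result of `delta(earlier, later)` to `aborted_per_success` gives the ratio over just that interval, which is closer to "growing twice as fast" than the ratio of the raw totals.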


Subtasks: 1 (0 open, 1 closed)

Task #20549: Review 20511-aborted-boot (Resolved, Tom Clegg, 05/25/2023)