Bug #20511
Updated by Peter Amstutz over 1 year ago
I don't know how to interpret this, but with arvados-dispatch-cloud running a large job (MaxInstances=400), the arvados_dispatchcloud_boot_outcomes metric shows a trend of roughly two "aborted" instances for every "successful" instance. In other words, the "aborted" line is growing twice as fast as the "successful" line.
I'm wondering if this is related to #20457 and some kind of churn at the top of the queue.
Edit: I'm looking at the code, and it looks like "aborted" might just mean the node was shut down intentionally. Is that right? (This feels like a bad choice of terminology, since "aborted" usually means termination caused by an error condition.)
I'm trying to understand why the numbers are out of balance. Shouldn't there be one shutdown for every successful startup?
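The "twice as fast" comparison above can be checked from two samples of the counters. This is only a rough sketch: it assumes the per-outcome values of arvados_dispatchcloud_boot_outcomes have already been read into plain dicts, the helper name growth_ratio is hypothetical, and the sample numbers are made up for illustration rather than taken from this cluster.

```python
def growth_ratio(before, after, numerator="aborted", denominator="successful"):
    """Return how many times faster the numerator counter grew than the denominator."""
    num_delta = after[numerator] - before[numerator]
    den_delta = after[denominator] - before[denominator]
    if den_delta == 0:
        # Avoid dividing by zero when no instances booted successfully in the window.
        raise ValueError("denominator counter did not change between samples")
    return num_delta / den_delta


# Hypothetical counter samples taken at two points in time (illustrative values only).
earlier = {"aborted": 100, "successful": 50}
later = {"aborted": 300, "successful": 150}

ratio = growth_ratio(earlier, later)
print(ratio)
```

A ratio near 1.0 would match the expectation of one shutdown per successful startup; a ratio near 2.0 corresponds to the imbalance described here.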