Bug #18161: [a-d-c] the arvados_dispatchcloud_queue_entries prometheus metric should report actual instance types - Arvados

Bug #18161

open

Added by Ward Vandewege over 3 years ago. Updated about 1 year ago.

Status: New

Priority: Normal

Assigned To:

Category:

Target version: Future

Story points:

Release: Postponed

Release relationship: Auto

Description

The `arvados_dispatchcloud_queue_entries` metric is implemented in the "container queue" module, which knows nothing about instances. For each set of resource requirements, it reports the best instance type according to the current configuration file.

This can cause inaccurate metrics when the node definitions in the configuration file are changed (and a-d-c is restarted) while containers are running: instead of actual data, the metric reports aspirational data. Today, a job was started that used 48 m5a.xlarge nodes and then ran into cloud capacity problems (spot instances). I updated the config file to make m5a.xlarge much more expensive and restarted a-d-c, which promptly started the rest of the pending containers on m5.xlarge nodes. But the metric then reported 96 containers running on m5.xlarge, instead of the reality, which was 48 on m5a.xlarge and 48 on m5.xlarge.
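The failure mode can be reproduced in a few lines. The following is a minimal sketch, not the actual a-d-c code: `InstanceType`, `Container`, `cheapestFit`, and `countByConfig` are hypothetical names, the prices are made up, and "best" is assumed to mean "cheapest type that satisfies the requirements". It shows that a count derived only from the current config attributes all 96 containers to m5.xlarge once m5a.xlarge has been made more expensive.

```go
package main

import "fmt"

// InstanceType is a hypothetical, simplified stand-in for an instance
// type entry in the configuration file.
type InstanceType struct {
	Name   string
	VCPUs  int
	RAMGiB int
	Price  float64 // made-up hourly price
}

// Container is a hypothetical stand-in for a container's resource requirements.
type Container struct {
	VCPUs  int
	RAMGiB int
}

// cheapestFit returns the name of the lowest-priced instance type that
// satisfies the container's requirements, or "" if none does.
func cheapestFit(types []InstanceType, c Container) string {
	best := ""
	bestPrice := 0.0
	for _, t := range types {
		if t.VCPUs < c.VCPUs || t.RAMGiB < c.RAMGiB {
			continue
		}
		if best == "" || t.Price < bestPrice {
			best, bestPrice = t.Name, t.Price
		}
	}
	return best
}

// countByConfig counts containers per instance type using only the
// current config, ignoring where each container is really running.
func countByConfig(types []InstanceType, containers []Container) map[string]int {
	counts := map[string]int{}
	for _, c := range containers {
		counts[cheapestFit(types, c)]++
	}
	return counts
}

func main() {
	// Config after the edit: m5a.xlarge is now much more expensive.
	types := []InstanceType{
		{Name: "m5a.xlarge", VCPUs: 4, RAMGiB: 16, Price: 1.00},
		{Name: "m5.xlarge", VCPUs: 4, RAMGiB: 16, Price: 0.12},
	}
	// 96 running containers with identical requirements; in reality
	// 48 of them are on m5a.xlarge nodes started before the edit.
	running := make([]Container, 96)
	for i := range running {
		running[i] = Container{VCPUs: 4, RAMGiB: 16}
	}
	fmt.Println(countByConfig(types, running)) // prints "map[m5.xlarge:96]"
}
```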

The `arvados_dispatchcloud_instances_total` metrics (aka `node by state`) are correct in this scenario, and do not need fixing.

The `arvados_dispatchcloud_queue_entries` metric should be moved to the scheduler, which knows about queues and workers, and be changed to report actual information.
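As a rough illustration of what "actual information" could look like, the sketch below counts running containers by the instance type of the worker they are running on, rather than by a type recomputed from the config. It is a simplified model, not the scheduler's real data structures: `Worker`, `makeWorkers`, and `countByActualType` are hypothetical names, and one container per node is assumed.

```go
package main

import "fmt"

// Worker is a hypothetical stand-in for a cloud instance known to the
// scheduler, together with the containers currently running on it.
type Worker struct {
	InstanceType string   // type the instance was actually created with
	Running      []string // identifiers of containers running on it
}

// makeWorkers builds n workers of the given type, each running one
// container (assumption: one container per node).
func makeWorkers(instanceType string, n int) []Worker {
	workers := make([]Worker, 0, n)
	for i := 0; i < n; i++ {
		id := fmt.Sprintf("%s-container-%d", instanceType, i)
		workers = append(workers, Worker{InstanceType: instanceType, Running: []string{id}})
	}
	return workers
}

// countByActualType counts running containers per instance type, using
// the type recorded on each worker instead of the current config.
func countByActualType(workers []Worker) map[string]int {
	counts := map[string]int{}
	for _, w := range workers {
		counts[w.InstanceType] += len(w.Running)
	}
	return counts
}

func main() {
	// The situation from the report: 48 nodes of each type.
	workers := append(makeWorkers("m5a.xlarge", 48), makeWorkers("m5.xlarge", 48)...)
	counts := countByActualType(workers)
	fmt.Println(counts["m5a.xlarge"], counts["m5.xlarge"]) // prints "48 48"
}
```

Because the count is keyed on the type recorded for each worker, a config change followed by a restart would not retroactively relabel containers that are already running.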

Updated by Ward Vandewege over 3 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 1 year ago