Bug #18161

[a-d-c] the arvados_dispatchcloud_queue_entries prometheus metric should report actual instance types

Added by Ward Vandewege about 1 month ago. Updated about 1 month ago.

Status: New
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

The `arvados_dispatchcloud_queue_entries` metric is implemented in the "container queue" module, which knows nothing about instances. For each container it reports the best instance type for that container's resource requirements, computed from the current configuration file, rather than the instance type the container is actually running on.

This makes the metric inaccurate when the node definitions in the configuration file are changed (and a-d-c is restarted) while containers are running: from that point on it reports aspirational data instead of actual data. Today, a job was started that used 48 m5a.xlarge nodes and then ran into cloud capacity problems (spot instances). I updated the config file to make m5a.xlarge much more expensive and restarted a-d-c, which promptly started the rest of the pending containers on m5.xlarge nodes. The metric then reported 96 containers running on m5.xlarge, when in reality 48 were running on m5a.xlarge and 48 on m5.xlarge.
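To make the discrepancy concrete, here is a minimal Go sketch of the two ways of counting. It is not the a-d-c code: the `InstanceType` and `Container` types, the function names, and the prices are all hypothetical, and it assumes that "best" means the cheapest configured type that satisfies the container's requirements.

```go
package main

// InstanceType is a hypothetical, simplified node definition from the config file.
type InstanceType struct {
	Name  string
	VCPUs int
	RAM   int64   // bytes
	Price float64 // per hour; values used with this sketch are made up
}

// Container is a hypothetical, simplified queue entry.
type Container struct {
	VCPUs int
	RAM   int64 // bytes
	// RunningOn is the instance type the container was actually started on
	// (empty if it has not been started yet).
	RunningOn string
}

// chooseType returns the cheapest configured type that fits the container.
// Assumption: this is what "best instance type" means.
func chooseType(types []InstanceType, c Container) (string, bool) {
	var best InstanceType
	found := false
	for _, t := range types {
		if t.VCPUs < c.VCPUs || t.RAM < c.RAM {
			continue // too small for this container
		}
		if !found || t.Price < best.Price {
			best, found = t, true
		}
	}
	return best.Name, found
}

// countByConfig counts containers per instance type the way the bug describes:
// by re-evaluating the current config, ignoring where containers really run.
func countByConfig(types []InstanceType, queue []Container) map[string]int {
	counts := map[string]int{}
	for _, c := range queue {
		if name, ok := chooseType(types, c); ok {
			counts[name]++
		}
	}
	return counts
}

// countByActual counts started containers by the instance type they run on.
func countByActual(queue []Container) map[string]int {
	counts := map[string]int{}
	for _, c := range queue {
		if c.RunningOn != "" {
			counts[c.RunningOn]++
		}
	}
	return counts
}
```

With 96 containers, 48 started on m5a.xlarge and 48 on m5.xlarge, and a config in which m5a.xlarge has been made the more expensive of the two, `countByConfig` attributes all 96 to m5.xlarge while `countByActual` reports 48 for each type.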

The `arvados_dispatchcloud_instances_total` metrics (aka `node by state`) are correct in this scenario, and do not need fixing.

The `arvados_dispatchcloud_queue_entries` metric should be moved to the scheduler, which knows about both queues and workers, and changed to report the instance types that containers are actually running on.
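One possible shape for the scheduler-side computation, as a rough Go sketch under stated assumptions: the `QueueEntry`, `Worker`, and `MetricKey` types and the `queueEntriesByType` function are hypothetical and do not come from the a-d-c code base, and keying the metric by state and instance type is an assumption about its labels. The idea is to use the worker's real instance type when a container has been assigned to one, and to fall back to the config-derived type only for containers that have not been started.

```go
package main

// QueueEntry is a hypothetical view of one container in the queue.
type QueueEntry struct {
	State      string // e.g. "Queued", "Locked", "Running"
	ChosenType string // instance type derived from the current config
}

// Worker is a hypothetical view of the instance a container runs on.
type Worker struct {
	InstanceType string // type the instance was actually created with
}

// MetricKey is the assumed label set for the metric.
type MetricKey struct {
	State        string
	InstanceType string
}

// queueEntriesByType counts queue entries per (state, instance type).
// workers maps a container UUID to the worker running it, if any.
func queueEntriesByType(queue map[string]QueueEntry, workers map[string]Worker) map[MetricKey]int {
	counts := map[MetricKey]int{}
	for uuid, ent := range queue {
		// Not started yet: the config-derived type is the best guess available.
		instanceType := ent.ChosenType
		if w, ok := workers[uuid]; ok {
			// Already on an instance: report what it really runs on.
			instanceType = w.InstanceType
		}
		counts[MetricKey{State: ent.State, InstanceType: instanceType}]++
	}
	return counts
}
```

In this sketch a config change after restart only affects the counts for containers that have no worker yet, so containers already running on m5a.xlarge would keep being reported under m5a.xlarge.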

History

#1 Updated by Ward Vandewege about 1 month ago

  • Description updated
