Idea #21542

Improved visibility on cloud instance (and maybe other resources?) quotas

Added by Lucas Di Pentima 9 months ago. Updated 9 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: Crunch
Target version:
Start date:
Due date:
Story points: -

Description

Some ideas:

  1. arvados-server cloudtest could check the number of vCPUs it has available per instance family and report an error or warning when the current quota means that some of its configured instance types cannot be requested at all, or that only a handful of them (below some threshold) can be.
  2. arvados-dispatch-cloud could monitor current quota limits and expose them via Prometheus so they can be included in a Grafana dashboard, which would be useful for quickly debugging scheduling issues, e.g.:
    1. Storage for General Purpose SSD: This is important for the autoscaling EBS feature to work. Containers could suddenly fail because of storage space starvation if the limit is too small for the workload.
    2. Running On-Demand Standard Instances: I believe this is measured in vCPUs, so requesting just a handful of big nodes could sometimes make it impossible to request smaller nodes from the same family.
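
A rough sketch of the check described in idea 1, assuming the quota and current usage are already known as plain vCPU counts. The names `InstanceType`, `check_quota` and `min_instances` are hypothetical and not part of arvados-server; the instance types and numbers are only illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    """A configured instance type and the vCPUs one instance consumes."""
    name: str
    vcpus: int


def check_quota(quota_vcpus, used_vcpus, instance_types, min_instances=2):
    """Return (level, type name, instances that still fit) for problem types.

    level is "error" when not even one instance fits in the remaining
    quota, and "warning" when fewer than min_instances fit.
    """
    available = max(quota_vcpus - used_vcpus, 0)
    findings = []
    for it in instance_types:
        fits = available // it.vcpus
        if fits == 0:
            findings.append(("error", it.name, fits))
        elif fits < min_instances:
            findings.append(("warning", it.name, fits))
    return findings


if __name__ == "__main__":
    # Illustrative values: a 64 vCPU family quota with 32 vCPUs in use.
    configured = [
        InstanceType("m5.large", 2),
        InstanceType("m5.4xlarge", 16),
        InstanceType("m5.24xlarge", 96),
    ]
    for level, name, fits in check_quota(64, 32, configured, min_instances=4):
        print(f"{level}: {name}: quota allows {fits} more instance(s)")
```

With these numbers the largest type would be reported as an error (it never fits) and the middle one as a warning (only two more fit), which is the "just a handful of big nodes" situation from item 2.2.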

The rationale behind this is that the cloud team is sometimes a different group from the one that uses and manages Arvados, so quotas might change over time without being communicated. This could cause issues on production clusters that rely on the throughput seen in past runs.

Documentation that might be useful: https://docs.aws.amazon.com/servicequotas/
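
For idea 2, a minimal sketch of what the exposed metrics might look like in the Prometheus text exposition format. The metric names `cloud_quota_limit` and `cloud_quota_usage`, the `quota` label, and the quota names in the example are all made up for illustration; fetching the real values from the cloud provider is left out.

```python
def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )


def render_quota_metrics(quotas):
    """Render {quota name: (limit, usage)} as Prometheus gauge metrics."""
    lines = [
        "# HELP cloud_quota_limit Current limit of a cloud provider quota.",
        "# TYPE cloud_quota_limit gauge",
    ]
    for name, (limit, _usage) in sorted(quotas.items()):
        lines.append(f'cloud_quota_limit{{quota="{escape_label(name)}"}} {limit}')
    lines += [
        "# HELP cloud_quota_usage Current usage counted against a cloud provider quota.",
        "# TYPE cloud_quota_usage gauge",
    ]
    for name, (_limit, usage) in sorted(quotas.items()):
        lines.append(f'cloud_quota_usage{{quota="{escape_label(name)}"}} {usage}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical quota names and values, one per example in the list above.
    print(render_quota_metrics({
        "ondemand_standard_vcpus": (256, 96),
        "ebs_gp_ssd_tib": (50, 12),
    }), end="")
```

Exposing both limit and usage as gauges would let a Grafana panel plot the remaining headroom and alert when it drops, without the dashboard needing to know anything about the provider.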


Related issues

Related to Arvados - Feature #21123: Add API that returns current dispatch/scheduling status for a specified container (Resolved, Tom Clegg)
