Idea #21542
open
Improved visibility on cloud instance (and maybe other resources?) quotas
Description
Some ideas:
- arvados-server cloudtest could check the number of vCPUs per instance family it has available and report an error/warning when the current quota means that some of its configured instance types won't be able to be requested, or just a handful below some threshold.
- arvados-dispatch-cloud monitors and exposes via Prometheus current quota limits that can be included in a Grafana dashboard, useful for quickly debugging scheduling issues, e.g.:
  - Storage for General Purpose SSD: This is important for the autoscaling EBS feature to work. Containers could suddenly fail because of storage space starvation if the limit is too small for the workload.
  - Running On-Demand Standard Instances: I believe this is measured in vCPUs, so requesting just a handful of big nodes could sometimes make it impossible to request smaller nodes from the same family.
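To make the cloudtest idea a bit more concrete, here is a minimal sketch of the kind of check it could run. It is not existing Arvados code: the `InstanceType` record, the `quota_group` label, the `min_instances` threshold, and the quota numbers are all hypothetical, and it assumes the quota is expressed as a vCPU count per group of instance families.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str         # e.g. "m5.large"
    quota_group: str  # hypothetical label for the quota this type counts against
    vcpus: int


def check_quotas(instance_types, vcpu_quota, min_instances=4):
    """Return (errors, warnings) for types the quota cannot fit at all or barely fits."""
    errors, warnings = [], []
    for it in instance_types:
        limit = vcpu_quota.get(it.quota_group)
        if limit is None:
            warnings.append(f"{it.name}: no quota known for group {it.quota_group!r}")
            continue
        fits = limit // it.vcpus  # instances the whole quota could hold
        if fits == 0:
            errors.append(f"{it.name}: needs {it.vcpus} vCPUs, quota is {limit}")
        elif fits < min_instances:
            warnings.append(
                f"{it.name}: quota of {limit} vCPUs fits only {fits} instance(s)"
            )
    return errors, warnings


# Example with made-up quota numbers: a 32 vCPU limit shared by three types.
types = [
    InstanceType("m5.large", "standard", 2),
    InstanceType("m5.4xlarge", "standard", 16),
    InstanceType("m5.24xlarge", "standard", 96),
]
errors, warnings = check_quotas(types, {"standard": 32})
```

With these inputs the largest type is reported as an error (it can never be requested), the middle one as a warning, and the smallest passes. The check ignores vCPUs already in use, so it only catches limits that are too low even for an idle account.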
The rationale behind this is that cloud teams are sometimes separate from the groups that use and manage Arvados, so quotas might change over time without being communicated. This can cause issues on production clusters that expect the throughput seen in past runs.
Documentation that might be useful: https://docs.aws.amazon.com/servicequotas/
Related issues
Updated by Peter Amstutz 9 months ago
- Related to Feature #21123: Add API that returns current dispatch/scheduling status for a specified container added
Updated by Peter Amstutz 9 months ago
Related, I think if #21123 was able to surface the errors that are currently in the logs (e.g. hit vCPU quota) that would also go a long way towards at least understanding what's happening.
I think it's a bit hard to check for this proactively without knowing what the workload is actually going to be.
Updated by Brett Smith 9 months ago
Peter Amstutz wrote in #note-2:
I think it's a bit hard to check for this proactively without knowing what the workload is actually going to be.
Had a similar thought. I respect the goal here but reporting "just a handful below some threshold" is pretty vague as requirements go. The real question is "will it be able to spin up the number of instances the user expects," and that could be anything—especially since different individuals might have different expectations.
Updated by Lucas Di Pentima 9 months ago
Brett Smith wrote in #note-3:
Peter Amstutz wrote in #note-2:
I think it's a bit hard to check for this proactively without knowing what the workload is actually going to be.
Had a similar thought. I respect the goal here but reporting "just a handful below some threshold" is pretty vague as requirements go. The real question is "will it be able to spin up the number of instances the user expects," and that could be anything—especially since different individuals might have different expectations.
We could estimate how much the users will want from the MaxInstances config knob.
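As a rough illustration of that estimate (a sketch only, not how arvados-dispatch-cloud works today): assume the worst case where every one of the MaxInstances slots is filled with the same instance type, and compare the vCPUs that would need against the quota. The function name, the vCPU table, and the quota value below are hypothetical.

```python
def max_instances_shortfall(max_instances, vcpus_by_type, vcpu_quota):
    """Map each instance type to the number of instances, out of
    max_instances, that the vCPU quota could not cover."""
    shortfall = {}
    for name, vcpus in vcpus_by_type.items():
        fits = vcpu_quota // vcpus  # instances of this type the quota allows
        if fits < max_instances:
            shortfall[name] = max_instances - fits
    return shortfall


# Made-up numbers: MaxInstances of 64 against a 256 vCPU quota.
shortfall = max_instances_shortfall(
    64, {"m5.large": 2, "m5.4xlarge": 16}, 256
)
```

Here only m5.4xlarge is flagged: the quota covers 16 of the 64 instances the config would allow, leaving 48 that could not be started. This is an upper bound on demand, so it would be more suitable for a warning than for an error.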
I agree that #21123 would solve part of what this story is about (I didn't have it in mind when I wrote this), but having a preemptive way of checking for overly restrictive quotas might be beneficial for cluster setup, and exposing quota limits as metrics would also improve the overall cluster status monitoring experience.