Feature #20426
open
Installer sets up health check aggregator and monitoring/alerting based on health checks
Story points:
3.0
Description
The idea is:
- have monitoring running on a node that is unlikely to be affected by Arvados issues
- run the health check aggregator
- have Prometheus check the health checks periodically
- configure alertmanager to send out an email if the health check fails
- add nginx prometheus exporter (https://github.com/nginxinc/nginx-prometheus-exporter) to monitor nginx health
- add grafana graphs to monitor controller's error status codes.
- set up preemptive alerts based on:
  - % of status 2xx vs 5xx
  - % of concurrent requests vs max concurrent requests (controller)
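
A rough sketch of how the two preemptive alert conditions could be evaluated, for discussion only. The thresholds, the `ControllerSample` type, and the alert names are all hypothetical; the ticket does not specify any of them, and in a real deployment this logic would more likely live in Prometheus alerting rules than in Python.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds; the ticket does not give any values.
MAX_5XX_FRACTION = 0.05          # alert when more than 5% of responses are 5xx
MAX_CONCURRENCY_FRACTION = 0.80  # alert when above 80% of max concurrent requests


@dataclass
class ControllerSample:
    """One observation window of controller metrics (hypothetical shape)."""
    status_2xx: int
    status_5xx: int
    concurrent_requests: int
    max_concurrent_requests: int


def error_fraction(sample: ControllerSample) -> float:
    """Fraction of 5xx responses among 2xx + 5xx responses."""
    total = sample.status_2xx + sample.status_5xx
    return sample.status_5xx / total if total else 0.0


def concurrency_fraction(sample: ControllerSample) -> float:
    """Fraction of the configured concurrency limit currently in use."""
    if sample.max_concurrent_requests <= 0:
        return 0.0
    return sample.concurrent_requests / sample.max_concurrent_requests


def preemptive_alerts(sample: ControllerSample) -> List[str]:
    """Return the names of the alerts that should fire for this sample."""
    alerts = []
    if error_fraction(sample) > MAX_5XX_FRACTION:
        alerts.append("high_5xx_ratio")
    if concurrency_fraction(sample) > MAX_CONCURRENCY_FRACTION:
        alerts.append("near_max_concurrency")
    return alerts
```

For example, `preemptive_alerts(ControllerSample(80, 20, 90, 100))` reports both conditions, while a window with no traffic reports none.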
Updated by Peter Amstutz over 1 year ago
- Category set to Deployment
- Tracker changed from Bug to Feature