Feature #20426
Updated by Lucas Di Pentima over 1 year ago
The idea is:

* have monitoring running on a node that is unlikely to be affected by Arvados issues
* run the health check aggregator
* have Prometheus check the health checks periodically
* configure Alertmanager to send out an email if the health check fails

Extra points:

* add the nginx Prometheus exporter (https://github.com/nginxinc/nginx-prometheus-exporter) to monitor nginx health
* add Grafana graphs to monitor the controller's error status codes
* set up preemptive alerts based on:
** % of status 2xx vs 5xx
** % of concurrent requests vs max concurrent requests (controller)
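To make the two preemptive alert conditions concrete, here is a minimal Python sketch of the arithmetic behind them. Everything in it is an assumption for illustration: the @ControllerSample@ shape, the function names, and the 5% / 80% thresholds are hypothetical and do not come from Arvados or Prometheus.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ControllerSample:
    """One snapshot of controller counters (hypothetical shape, not an Arvados API)."""
    responses_2xx: int
    responses_5xx: int
    concurrent_requests: int
    max_concurrent_requests: int


def error_ratio(sample: ControllerSample) -> float:
    """Fraction of 5xx responses among all 2xx and 5xx responses."""
    total = sample.responses_2xx + sample.responses_5xx
    return sample.responses_5xx / total if total else 0.0


def saturation(sample: ControllerSample) -> float:
    """Fraction of the concurrent request limit currently in use."""
    if sample.max_concurrent_requests <= 0:
        return 0.0
    return sample.concurrent_requests / sample.max_concurrent_requests


def preemptive_alerts(
    sample: ControllerSample,
    max_error_ratio: float = 0.05,  # assumed threshold, tune per cluster
    max_saturation: float = 0.80,   # assumed threshold, tune per cluster
) -> List[str]:
    """Return a message for every threshold the sample exceeds."""
    alerts = []
    if error_ratio(sample) > max_error_ratio:
        alerts.append(f"5xx ratio at {error_ratio(sample):.1%}")
    if saturation(sample) > max_saturation:
        alerts.append(f"concurrent requests at {saturation(sample):.1%} of max")
    return alerts
```

In a real deployment these checks would most likely be written as Prometheus alerting rules evaluated against the scraped metrics, with Alertmanager handling the email; the Python above only illustrates what each rule would compute.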