Feature #20426

Updated by Lucas Di Pentima 11 months ago

The idea is: 

 * have monitoring running on a node that is unlikely to be affected by Arvados issues 
 * run the health check aggregator 
 * have Prometheus check the health checks periodically 
 * configure Alertmanager to send out an email if the health check fails 
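
 A minimal sketch of the pass/fail decision behind the periodic check, written in Python for illustration only. It assumes the health check aggregator returns a JSON document with a top-level "health" field that equals "OK" when every check passes. That field name, its value, and the function name is_cluster_healthy are assumptions, not something this ticket confirms. In the setup proposed above, Prometheus would do the polling and Alertmanager would send the email.

```python
import json


def is_cluster_healthy(body: str) -> bool:
    """Return True only if the aggregated health response reports OK.

    Assumption (not confirmed by this ticket): the response is a JSON
    object whose top-level "health" field is "OK" when all checks pass.
    Anything unparsable or missing that field is treated as unhealthy,
    so a broken aggregator also triggers the alert.
    """
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        # Not JSON at all (or not a string): count it as a failure.
        return False
    return isinstance(payload, dict) and payload.get("health") == "OK"
```

 Treating an unreadable response as a failure is a deliberate choice here: the monitoring node should alert when it cannot tell whether the cluster is healthy.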

 Extra points: 
 * add the nginx Prometheus exporter (https://github.com/nginxinc/nginx-prometheus-exporter) to monitor nginx health 
 * add Grafana graphs to monitor controller's error status codes. 
 * set up preemptive alerts based on: 
 ** percentage of responses with status 5xx compared to 2xx 
 ** percentage of concurrent requests relative to the maximum allowed concurrent requests (controller)
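
 The two preemptive alert conditions can be sketched as plain arithmetic. This is an illustration under stated assumptions, not the proposed implementation: the function names and the 5% and 80% thresholds are hypothetical, and in practice the same ratios would more likely be expressed as Prometheus alerting rules over the exported metrics.

```python
def error_ratio(count_2xx: int, count_5xx: int) -> float:
    """Fraction of responses that were 5xx, out of 2xx + 5xx combined."""
    total = count_2xx + count_5xx
    # With no traffic there is nothing to alert on.
    return count_5xx / total if total else 0.0


def saturation(concurrent: int, max_concurrent: int) -> float:
    """Fraction of the controller's concurrent-request limit in use."""
    if max_concurrent <= 0:
        raise ValueError("max_concurrent must be positive")
    return concurrent / max_concurrent


def should_alert(
    count_2xx: int,
    count_5xx: int,
    concurrent: int,
    max_concurrent: int,
    max_error_ratio: float = 0.05,  # hypothetical threshold
    max_saturation: float = 0.8,    # hypothetical threshold
) -> bool:
    """Alert when either ratio is strictly above its threshold."""
    return (
        error_ratio(count_2xx, count_5xx) > max_error_ratio
        or saturation(concurrent, max_concurrent) > max_saturation
    )
```

 For example, should_alert(90, 10, 10, 100) fires on the error ratio alone, while should_alert(100, 0, 90, 100) fires on saturation alone.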
