Feature #20426

open

Installer sets up health check aggregator and monitoring/alerting based on health checks

Added by Peter Amstutz 12 months ago. Updated 12 months ago.

Status: New
Priority: Normal
Assigned To: -
Category: Deployment
Target version:
Story points: 3.0

Description

The idea is:

  • have monitoring running on a node that is unlikely to be affected by Arvados issues
  • run the health check aggregator
  • have Prometheus check the health checks periodically
  • configure Alertmanager to send out an email if the health check fails
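The check-then-email step could look roughly like the sketch below. It is only an illustration of the decision logic, not the Prometheus/Alertmanager configuration this ticket asks for. It assumes the aggregator returns a JSON object whose "health" field is "OK" when everything is healthy; that response shape, the function name build_alert, and the addresses are assumptions made for the example.

```python
import json
from email.message import EmailMessage
from typing import Optional


def build_alert(body: str, to_addr: str, from_addr: str) -> Optional[EmailMessage]:
    """Return an alert email if the health response is not OK, else None.

    Assumption: a healthy aggregator response is a JSON object with
    {"health": "OK"}; anything else is treated as a failure.
    """
    try:
        report = json.loads(body)
    except json.JSONDecodeError:
        report = {"health": "ERROR", "error": "unparseable health response"}
    if not isinstance(report, dict):
        report = {"health": "ERROR", "error": "unexpected health response"}

    if report.get("health") == "OK":
        return None  # healthy: nothing to send

    msg = EmailMessage()
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg["Subject"] = "Health check failed"
    # Include the full report so the recipient can see which check failed.
    msg.set_content(json.dumps(report, indent=2, sort_keys=True))
    return msg
```

The returned message could be handed to smtplib.SMTP.send_message; in the setup proposed here, Prometheus would do the periodic check and Alertmanager the delivery.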
Extra points:
  • add nginx prometheus exporter (https://github.com/nginxinc/nginx-prometheus-exporter) to monitor nginx health
  • add grafana graphs to monitor controller's error status codes.
  • set up preemptive alerts based on:
    • % of status 2xx vs 5xx
    • % of concurrent requests vs max concurrent requests (controller)
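The two preemptive alert conditions can be sketched as simple ratios. This is a minimal illustration, not a Prometheus rule: the function names and the 5% and 80% thresholds are hypothetical values chosen for the example, and real thresholds would need tuning per cluster.

```python
from typing import Mapping


def error_share(status_counts: Mapping[int, int]) -> float:
    """Fraction of 5xx responses among all 2xx and 5xx responses."""
    ok = sum(n for code, n in status_counts.items() if 200 <= code < 300)
    err = sum(n for code, n in status_counts.items() if 500 <= code < 600)
    total = ok + err
    return err / total if total else 0.0


def concurrency_utilization(current: int, max_concurrent: int) -> float:
    """Fraction of the controller's concurrent-request limit in use."""
    if max_concurrent <= 0:
        raise ValueError("max_concurrent must be positive")
    return current / max_concurrent


def should_alert(
    status_counts: Mapping[int, int],
    current: int,
    max_concurrent: int,
    max_error_share: float = 0.05,  # hypothetical threshold
    max_utilization: float = 0.80,  # hypothetical threshold
) -> bool:
    """True if either preemptive condition exceeds its threshold."""
    return (
        error_share(status_counts) > max_error_share
        or concurrency_utilization(current, max_concurrent) > max_utilization
    )
```

For example, should_alert({200: 50, 500: 50}, 10, 100) fires on the error share, while should_alert({200: 100}, 90, 100) fires on concurrency alone.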
Actions #1

Updated by Peter Amstutz 12 months ago

  • Category set to Deployment
  • Tracker changed from Bug to Feature
Actions #2

Updated by Peter Amstutz 12 months ago

  • Description updated (diff)
Actions #3

Updated by Peter Amstutz 12 months ago

  • Story points set to 3.0
Actions #4

Updated by Lucas Di Pentima 12 months ago

  • Description updated (diff)