Idea #12085

closed

Add monitoring/alarm for failed/slow job dispatch & excess idle nodes

Added by Tom Morris over 6 years ago. Updated almost 6 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 08/08/2017
Due date:
Story points: 1.0
Release:
Release relationship: Auto

Description

We need some additional monitoring and alarms to catch situations like yesterday's crunch-dispatch.rb file descriptor issue.

Some suggestions for alarm conditions:
  • more than N (15? 15% of running_nodes?) idle nodes for more than M (10?) minutes
  • jobs queued for more than 15 minutes when there is idle capacity in the cluster (running_nodes < 0.95 * max_nodes)

The thresholds, sampling periods, and trigger periods can be adjusted as we gain experience with what's too little or too much. The goal is to ignore brief transients and normal steady-state churn, but quickly (< 1 hr) catch abnormal conditions that otherwise take us hours to notice on an ad hoc basis.
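
As a rough illustration of the two suggested conditions, here is a minimal Python sketch. It is not taken from any existing Arvados component: the Sample fields, the function names, and the default thresholds are all hypothetical, and the samples are assumed to come from some periodic poll of node and queue state. Note that the second check approximates "jobs queued for more than 15 minutes" as "the queue was non-empty for 15 minutes", which is not quite the same thing if jobs are being dispatched and replaced.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One monitoring sample. All field names are hypothetical."""
    timestamp: float    # seconds since the epoch
    idle_nodes: int
    running_nodes: int
    max_nodes: int
    queued_jobs: int


def sustained(samples: List[Sample],
              predicate: Callable[[Sample], bool],
              min_duration: float) -> bool:
    """True if predicate held for every sample in an unbroken run that
    ends at the newest sample and spans at least min_duration seconds.
    Samples must be ordered oldest to newest."""
    if not samples:
        return False
    run_start = None
    for sample in reversed(samples):
        if not predicate(sample):
            break
        run_start = sample.timestamp
    if run_start is None:
        return False
    return samples[-1].timestamp - run_start >= min_duration


def excess_idle(samples: List[Sample],
                max_idle: int = 15,
                minutes: float = 10) -> bool:
    """More than max_idle idle nodes for at least `minutes` minutes."""
    return sustained(samples,
                     lambda s: s.idle_nodes > max_idle,
                     minutes * 60)


def stalled_dispatch(samples: List[Sample],
                     minutes: float = 15,
                     capacity_fraction: float = 0.95) -> bool:
    """Queue non-empty while the cluster had spare capacity
    (running_nodes < capacity_fraction * max_nodes) for `minutes` minutes."""
    def stalled(s: Sample) -> bool:
        return (s.queued_jobs > 0
                and s.running_nodes < capacity_fraction * s.max_nodes)
    return sustained(samples, stalled, minutes * 60)
```

Requiring an unbroken run that ends at the newest sample is what lets brief transients pass without alarming: a single sample below the threshold resets the clock. Tuning then comes down to adjusting max_idle, minutes, and capacity_fraction.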


Subtasks 1 (0 open, 1 closed)

Task #13224: Review 12085-anm-metrics (Resolved, Peter Amstutz, 08/08/2017)

Related issues

Related to Arvados - Idea #11836: [Nodemanager] Improve status.json for monitoring (Rejected, Peter Amstutz, 05/23/2018)