Feature #15340

closed

[arvados-dispatch-cloud] Error-counting metrics

Added by Tom Clegg almost 5 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: 1.0
Release relationship: Auto

Description

Add to prometheus metrics:

counter vector arvados_dispatchcloud_driver_operations
  • number of cloud operations, split by operation type (op=Create/Destroy/List/SetTags) and result (error=0/1)
  • can be implemented as a driver proxy similar to rateLimitedInstanceSet in source:lib/dispatchcloud/driver.go
  • most likely usage in graphs/alerts is arvados_dispatchcloud_driver_operations{error="1"}
counter vector arvados_dispatchcloud_instances_disappeared
  • number of times an instance disappeared in cloud (see sync() in source:lib/dispatchcloud/worker/pool.go), split by state
  • most likely usage in graphs/alerts is arvados_dispatchcloud_instances_disappeared{state!="shutdown"}
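
The driver-proxy idea for arvados_dispatchcloud_driver_operations could be sketched roughly as follows. This is only an illustration under stated assumptions: the InstanceSet interface here is a hypothetical, trimmed-down stand-in (not the real driver interface), opCounter is a stdlib-only stand-in for a Prometheus counter vector with labels "op" and "error", and instrumentedInstanceSet and flakyDriver are made-up names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// InstanceSet is a hypothetical, simplified stand-in for the cloud driver
// interface; the real interface has different method signatures.
type InstanceSet interface {
	Create(name string) error
	Destroy(name string) error
	List() ([]string, error)
	SetTags(name string, tags map[string]string) error
}

// opCounter is a stand-in for a Prometheus counter vector with the labels
// "op" and "error" (assumption: a real implementation would use one).
type opCounter struct {
	mtx    sync.Mutex
	counts map[[2]string]int
}

func newOpCounter() *opCounter {
	return &opCounter{counts: map[[2]string]int{}}
}

// observe increments the counter for {op, error="0"} or {op, error="1"}.
func (c *opCounter) observe(op string, err error) {
	errLabel := "0"
	if err != nil {
		errLabel = "1"
	}
	c.mtx.Lock()
	defer c.mtx.Unlock()
	c.counts[[2]string{op, errLabel}]++
}

// get returns the current count for one label combination.
func (c *opCounter) get(op, errLabel string) int {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	return c.counts[[2]string{op, errLabel}]
}

// instrumentedInstanceSet is the proxy: it forwards every call to the
// wrapped InstanceSet and records the operation type and result.
type instrumentedInstanceSet struct {
	InstanceSet
	ops *opCounter
}

func (is instrumentedInstanceSet) Create(name string) error {
	err := is.InstanceSet.Create(name)
	is.ops.observe("Create", err)
	return err
}

func (is instrumentedInstanceSet) Destroy(name string) error {
	err := is.InstanceSet.Destroy(name)
	is.ops.observe("Destroy", err)
	return err
}

func (is instrumentedInstanceSet) List() ([]string, error) {
	names, err := is.InstanceSet.List()
	is.ops.observe("List", err)
	return names, err
}

func (is instrumentedInstanceSet) SetTags(name string, tags map[string]string) error {
	err := is.InstanceSet.SetTags(name, tags)
	is.ops.observe("SetTags", err)
	return err
}

// flakyDriver is a fake driver for demonstration only: Destroy always fails.
type flakyDriver struct{}

func (flakyDriver) Create(string) error                     { return nil }
func (flakyDriver) Destroy(string) error                    { return errors.New("cloud API error") }
func (flakyDriver) List() ([]string, error)                 { return nil, nil }
func (flakyDriver) SetTags(string, map[string]string) error { return nil }

func main() {
	ops := newOpCounter()
	var is InstanceSet = instrumentedInstanceSet{InstanceSet: flakyDriver{}, ops: ops}
	_ = is.Create("a")
	_ = is.Destroy("a")
	_ = is.Destroy("a")
	fmt.Println(ops.get("Create", "0"), ops.get("Destroy", "1")) // prints "1 2"
}
```

Wrapping the driver this way keeps the counting out of the individual cloud drivers, so every driver is covered by the same metric. The same labelled-counter pattern would apply to arvados_dispatchcloud_instances_disappeared, with a single "state" label incremented where the pool notices a missing instance.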

Related issues

Blocks Arvados - Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching (Resolved)
