Feature #15340

closed

[arvados-dispatch-cloud] Error-counting metrics

Added by Tom Clegg almost 5 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Story points: 1.0
Release relationship: Auto

Description

Add to Prometheus metrics:

counter vector arvados_dispatchcloud_driver_operations
  • number of cloud operations, split by operation type (op=Create/Destroy/List/SetTags) and result (error=0/1)
  • can be implemented as a driver proxy, similar to rateLimitedInstanceSet in source:lib/dispatchcloud/driver.go
  • most likely usage in graphs/alerts is arvados_dispatchcloud_driver_operations{error="1"}
counter vector arvados_dispatchcloud_instances_disappeared
  • number of times an instance disappeared in the cloud (see sync() in source:lib/dispatchcloud/worker/pool.go), split by state
  • most likely usage in graphs/alerts is arvados_dispatchcloud_instances_disappeared{state!="shutdown"}
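
The driver-proxy idea for arvados_dispatchcloud_driver_operations could look roughly like the sketch below. It is only an illustration, not the code that was merged: the InstanceSet interface here is a hypothetical, trimmed-down stand-in for the real driver interface, opCounter is a stdlib-only stand-in for a Prometheus counter vector with "op" and "error" labels, and flakyDriver is a made-up fake used to exercise the wrapper.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// InstanceSet is a hypothetical, simplified driver interface. The real
// interface in lib/dispatchcloud has more methods and different signatures.
type InstanceSet interface {
	Create(name string) error
	List() ([]string, error)
}

// opCounter is a stand-in (assumption) for a Prometheus counter vector
// labeled by op and error. Keys are {op, error} label pairs.
type opCounter struct {
	mtx    sync.Mutex
	counts map[[2]string]int
}

func newOpCounter() *opCounter {
	return &opCounter{counts: map[[2]string]int{}}
}

// inc adds one to the counter for op, with error="1" if err is non-nil
// and error="0" otherwise.
func (c *opCounter) inc(op string, err error) {
	errLabel := "0"
	if err != nil {
		errLabel = "1"
	}
	c.mtx.Lock()
	defer c.mtx.Unlock()
	c.counts[[2]string{op, errLabel}]++
}

// get returns the current count for the given label pair.
func (c *opCounter) get(op, errLabel string) int {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	return c.counts[[2]string{op, errLabel}]
}

// countingInstanceSet wraps another InstanceSet and counts every call,
// passing results and errors through unchanged.
type countingInstanceSet struct {
	InstanceSet
	ops *opCounter
}

func (s countingInstanceSet) Create(name string) error {
	err := s.InstanceSet.Create(name)
	s.ops.inc("Create", err)
	return err
}

func (s countingInstanceSet) List() ([]string, error) {
	names, err := s.InstanceSet.List()
	s.ops.inc("List", err)
	return names, err
}

// flakyDriver is a fake driver whose Create fails on an empty name.
type flakyDriver struct{ names []string }

func (d *flakyDriver) Create(name string) error {
	if name == "" {
		return errors.New("empty instance name")
	}
	d.names = append(d.names, name)
	return nil
}

func (d *flakyDriver) List() ([]string, error) { return d.names, nil }

func main() {
	ops := newOpCounter()
	var set InstanceSet = countingInstanceSet{InstanceSet: &flakyDriver{}, ops: ops}
	_ = set.Create("a") // succeeds: counted with error="0"
	_ = set.Create("")  // fails: counted with error="1"
	_, _ = set.List()
	fmt.Println(ops.get("Create", "0"), ops.get("Create", "1"), ops.get("List", "0"))
}
```

Because the wrapper satisfies the same interface as the driver it wraps, callers would not need to change; in a real implementation the opCounter would presumably be replaced by a counter vector registered with the dispatcher's metrics registry.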

Related issues

Blocks Arvados - Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching (Resolved)

#1

Updated by Tom Clegg almost 5 years ago

  • Blocks Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
#2

Updated by Tom Morris almost 5 years ago

  • Target version set to Arvados Future Sprints
  • Story points set to 1.0
#3

Updated by Tom Clegg almost 5 years ago

  • Description updated (diff)
#6

Updated by Tom Clegg almost 5 years ago

  • Status changed from New to In Progress
  • Assigned To set to Tom Clegg
#7

Updated by Tom Clegg almost 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100
#8

Updated by Ward Vandewege almost 5 years ago

  • Target version changed from Arvados Future Sprints to 2019-06-19 Sprint
#9

Updated by Peter Amstutz about 4 years ago

  • Release set to 22