Feature #15340

[arvados-dispatch-cloud] Error-counting metrics

Added by Tom Clegg 18 days ago. Updated 8 days ago.

Add to prometheus metrics:

counter vector arvados_dispatchcloud_driver_operations
  • number of cloud operations, split by operation type (op=Create/Destroy/List/SetTags) and result (error=0/1)
  • can be implemented as a driver proxy similar to rateLimitedInstanceSet in source:lib/dispatchcloud/driver.go
  • most likely usage in graphs/alerts is arvados_dispatchcloud_driver_operations{error="1"}
counter vector arvados_dispatchcloud_instances_disappeared
  • number of times an instance disappeared in cloud (see sync() in source:lib/dispatchcloud/worker/pool.go), split by state
  • most likely usage in graphs/alerts is arvados_dispatchcloud_instances_disappeared{state!="shutdown"}
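
The two counters above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the code merged for this issue: counterVec is a hypothetical stand-in for a Prometheus counter vector, instanceSet is a simplified, hypothetical version of the cloud driver interface (the real one in lib/dispatchcloud has different signatures and also covers List and SetTags), and countDisappeared is a simplified take on the check done in sync(). The state names other than "shutdown" are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// counterVec is a hypothetical stdlib-only stand-in for a Prometheus
// counter vector: one integer counter per label string.
type counterVec struct {
	mtx    sync.Mutex
	counts map[string]int
}

func newCounterVec() *counterVec { return &counterVec{counts: map[string]int{}} }

func (cv *counterVec) inc(labels string) {
	cv.mtx.Lock()
	defer cv.mtx.Unlock()
	cv.counts[labels]++
}

func (cv *counterVec) get(labels string) int {
	cv.mtx.Lock()
	defer cv.mtx.Unlock()
	return cv.counts[labels]
}

// instanceSet is a simplified, hypothetical driver interface.
type instanceSet interface {
	Create(name string) error
	Destroy(name string) error
}

// countingInstanceSet is a driver proxy: it forwards each call to the
// wrapped instanceSet and counts it by operation type and result.
type countingInstanceSet struct {
	instanceSet
	ops *counterVec
}

// count increments the counter for op with error=0 or error=1, then
// passes the error through unchanged.
func (c countingInstanceSet) count(op string, err error) error {
	errLabel := "0"
	if err != nil {
		errLabel = "1"
	}
	c.ops.inc("op=" + op + ",error=" + errLabel)
	return err
}

func (c countingInstanceSet) Create(name string) error {
	return c.count("Create", c.instanceSet.Create(name))
}

func (c countingInstanceSet) Destroy(name string) error {
	return c.count("Destroy", c.instanceSet.Destroy(name))
}

// countDisappeared increments cv once, labeled by last known state, for
// every known instance that is missing from the cloud's listing.
func countDisappeared(known map[string]string, listed map[string]bool, cv *counterVec) {
	for name, state := range known {
		if !listed[name] {
			cv.inc("state=" + state)
		}
	}
}

// flakySet is a fake driver whose Destroy calls always fail.
type flakySet struct{}

func (flakySet) Create(string) error  { return nil }
func (flakySet) Destroy(string) error { return errors.New("simulated cloud API error") }

func main() {
	ops := newCounterVec()
	var is instanceSet = countingInstanceSet{instanceSet: flakySet{}, ops: ops}
	_ = is.Create("a")
	_ = is.Create("b")
	_ = is.Destroy("a")
	fmt.Println(ops.get("op=Create,error=0"), ops.get("op=Destroy,error=1")) // prints "2 1"

	gone := newCounterVec()
	countDisappeared(
		map[string]string{"a": "running", "b": "shutdown", "c": "idle"},
		map[string]bool{"c": true}, // only "c" is still listed by the cloud
		gone)
	fmt.Println(gone.get("state=running"), gone.get("state=shutdown"), gone.get("state=idle")) // prints "1 1 0"
}
```

Wrapping the driver keeps the counting out of the driver implementations themselves, so every driver would get the same metrics without changes. A real implementation would presumably register actual Prometheus counter vectors instead of the map used here.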

Related issues

Blocks Arvados - Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching (New)

Associated revisions

Revision 8b9fef1b
Added by Tom Clegg 8 days ago

Merge branch '15340-error-counters'

closes #15340

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>

Revision 2f4429e6
Added by Tom Clegg 4 days ago

Merge branch '15340-error-counters'

refs #15340

Arvados-DCO-1.1-Signed-off-by: Tom Clegg <>


#1 Updated by Tom Clegg 18 days ago

  • Blocks Story #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added

#2 Updated by Tom Morris 13 days ago

  • Story points set to 1.0
  • Target version set to Arvados Future Sprints

#3 Updated by Tom Clegg 11 days ago

  • Description updated

#6 Updated by Tom Clegg 8 days ago

  • Assigned To set to Tom Clegg
  • Status changed from New to In Progress

#7 Updated by Tom Clegg 8 days ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

#8 Updated by Ward Vandewege 8 days ago

  • Target version changed from Arvados Future Sprints to 2019-06-19 Sprint
