Feature #15340
[arvados-dispatch-cloud] Error-counting metrics (closed)
Story points: 1.0
Release:
Release relationship: Auto
Description
Add to Prometheus metrics:
counter vector arvados_dispatchcloud_driver_operations
- number of cloud operations, split by operation type (op=Create/Destroy/List/SetTags) and result (error=0/1)
- can be implemented as a driver proxy similar to rateLimitedInstanceSet in source:lib/dispatchcloud/driver.go
- most likely usage in graphs/alerts is
arvados_dispatchcloud_driver_operations{error="1"}
arvados_dispatchcloud_instances_disappeared
- number of times an instance disappeared in the cloud (see sync() in source:lib/dispatchcloud/worker/pool.go), split by state
- most likely usage in graphs/alerts is
arvados_dispatchcloud_instances_disappeared{state!="shutdown"}
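The driver-proxy idea for the first metric could look roughly like the sketch below. It is only an illustration, not the code from the branch: the InstanceSet interface here is a hypothetical, trimmed-down stand-in for the real driver interface, opCounter is a stdlib-only stand-in for a Prometheus counter vector with labels op and error, and flakyDriver and runDemo exist purely for the demonstration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// InstanceSet is a hypothetical, trimmed-down stand-in for the cloud driver
// interface; the real one in lib/dispatchcloud has more methods and
// different signatures.
type InstanceSet interface {
	Create(name string) error
	Instances() ([]string, error)
}

// opCounter is an assumed stand-in for a Prometheus counter vector with
// labels "op" and "error". Keys are [2]string{op, errorLabel}.
type opCounter struct {
	mu     sync.Mutex
	counts map[[2]string]int
}

func newOpCounter() *opCounter {
	return &opCounter{counts: map[[2]string]int{}}
}

// inc adds one to the counter for op, labelled error="1" if err is non-nil
// and error="0" otherwise.
func (c *opCounter) inc(op string, err error) {
	errLabel := "0"
	if err != nil {
		errLabel = "1"
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[[2]string{op, errLabel}]++
}

func (c *opCounter) get(op, errLabel string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[[2]string{op, errLabel}]
}

// countedInstanceSet is the proxy: it forwards each call to the wrapped
// InstanceSet and records the outcome before returning it unchanged.
type countedInstanceSet struct {
	InstanceSet
	ops *opCounter
}

func (s countedInstanceSet) Create(name string) error {
	err := s.InstanceSet.Create(name)
	s.ops.inc("Create", err)
	return err
}

func (s countedInstanceSet) Instances() ([]string, error) {
	list, err := s.InstanceSet.Instances()
	s.ops.inc("List", err)
	return list, err
}

// flakyDriver is a fake driver whose Create fails for an empty name.
type flakyDriver struct{ names []string }

func (d *flakyDriver) Create(name string) error {
	if name == "" {
		return errors.New("empty instance name")
	}
	d.names = append(d.names, name)
	return nil
}

func (d *flakyDriver) Instances() ([]string, error) { return d.names, nil }

// runDemo performs one successful Create, one failing Create and one List
// through the proxy, then returns the counter for inspection.
func runDemo() *opCounter {
	ops := newOpCounter()
	var set InstanceSet = countedInstanceSet{InstanceSet: &flakyDriver{}, ops: ops}
	_ = set.Create("worker-1")
	_ = set.Create("") // fails, so it is counted with error="1"
	_, _ = set.Instances()
	return ops
}

func main() {
	ops := runDemo()
	fmt.Println("Create ok:", ops.get("Create", "0"))
	fmt.Println("Create errors:", ops.get("Create", "1"))
	fmt.Println("List ok:", ops.get("List", "0"))
}
```

Because the proxy satisfies the same interface as the driver it wraps, callers need no changes. In a real implementation the in-memory map would presumably be replaced by a registered Prometheus counter vector.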
Related issues
Updated by Tom Clegg over 5 years ago
- Blocks Idea #13908: [Epic] Replace SLURM for cloud job scheduling/dispatching added
Updated by Tom Morris over 5 years ago
- Target version set to Arvados Future Sprints
- Story points set to 1.0
Updated by Tom Clegg over 5 years ago
Updated by Ward Vandewege over 5 years ago
Tom Clegg wrote:
15340-error-counters @ 42966c194493f8e42e26e3d64880e5c93a9c3251 -- https://ci.curoverse.com/view/Developer/job/developer-run-tests/1312/
Assuming there are no further test failures in https://ci.curoverse.com/view/Developer/job/developer-run-tests/1314/, 15340-error-counters @ 92d72b35447cf7210728725b217812492b3855e0 LGTM
Updated by Tom Clegg over 5 years ago
- Status changed from New to In Progress
- Assigned To set to Tom Clegg
Updated by Tom Clegg over 5 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|8b9fef1bf288427d6581a229c2663a96915501b2.
Updated by Ward Vandewege over 5 years ago
- Target version changed from Arvados Future Sprints to 2019-06-19 Sprint