Idea #12085
Add monitoring/alarm for failed/slow job dispatch & excess idle nodes (Closed)
Added by Tom Morris over 7 years ago. Updated over 6 years ago.
Description
We need some additional monitoring and alarms to catch situations like yesterday's crunch-dispatch.rb file descriptor issue.
Some suggestions for alarm conditions:
- more than N (15? 15% of running_nodes?) idle nodes for more than M (10?) minutes
- jobs queued for more than 15 minutes when there is idle capacity in the cluster (running_nodes < 0.95 * max_nodes)
The thresholds, sampling periods, and trigger periods can be adjusted as we gain experience with what's too little or too much. The goal is to ignore brief transients or normal steady-state churn, but quickly (< 1 hr) catch abnormal conditions which otherwise take us hours to notice on an ad hoc basis.
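As a rough illustration of the kind of check an alarm script could run against these conditions, here is a minimal sketch; the cluster object and its query helpers (idle_nodes, running_nodes, max_nodes, max_queued_minutes) are assumptions for illustration, not existing Arvados APIs, and the thresholds mirror the suggestions above:

    # Hypothetical alarm check for the suggested conditions.
    IDLE_NODE_THRESHOLD = 15   # N: how many idle nodes are too many
    IDLE_MINUTES = 10          # M: how long a node must stay idle
    QUEUE_MINUTES = 15         # max tolerated queue wait
    CAPACITY_RATIO = 0.95      # "idle capacity" means running < 0.95 * max

    def check_alarms(cluster):
        alarms = []
        long_idle = [n for n in cluster.idle_nodes()
                     if n.idle_minutes >= IDLE_MINUTES]
        if len(long_idle) > IDLE_NODE_THRESHOLD:
            alarms.append('%d nodes idle for more than %d minutes'
                          % (len(long_idle), IDLE_MINUTES))
        has_capacity = cluster.running_nodes() < CAPACITY_RATIO * cluster.max_nodes()
        if has_capacity and cluster.max_queued_minutes() > QUEUE_MINUTES:
            alarms.append('jobs queued more than %d minutes despite idle capacity'
                          % QUEUE_MINUTES)
        return alarms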
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-08-16 sprint to 2017-08-30 Sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-08-30 Sprint to 2017-09-13 Sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-09-13 Sprint to 2017-09-27 Sprint
Updated by Tom Morris over 7 years ago
- Target version changed from 2017-09-27 Sprint to 2017-10-11 Sprint
Updated by Tom Clegg over 7 years ago
- max queue time of any queued job/container (this should be implemented in apiserver, not nodemanager)
- number of alive compute nodes
- number of allocated compute nodes
- configured max compute nodes
- total hourly cost
- configured max hourly cost
- are we currently waiting for a node to turn off before trying again because we hit a quota?
- max idle time of any compute node
- max uptime of any compute node
- number of errors received from cloud provider by this process
- this process uptime
- number of occurrences of unpaired→shutdown transition (node was created but never pinged within configured boot-wait)
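To give this list a concrete shape, a nodemanager-side status snapshot carrying these values could look roughly like the dict below; the key names are illustrative assumptions (the branch discussed later settled on names like max_nodes, node_quota and idle_nodes) and the values are made up:

    status_snapshot = {
        'nodes_alive': 24,
        'nodes_allocated': 20,
        'max_nodes': 32,                  # configured max compute nodes
        'hourly_cost_total': 18.40,
        'hourly_cost_max': 25.00,         # configured max hourly cost
        'waiting_for_shutdown_after_quota': False,
        'max_node_idle_seconds': 420,
        'max_node_uptime_seconds': 86400,
        'cloud_errors': 3,                # errors received from the cloud provider
        'process_uptime_seconds': 52000,
        'boot_failures': 1,               # unpaired -> shutdown transitions
    }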
Updated by Tom Morris over 7 years ago
- Project changed from 40 to Arvados
- Target version changed from 2017-10-11 Sprint to Arvados Future Sprints
Updated by Tom Morris almost 7 years ago
- Target version changed from Arvados Future Sprints to 2018-03-28 Sprint
Updated by Lucas Di Pentima almost 7 years ago
- Assigned To set to Lucas Di Pentima
Updated by Lucas Di Pentima almost 7 years ago
- Status changed from New to In Progress
Updated by Nico César almost 7 years ago
Following up on note-5, I'll order these by priority for Ops needs, with the reason in parentheses:
- configured max compute nodes (this will help on all graphs, basically exposing the configuration )
- idle time of all compute nodes (we can get the max of this, but we also need to know which idle nodes are misbehaving)
- number of alive compute nodes and number of allocated compute nodes (these 2 metrics will give us an idea of the current state of costs versus actual use)
- number of occurrences of unpaired→shutdown transition (node was created but never pinged within configured boot-wait)
- number of errors received from cloud provider by this process
- number of exceptions generated by all actors in this process (and successfully caught)
- this process uptime (this is somewhat irrelevant since the process will suicide if a major problem happens, but also on a normal deploy. If it's easy to do, it's better to have it; otherwise just ignore it)
- total hourly cost
- configured max hourly cost
- max uptime of any compute node
- are we currently waiting for a node to turn off before trying again because we hit a quota?
- max queue time of any queued job/container (this should be implemented in apiserver, not nodemanager)
Updated by Lucas Di Pentima almost 7 years ago
- Target version changed from 2018-03-28 Sprint to 2018-04-11 Sprint
Updated by Lucas Di Pentima almost 7 years ago
Updates at a02012dd9 - branch 12085-anm-metrics
Test run: https://ci.curoverse.com/job/developer-run-tests/671/
- max_nodes: Expose nodemanager's configuration
- actor_exceptions: Actor non-fatal error counter
- cloud_errors: CLOUD_ERRORS exception counter
- boot_failures: Number of times any node goes from unpaired to shutdown
- idle_nodes: Hash with a counter for every node currently in the idle state, stating how many seconds it has been in that state. When a node leaves the idle state, it's removed from this hash (asked Nico about this behavior).
Also added tests for all the new stats.
Regarding the number of alive versus allocated nodes, the status tracker already shows how many nodes are in each state, so I think that's enough.
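For reference, a rough sketch of how such an idle-node tracker could keep the hash up to date; this is not the actual status tracker implementation, only the idle_in/idle_out names match the calls discussed later in this thread:

    import threading
    import time

    class IdleNodeTracker(object):
        """Sketch: per-node idle timestamps behind a lock."""

        def __init__(self):
            self._lock = threading.Lock()
            self._idle_since = {}    # hostname -> time the node went idle

        def idle_in(self, hostname):
            with self._lock:
                self._idle_since.setdefault(hostname, time.time())

        def idle_out(self, hostname):
            # Called when a node leaves the idle state (or disappears).
            with self._lock:
                self._idle_since.pop(hostname, None)

        def report(self):
            # The idle_nodes hash: seconds each node has been idle so far.
            now = time.time()
            with self._lock:
                return dict((host, now - since)
                            for host, since in self._idle_since.items())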
Updated by Peter Amstutz almost 7 years ago
In addition to max_nodes, it should also expose the current value of node_quota.
Could we make cloud_errors more specific? Like "create_node_errors", "destroy_node_errors", "list_node_errors"?
It looks like it is missing a call to idle_out() when a node disappears from the cloud node list?
Updated by Peter Amstutz almost 7 years ago
Would it make sense to add a counter like "time_spent_idle" which is the sum of node idle times since node manager started? It might be useful to get a sense of how much time is actually wasted.
Updated by Nico César almost 7 years ago
Peter Amstutz wrote:
Would it make sense to add a counter like "time_spent_idle" which is the sum of node idle times since node manager started? It might be useful to get a sense of how much time is actually wasted.
I think all aggregations can be done in the tool that uses this data. Even if it is a trivial thing in the code, I feel there will be turtles down the road when node manager restarts (suicide being the mother of all fallbacks in a-n-m).
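As an example of doing that aggregation on the consumer side, a monitoring tool could sample the idle_nodes hash on its own schedule and accumulate the total itself, which also survives node manager restarts; poll_status() below is a placeholder for however the tool fetches the status report:

    import time

    def poll_status():
        # Placeholder: fetch and parse node manager's status report.
        return {'idle_nodes': {'compute17': 93.0, 'compute21': 640.5}}

    INTERVAL = 60  # seconds between samples
    time_spent_idle = 0.0
    while True:
        status = poll_status()
        # Each sampling interval, every currently-idle node contributes
        # roughly INTERVAL seconds to the running total.
        time_spent_idle += INTERVAL * len(status.get('idle_nodes', {}))
        print('accumulated idle node-seconds: %.0f' % time_spent_idle)
        time.sleep(INTERVAL)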
Updated by Lucas Di Pentima almost 7 years ago
Updates at c18fb83a3
Test run: https://ci.curoverse.com/job/developer-run-tests/675/
- Added node_quota metric.
- Split cloud_errors into list_nodes_errors, create_node_errors and destroy_node_errors.
- Added missing status.tracker.idle_out() call when an idle node is detected to be missing from the cloud node list.
- Added/updated related tests.
Updated by Lucas Di Pentima almost 7 years ago
Rebased against latest master at 5fd2ed9e93670007226a1772040a966fb9dd4d22
Test run: https://ci.curoverse.com/job/developer-run-tests/676/
Updated by Peter Amstutz almost 7 years ago
if record.actor:
    try:
        # If it's paired and idle, stop its idle time counter
        # before removing the monitor actor.
        if record.actor.get_state().get() == 'idle':
            status.tracker.idle_out(
                record.actor.arvados_node.get()['hostname'])
        record.actor.stop()
    except pykka.ActorDeadError:
        pass
Take out if record.actor.get_state().get() == 'idle': and call it unconditionally.
I believe you can use record.arvados_node["hostname"] directly instead of calling the actor.
Actually I think the whole block should go outside of "if record.actor":
if record.arvados_node:
    status.tracker.idle_out(record.arvados_node.get('hostname'))
if record.actor:
    ...
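Putting both suggestions together, the removal path would look roughly like this (a sketch of the suggested shape, not the committed change; see 842c85cde below for the actual fix):

    if record.arvados_node:
        # Stop the idle time counter unconditionally, as suggested above,
        # using the record's own arvados_node instead of asking the actor.
        status.tracker.idle_out(record.arvados_node.get('hostname'))
    if record.actor:
        try:
            record.actor.stop()
        except pykka.ActorDeadError:
            pass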
Updated by Lucas Di Pentima almost 7 years ago
Suggestion addressed at 842c85cde
Test run: https://ci.curoverse.com/job/developer-run-tests/677/
Updated by Peter Amstutz almost 7 years ago
Lucas Di Pentima wrote:
Suggestion addressed at 842c85cde
Test run: https://ci.curoverse.com/job/developer-run-tests/677/
LGTM
Updated by Lucas Di Pentima almost 7 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset arvados|2fbdfebf757e5a9b53cf0a21facdf2bd3ea6c757.