Story #11349

[Node Manager] Add status URL for node manager

Added by Tom Morris 3 months ago. Updated 2 months ago.

Status: Resolved
Start date: 04/10/2017
Priority: Normal
Due date:
Assignee: Tom Clegg
% Done: 100%
Category: -
Target version: 2017-04-12 sprint
Story points: 2.0
Remaining (hours): 0.00 hour
Velocity based estimate: -

Description

Implemented an HTTP server which serves a status URL with JSON-formatted output
Configurable port number

Start with the data which is currently being logged:
  • List of node sizes
  • Number of nodes in each state
  • State of each node

Subtasks

Task #11375: Review 11349-nodemanager-status-api (Resolved, Tom Clegg)

Task #11447: update wiki (Resolved, Tom Clegg)


Related issues

Related to Arvados - Feature #11799: [Node manager] Publish status.json (Duplicate)
Related to Arvados - Story #11836: [Nodemanager] Improve status.json for monitoring (In Progress)

Associated revisions

Revision 2c094e28
Added by Tom Clegg 2 months ago

Merge branch '11349-nodemanager-status-api'

refs #11349

Revision a6be53f6
Added by Tom Clegg 2 months ago

Build packages for python "future" module.

refs #11349
refs #11308

Revision 1b290e51
Added by Tom Clegg 2 months ago

11349: Fix section name in example configs.

refs #11349

History

#1 Updated by Tom Morris 3 months ago

  • Description updated
  • Story points set to 2.0

#2 Updated by Tom Clegg 3 months ago

See source:sdk/python/tests/keepstub.py and source:sdk/python/tests/test_keep_client.py for an example of starting up a multithreaded HTTP server.

Suggest maintaining a global status variable, protected by a mutex, and just dumping its content in the status.json handler.
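
The suggestion above could be sketched roughly as follows. This is a minimal illustration in Python 3, not the code on the branch: the names `StatusTracker`, `StatusHandler`, `tracker` and `start_server` are hypothetical, and only the `/status.json` path comes from this issue.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StatusTracker:
    """Shared status values, guarded by a lock (hypothetical name)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def update(self, **changes):
        # Merge new values; keys not mentioned keep their previous value.
        with self._lock:
            self._values.update(changes)

    def to_json(self):
        # Serialize under the lock so a reader never sees a half-applied update.
        with self._lock:
            return json.dumps(self._values, sort_keys=True)


tracker = StatusTracker()  # the "global status variable"


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only one path is served; everything else is a 404.
        if self.path != "/status.json":
            self.send_error(404)
            return
        body = tracker.to_json().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet; a real service would log requests


def start_server(address, port):
    """Serve status.json from a daemon thread and return the server object."""
    server = ThreadingHTTPServer((address, port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Other parts of the program would call `tracker.update(nodes_up=3)` whenever a count changes, and the handler just dumps whatever is current. Keeping serialization inside the lock means the handler threads never need to know how the values are produced.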

#3 Updated by Tom Clegg 3 months ago

  • Assignee set to Tom Clegg
  • Target version changed from Arvados Future Sprints to 2017-04-12 sprint

#4 Updated by Tom Clegg 3 months ago

  • Status changed from New to In Progress

#5 Updated by Tom Clegg 3 months ago

details (proposed):

New config section "[Manage]" with "address" (default 127.0.0.1) and "port" (default -1, which disables management server)
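
A minimal sketch of how such a section might be read, assuming "address" defaults to 127.0.0.1 and a "port" of -1 means the management server stays off. The function name `load_manage_settings`, the `DEFAULTS` dict and the `port >= 0` check are illustrative assumptions, not taken from the branch.

```python
import configparser

# Assumed defaults: listen on loopback, port -1 means "disabled".
DEFAULTS = {"Manage": {"address": "127.0.0.1", "port": "-1"}}


def load_manage_settings(config_text):
    """Return (address, port, enabled) from INI-style config text."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)       # start from the defaults
    parser.read_string(config_text)  # let the config file override them
    address = parser.get("Manage", "address")
    port = parser.getint("Manage", "port")
    enabled = port >= 0              # a negative port disables the server
    return address, port, enabled
```

With an empty config this yields a disabled server on the loopback address; a config containing only `port = 8989` under `[Manage]` (an example value) enables it without having to restate the address.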

status.json response

{
  "nodes_up": 3,
  "nodes_shutdown": 1,
  "nodes_booting": 2,
  "nodes_wish": 4
}

#6 Updated by Tom Clegg 3 months ago

11349-nodemanager-status-api @ ab9a73d2c0b567d3c05d1d4d8463633a69eafda2

#7 Updated by Nico César 3 months ago

I see two clusters of questions that I get often: one group is "orchestration-related" or "pipeline-wide", and the other is about the resources inside a node while a job is running.

From the first group I usually get questions like these (which this should help with):
  • "why has my job been queued for X hours?" -> having a historical count of nodes in the wishlist could potentially give a clue.
  • "my pipeline ran for 24 hours, which nodes did it use?" -> having a correlation of nodes with the pipeline helps.
From the second group:
  • "is my node actually doing something?" -> having a "node38: up" doesn't say much; I think that's a question to answer with logs.
  • "how many cores/ram/big should my nodes have/be?" -> this is an analysis of the resources inside the node.

So I think we can pull information from node manager to answer the first group. Usually this implies that the "node size" isn't as important as "how long has it been up and in which state", so uniquely identifying the node and then being able to plot that is good. But I have to admit that too much detail could turn this into a Logstash nightmare-adventure I don't want to go on, so some summarized state values are a good first step.

the proposal is good:

{
  "nodes_up": 3,
  "nodes_shutdown": 1,
  "nodes_booting": 2,
  "nodes_wish": 4
}

Later it will be good to have unique node names and a way to report them over time (which is very difficult when they haven't been born yet and are still in the "nodes_wish" pile).
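
As a rough illustration of how a monitoring client could build that kind of history from the summary counts, here is a small polling sketch. The function names are hypothetical, the URL is whatever address and port the management server is configured with, and only the four `nodes_*` keys come from the proposal above.

```python
import json
import time
import urllib.request

# Keys from the proposed status.json response.
KEYS = ("nodes_up", "nodes_booting", "nodes_shutdown", "nodes_wish")


def fetch_status(url):
    """GET the status document and decode it into a dict."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def to_row(status, timestamp, keys=KEYS):
    # Missing keys become None so a gap is visible rather than plotted as zero.
    return [timestamp] + [status.get(k) for k in keys]


def poll(url, interval_seconds, samples):
    """Collect `samples` rows, one every `interval_seconds`."""
    rows = []
    for _ in range(samples):
        rows.append(to_row(fetch_status(url), time.time()))
        time.sleep(interval_seconds)
    return rows
```

Each row is a timestamp followed by the counts, which is enough to graph the wishlist size over time without tracking individual nodes.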

#8 Updated by Tom Clegg 2 months ago

11349-nodemanager-status-api @ e7876a3ac520b128be7836e30172079ab2af5e45

#9 Updated by Lucas Di Pentima 2 months ago

Local test run was successful

Questions:
  • services/nodemanager/arvnodeman/status.py - Do you think it would be a good idea to log messages indicating when no management server is started (and maybe the reason)?
  • services/nodemanager/tests/test_status.py:43 - Is that assertion superfluous given the following one? If it’s to prove that old values remain, can it be checked outside the loop?
  • Is the state of each node going to be included? (asking because it's mentioned on the story description)

#10 Updated by Tom Clegg 2 months ago

Lucas Di Pentima wrote:

Local test run was successful
  • services/nodemanager/arvnodeman/status.py - Do you think it would be a good idea to log messages indicating when no management server is started (and maybe the reason)?

Yes, added.

        if not self.enabled:
            _logger.warning("Management server disabled. "+
                            "Use [Manage] config section to enable.")
            return

  • services/nodemanager/tests/test_status.py:43 - Is that assertion superfluous given the following one? If it’s to prove that old values remain, can it be checked outside the loop?

Yes, moved it outside the loop.

  • Is the state of each node going to be included? (asking because it's mentioned on the story description)

Indeed, we seem to have changed our minds about that: for now we just want a summary that we can graph easily.

Suggest adding "/nodes.json" with info about each node. (Not sure if we should keep this issue open for it or make a new one.)

11349-nodemanager-status-api @ a779382603d2da2ec38ceb8a21262cc4f151f077

#11 Updated by Lucas Di Pentima 2 months ago

LGTM, thanks!

#12 Updated by Tom Clegg 2 months ago

  • Status changed from In Progress to Resolved
