Story #11349

[Node Manager] Add status URL for node manager

Added by Tom Morris about 1 month ago. Updated 16 days ago.

Status: Resolved
Start date: 04/10/2017
Priority: Normal
Due date:
Assignee: Tom Clegg
% Done: 100%
Category: -
Target version: 2017-04-12 sprint
Story points: 2.0
Remaining (hours): 0.00 hour
Velocity based estimate: -

Description

Implement an HTTP server that serves a status URL with JSON-formatted output.
The port number is configurable.

Start with the data which is currently being logged:
  • List of node sizes
  • Number of nodes in each state
  • State of each node

Subtasks

Task #11375: Review 11349-nodemanager-status-api (Resolved, Tom Clegg)

Task #11447: update wiki (Resolved, Tom Clegg)

Associated revisions

Revision 2c094e28
Added by Tom Clegg 17 days ago

Merge branch '11349-nodemanager-status-api'

refs #11349

Revision a6be53f6
Added by Tom Clegg 17 days ago

Build packages for python "future" module.

refs #11349
refs #11308

Revision 1b290e51
Added by Tom Clegg 16 days ago

11349: Fix section name in example configs.

refs #11349

History

#1 Updated by Tom Morris about 1 month ago

  • Description updated (diff)
  • Story points set to 2.0

#2 Updated by Tom Clegg about 1 month ago

See source:sdk/python/tests/keepstub.py and source:sdk/python/tests/test_keep_client.py for an example of starting up a multithreaded HTTP server.

Suggest maintaining a global status variable, protected by a mutex, and just dumping its content in the status.json handler.
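
The suggestion above could be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in the branch: it uses Python 3's standard http.server, and the names Tracker, tracker, StatusHandler, and start_server are hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Tracker:
    """Holds the latest status values behind a mutex (hypothetical name)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}

    def update(self, values):
        # Merge new values; keys not mentioned keep their previous values.
        with self._lock:
            self._latest.update(values)

    def get_json(self):
        # Serialize under the lock so readers never see a half-applied update.
        with self._lock:
            return json.dumps(self._latest, sort_keys=True)


tracker = Tracker()  # the module-level ("global") status variable


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != '/status.json':
            self.send_error(404)
            return
        body = tracker.get_json().encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # suppress per-request logging in this sketch


def start_server(address='127.0.0.1', port=0):
    """Serve status.json from a daemon thread; port 0 picks a free port."""
    server = ThreadingHTTPServer((address, port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Any component can then call tracker.update({"nodes_up": 3}) from its own thread, and the handler only ever dumps the current contents.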

#3 Updated by Tom Clegg about 1 month ago

  • Assignee set to Tom Clegg
  • Target version changed from Arvados Future Sprints to 2017-04-12 sprint

#4 Updated by Tom Clegg 21 days ago

  • Status changed from New to In Progress

#5 Updated by Tom Clegg 21 days ago

details (proposed):

New config section "[Manage]" with "address" (default 127.0.0.1) and "port" (default -1, which disables the management server)
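
A minimal sketch of how such a section could be read, assuming Python 3's configparser, an "address" default of 127.0.0.1, and that a negative "port" leaves the management server off. The function name load_manage_settings and the port value 8989 in the usage note are illustrative assumptions, not taken from the branch.

```python
import configparser

# Assumed defaults: listen on loopback, and port -1 means "disabled".
MANAGE_DEFAULTS = {'address': '127.0.0.1', 'port': '-1'}


def load_manage_settings(config_text):
    """Return (address, port, enabled) from the [Manage] section."""
    parser = configparser.ConfigParser()
    parser.read_dict({'Manage': MANAGE_DEFAULTS})  # install defaults first
    parser.read_string(config_text)                # then apply user config
    address = parser.get('Manage', 'address')
    port = parser.getint('Manage', 'port')
    return address, port, port >= 0
```

With an empty config this returns ('127.0.0.1', -1, False); a config containing "[Manage]" and "port = 8989" returns ('127.0.0.1', 8989, True).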

status.json response

{
  "nodes_up": 3,
  "nodes_shutdown": 1,
  "nodes_booting": 2,
  "nodes_wish": 4
}

#6 Updated by Tom Clegg 21 days ago

11349-nodemanager-status-api @ ab9a73d2c0b567d3c05d1d4d8463633a69eafda2

#7 Updated by Nico César 21 days ago

I see two clusters of questions that I get often: one group is "orchestration-related" or "pipeline-wide", and the other group is about the resources inside a node while a job is running.

From the first group I usually get questions like these (which this should help with):
  • "why my job is queued for X hours?" -> having a historical # of nodes in the wishlist could potentially give a clue.
  • "my pipeline ran for 24 hours, which nodes did it use?" -> having a correlation of nodes with the pipeline helps.
From the second group:
  • "is my node actually doing something?" -> having a "node38: up" doesn't say much; I think that's a question to answer with logs.
  • "how many cores/ram/big should my nodes have/be?" -> this is an analysis of the resources inside the node.

So I think we can pull information from node manager to answer the first group. Usually this implies that the "node size" isn't as important as "how long has it been up and in which state", so uniquely identifying the node and then being able to plot that is good. But I have to admit that too much detail could turn this into a Logstash nightmare-adventure I don't want to go on, so some summarized state values are a good first step.

the proposal is good:

{
  "nodes_up": 3,
  "nodes_shutdown": 1,
  "nodes_booting": 2,
  "nodes_wish": 4
}

Later it will be good to have unique node names and a way to report them over time (which is very difficult when they haven't been born yet and are still in the "nodes_wish" pile).
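
The historical view of the summary counters (for example, the number of nodes in the wishlist over time) could be built by sampling status.json periodically. A rough sketch, assuming the four counters from the proposal and a CSV file as the store; the function name append_sample and the choice of CSV are illustrative assumptions only.

```python
import csv
import json

# Column order for each sample; the counter names come from the proposal.
FIELDS = ['timestamp', 'nodes_up', 'nodes_shutdown',
          'nodes_booting', 'nodes_wish']


def append_sample(out, timestamp, status_text):
    """Append one CSV row built from a status.json response body.

    Counters missing from the response are recorded as 0.
    """
    status = json.loads(status_text)
    row = {'timestamp': timestamp}
    for key in FIELDS[1:]:
        row[key] = status.get(key, 0)
    csv.DictWriter(out, fieldnames=FIELDS).writerow(row)
```

Calling this once a minute with the current time and the fetched response body gives a table that can be graphed directly.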

#8 Updated by Tom Clegg 18 days ago

11349-nodemanager-status-api @ e7876a3ac520b128be7836e30172079ab2af5e45

#9 Updated by Lucas Di Pentima 17 days ago

Local test run was successful

Questions:
  • services/nodemanager/arvnodeman/status.py - Do you think it would be a good idea to log a message indicating when no management server is started (and maybe the reason)?
  • services/nodemanager/tests/test_status.py:43 - Is that assertion superfluous given the following one? If it's to prove that old values remain, can it be checked outside the loop?
  • Is the state of each node going to be included? (asking because it's mentioned on the story description)

#10 Updated by Tom Clegg 17 days ago

Lucas Di Pentima wrote:

Local test run was successful
  • services/nodemanager/arvnodeman/status.py - Do you think it would be a good idea to log a message indicating when no management server is started (and maybe the reason)?

Yes, added.

        if not self.enabled:
            _logger.warning("Management server disabled. "+
                            "Use [Manage] config section to enable.")
            return

  • services/nodemanager/tests/test_status.py:43 - Is that assertion superfluous given the following one? If it's to prove that old values remain, can it be checked outside the loop?

Yes, moved it outside the loop.

  • Is the state of each node going to be included? (asking because it's mentioned on the story description)

Indeed, we seem to have changed our minds about that: for now we just want a summary that we can graph easily.

Suggest adding "/nodes.json" with info about each node. (Not sure if we should keep this issue open for it or make a new one.)

11349-nodemanager-status-api @ a779382603d2da2ec38ceb8a21262cc4f151f077

#11 Updated by Lucas Di Pentima 17 days ago

LGTM, thanks!

#12 Updated by Tom Clegg 16 days ago

  • Status changed from In Progress to Resolved
