Idea #11349

closed

[Node Manager] Add status URL for node manager

Added by Tom Morris about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: -
Target version:
Start date: 04/10/2017
Due date:
Story points: 2.0

Description

Implement an HTTP server that serves a status URL with JSON-formatted output.
The port number should be configurable.

Start with the data which is currently being logged:
  • List of node sizes
  • Number of nodes in each state
  • State of each node
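
As a rough illustration of the three items above, here is a minimal sketch that derives them from two plain dictionaries. The function name `summarize_nodes`, the input shapes (node name to state, node name to size), and the output keys are all hypothetical and are not taken from node manager itself.

```python
from collections import Counter


def summarize_nodes(node_states, node_sizes):
    """Build the three summaries from hypothetical name->state and name->size maps."""
    return {
        # Distinct node sizes currently known, in a stable order.
        "sizes": sorted(set(node_sizes.values())),
        # Number of nodes in each state.
        "state_counts": dict(Counter(node_states.values())),
        # State of each individual node.
        "node_states": dict(node_states),
    }
```

A caller could serialize the returned dictionary with `json.dumps` to get JSON output.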

Subtasks 2 (0 open, 2 closed)

Task #11375: Review 11349-nodemanager-status-api (Resolved, Tom Clegg, 04/10/2017)
Task #11447: update wiki (Resolved, Tom Clegg, 04/10/2017)

Related issues

Related to Arvados - Feature #11799: [Node manager] Publish status.json (Duplicate)
Related to Arvados - Idea #11836: [Nodemanager] Improve status.json for monitoring (Rejected, Peter Amstutz, 05/23/2018)
Actions #1

Updated by Tom Morris about 7 years ago

  • Description updated (diff)
  • Story points set to 2.0
Actions #2

Updated by Tom Clegg about 7 years ago

See source:sdk/python/tests/keepstub.py and source:sdk/python/tests/test_keep_client.py for examples of starting up a multithreaded HTTP server.

Suggest maintaining a global status variable, protected by a mutex, and just dumping its content in the status.json handler.
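
The suggestion above could be sketched roughly as follows. This is a minimal Python 3 illustration using only the standard library, not the code on the branch: the names `StatusTracker`, `StatusHandler`, `start_server`, and `fetch_status` are hypothetical, and the `/status.json` path is assumed from the discussion.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StatusTracker:
    """Hypothetical holder for the global status dict, guarded by a mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def update(self, **values):
        # Writers take the lock so readers never see a half-applied update.
        with self._lock:
            self._status.update(values)

    def to_json(self):
        with self._lock:
            return json.dumps(self._status, sort_keys=True)


tracker = StatusTracker()  # the single global status variable


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status.json":
            self.send_error(404)
            return
        # Just dump the current content of the global status variable.
        body = tracker.to_json().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real server would log requests


def start_server(address="127.0.0.1", port=0):
    """Serve in a daemon thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer((address, port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_status(port, address="127.0.0.1"):
    """Small client helper for trying the endpoint out."""
    conn = http.client.HTTPConnection(address, port, timeout=5)
    try:
        conn.request("GET", "/status.json")
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()
```

Other threads would call `tracker.update(nodes_up=3)` whenever their view of the world changes, so the handler never has to query them directly.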

Actions #3

Updated by Tom Clegg about 7 years ago

  • Assigned To set to Tom Clegg
  • Target version changed from Arvados Future Sprints to 2017-04-12 sprint
Actions #4

Updated by Tom Clegg about 7 years ago

  • Status changed from New to In Progress
Actions #5

Updated by Tom Clegg about 7 years ago

details (proposed):

New config section "[Manage]" with "address" (127.0.0.1) and "port" (default -1, which disables management server)

status.json response

{
  "nodes_up": 3,
  "nodes_shutdown": 1,
  "nodes_booting": 2,
  "nodes_wish": 4
}
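
For illustration, reading such a config section might look like the sketch below. It assumes that `address` is the listen address, that `port` is the port number, and that a negative port disables the management server. The helper `load_manage_config` and the example port value 8989 are hypothetical.

```python
import configparser

# Hypothetical config fragment; only the section and key names come from the proposal.
EXAMPLE_CONFIG = """
[Manage]
address = 127.0.0.1
port = 8989
"""


def load_manage_config(text):
    """Return (address, port, enabled) from config text, applying the proposed defaults."""
    parser = configparser.ConfigParser()
    # Assumed defaults: listen on localhost, port -1 (management server disabled).
    parser.read_dict({"Manage": {"address": "127.0.0.1", "port": "-1"}})
    parser.read_string(text)
    address = parser.get("Manage", "address")
    port = parser.getint("Manage", "port")
    # Assumption: any negative port means "do not start the server".
    return address, port, port >= 0
```

With an empty config the defaults apply and the server stays disabled; with `EXAMPLE_CONFIG` it would listen on 127.0.0.1:8989.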
Actions #6

Updated by Tom Clegg about 7 years ago

11349-nodemanager-status-api @ ab9a73d2c0b567d3c05d1d4d8463633a69eafda2

Actions #7

Updated by Nico César about 7 years ago

I see two clusters of questions that I get often: one group is "orchestration-related" or "pipeline-wide", and the other is about the resources inside a node while a job is running.

From the first group, I usually get questions like these (which this should help with):
  • "why my job is queued for X hours?" -> having a historical # nodes in wishlist could potentially give a clue.
  • "my pipeline ran for 24 hours, which nodes did it use? " -> having a correlation of node with the pipeline helps.
From the second group:
  • "is my node actually doing something?" -> having a "node38: up" doesn't say much, I think that's a question to answer with logs
  • "how many cores/ram/big should my nodes have/be?" -> this is an analysis with the resources inside the node

So I think we can pull information from node manager to answer the first group. Usually this implies that the "node size" isn't as important as "how long has it been up, and in which state", so uniquely identifying the node and then being able to plot that is good. But I have to admit that too much detail could turn this into a Logstash nightmare-adventure I don't want to go on, so some summarized state values are a good first step.

the proposal is good:

{
  "nodes_up": 3,
  "nodes_shutdown": 1,
  "nodes_booting": 2,
  "nodes_wish": 4
}

Later it will be good to have unique node names and a way to report them over time (which is very difficult when they haven't been born yet and are still in the "nodes_wish" pile).

Actions #8

Updated by Tom Clegg about 7 years ago

11349-nodemanager-status-api @ e7876a3ac520b128be7836e30172079ab2af5e45

Actions #9

Updated by Lucas Di Pentima about 7 years ago

Local test run was successful

Questions:
  • services/nodemanager/arvnodeman/status.py - Do you think it would be a good idea to log a message indicating when no management server is started (and maybe the reason)?
  • services/nodemanager/tests/test_status.py:43 - Is that assertion superfluous given the following one? If it's to prove that old values remain, can it be checked outside the loop?
  • Is the state of each node going to be included? (asking because it's mentioned on the story description)
Actions #10

Updated by Tom Clegg about 7 years ago

Lucas Di Pentima wrote:

Local test run was successful
  • services/nodemanager/arvnodeman/status.py - Do you think it would be a good idea to log a message indicating when no management server is started (and maybe the reason)?

Yes, added.

        if not self.enabled:
            _logger.warning("Management server disabled. "+
                            "Use [Manage] config section to enable.")
            return
  • services/nodemanager/tests/test_status.py:43 - Is that assertion superfluous given the following one? If it's to prove that old values remain, can it be checked outside the loop?

Yes, moved it outside the loop.

  • Is the state of each node going to be included? (asking because it's mentioned on the story description)

Indeed, we seem to have changed our minds about that: for now we just want a summary that we can graph easily.

Suggest adding "/nodes.json" with info about each node. (Not sure if we should keep this issue open for it or make a new one.)

11349-nodemanager-status-api @ a779382603d2da2ec38ceb8a21262cc4f151f077

Actions #11

Updated by Lucas Di Pentima about 7 years ago

LGTM, thanks!

Actions #12

Updated by Tom Clegg about 7 years ago

  • Status changed from In Progress to Resolved