Story #11836

[Nodemanager] Improve status.json for monitoring

Added by Peter Amstutz over 3 years ago. Updated over 2 years ago.

Status:
Rejected
Priority:
Normal
Assigned To:
Category:
-
Target version:
-
Start date:
05/23/2018
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
0.5

Description

11836-nodemanager-status-json

Example status.json

Lists node counts, information about node sizes, and individual node details.

{
  "nodes_down": 0,
  "status": "OK",
  "node_compute-ah091aeky2404it-zzzzz": null,
  "timestamp": "2017-06-08T14:26:19Z",
  "nodes_idle": 0,
  "nodes_wish": 4,
  "nodes_shutdown": 0,
  "nodes_booting": 0,
  "size_Standard_D3": {
    "nodes_down": 0,
    "disk": 200,
    "name": "Standard_D3",
    "nodes_unpaired": 2,
    "nodes_idle": 0,
    "ram": 3325,
    "price": 0.56,
    "bandwidth": 0,
    "nodes_shutdown": 0,
    "nodes_booting": 0,
    "nodes_busy": 0,
    "id": "Standard_D3" 
  },
  "size_Standard_D4": {
    "nodes_down": 0,
    "disk": 400,
    "name": "Standard_D4",
    "nodes_unpaired": 0,
    "nodes_idle": 0,
    "ram": 6650,
    "price": 1.12,
    "bandwidth": 0,
    "nodes_shutdown": 0,
    "nodes_booting": 0,
    "nodes_busy": 0,
    "id": "Standard_D4" 
  },
  "node_compute-sl0mohfw1i58uyx-zzzzz": {
    "state": [
      "unpaired",
      "closed",
      "boot wait",
      "idle exceeded" 
    ],
    "arvados": null,
    "id": "compute-sl0mohfw1i58uyx-zzzzz",
    "size": "Standard_D3" 
  },
  "servicetype": "arvados_nodemanager",
  "nodes_unpaired": 2,
  "nodes_max": 8,
  "hostname": "9294ee7bb1cf",
  "node_compute-a2x3cjc8fjzk8qz-zzzzz": null,
  "node_compute-jozp0dj520ovozp-zzzzz": {
    "state": [
      "idle",
      "closed",
      "boot wait",
      "idle exceeded" 
    ],
    "arvados": {
      "status": "running",
      "first_ping_at": "2017-06-08T14:26:14.962205000Z",
      "modified_by_client_uuid": "zzzzz-ozdt8-obw7foaks3qjyej",
      "domain": false,
      "owner_uuid": "zzzzz-tpzed-d9tiejq69daie8f",
      "etag": "1d7e4jeyvufmpnr6vr4inin5l",
      "slot_number": 1,
      "crunch_worker_state": "idle",
      "ip_address": null,
      "properties": {
        "cloud_node": {
          "price": 0.56,
          "size": "Standard_D3" 
        }
      },
      "info": {
        "ec2_instance_id": "compute-jozp0dj520ovozp-zzzzz" 
      },
      "kind": "arvados#node",
      "uuid": "zzzzz-7ekkf-jozp0dj520ovozp",
      "modified_by_user_uuid": "zzzzz-tpzed-d9tiejq69daie8f",
      "nameservers": [
        "192.168.1.1" 
      ],
      "created_at": "2017-06-08T14:26:14.926534000Z",
      "hostname": "compute1",
      "modified_at": "2017-06-08T14:26:15.021194000Z",
      "href": "/nodes/zzzzz-7ekkf-jozp0dj520ovozp",
      "last_ping_at": "2017-06-08T14:26:14.962205000Z",
      "job_uuid": null
    },
    "id": "compute-jozp0dj520ovozp-zzzzz",
    "size": "Standard_D3" 
  },
  "nodes_quota": 3,
  "version": "0.1.20170608142355",
  "node_compute-9qqojgiar0ezj2j-zzzzz": {
    "state": [
      "idle",
      "closed",
      "boot wait",
      "idle exceeded" 
    ],
    "arvados": {
      "status": "running",
      "first_ping_at": "2017-06-08T14:26:14.993218000Z",
      "modified_by_client_uuid": "zzzzz-ozdt8-obw7foaks3qjyej",
      "domain": false,
      "owner_uuid": "zzzzz-tpzed-d9tiejq69daie8f",
      "etag": "5qi8mjmofuq5f3bjj8h7kh20x",
      "slot_number": 2,
      "crunch_worker_state": "idle",
      "ip_address": "127.0.0.1",
      "properties": {
        "cloud_node": {
          "price": 0.56,
          "size": "Standard_D3" 
        }
      },
      "info": {
        "ec2_instance_id": "compute-9qqojgiar0ezj2j-zzzzz" 
      },
      "kind": "arvados#node",
      "uuid": "zzzzz-7ekkf-9qqojgiar0ezj2j",
      "modified_by_user_uuid": "zzzzz-tpzed-d9tiejq69daie8f",
      "nameservers": [
        "192.168.1.1" 
      ],
      "created_at": "2017-06-08T14:26:14.966885000Z",
      "hostname": "compute2",
      "modified_at": "2017-06-08T14:26:15.062881000Z",
      "href": "/nodes/zzzzz-7ekkf-9qqojgiar0ezj2j",
      "last_ping_at": "2017-06-08T14:26:14.993218000Z",
      "job_uuid": null
    },
    "id": "compute-9qqojgiar0ezj2j-zzzzz",
    "size": "Standard_D3" 
  },
  "nodes_busy": 0
}
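For monitoring purposes, a consumer of this endpoint mostly cares about the top-level `status` field and the `nodes_*` counters. Below is a minimal sketch of such a check, assuming only the key layout of the example above (the `SAMPLE` document and the `check_status` helper are illustrative, not part of node manager):

```python
import json

# Minimal sample with the same top-level shape as the status.json above.
SAMPLE = """
{
  "status": "OK",
  "nodes_wish": 4,
  "nodes_booting": 0,
  "nodes_unpaired": 2,
  "nodes_idle": 0,
  "nodes_busy": 0,
  "nodes_down": 0,
  "nodes_shutdown": 0,
  "nodes_max": 8,
  "nodes_quota": 3
}
"""

def check_status(doc):
    """Return (ok, message) for a parsed status.json document."""
    if doc.get("status") != "OK":
        return False, "service reports status=%s" % doc.get("status")
    # Alert when the wishlist exceeds what the quota allows us to start.
    if doc.get("nodes_wish", 0) > doc.get("nodes_quota", 0):
        return False, "wishlist (%d) exceeds node quota (%d)" % (
            doc["nodes_wish"], doc["nodes_quota"])
    return True, "ok"

ok, msg = check_status(json.loads(SAMPLE))
print(ok, msg)
```

With the sample values above (wishlist of 4 against a quota of 3) the check flags a problem.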

Subtasks

Task #11845: Review 11836-nodemanager-status-json (Closed, Nico César)


Related issues

Related to Arvados - Story #11349: [Node Manager] Add status URL for node manager (Resolved, 04/10/2017)

Related to Arvados - Story #12085: Add monitoring/alarm for failed/slow job dispatch & excess idle nodes (Resolved, 08/08/2017)

History

#1 Updated by Peter Amstutz over 3 years ago

  • Description updated (diff)

#2 Updated by Nico César over 3 years ago

Does this status.json query the API server to get the information in the "arvados" key, or is it something that is already stored in node manager?

We've talked about having 2 URLs (/status.json and /status-full.json): https://dev.arvados.org/projects/ops/wiki/Status_URL_for_all_services#Improvements

The idea behind that is to have minimal impact on the service for periodic status retrieval (every second or every few seconds, for example), and also to save storage on the Kibana side.

Maybe per-node detail isn't needed in status.json.

#3 Updated by Peter Amstutz over 3 years ago

Nico César wrote:

Does this status.json query the API server to get the information in the "arvados" key, or is it something that is already stored in node manager?

It is already stored in node manager.

We've talked about having 2 URLs (/status.json and /status-full.json): https://dev.arvados.org/projects/ops/wiki/Status_URL_for_all_services#Improvements

That makes sense.

The idea behind that is to have minimal impact on the service for periodic status retrieval (every second or every few seconds, for example), and also to save storage on the Kibana side.

For node manager, generating status.json just turns a cached dict into a JSON string; it doesn't wait on any other part of the program.
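That pattern can be sketched as follows. This is illustrative only (the class and method names are not node manager's actual internals): actors update a shared cache as state changes, and the HTTP handler only serializes what is already cached, so it never blocks on the rest of the daemon.

```python
import json
import threading

class StatusTracker(object):
    """Cache of status fields, updated incrementally by the daemon.

    A sketch of the pattern described above: writers call update()
    as state changes; the /status.json handler calls snapshot(),
    which is just a lock-protected copy plus json.dumps().
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._status = {"status": "OK"}

    def update(self, **kwargs):
        with self._lock:
            self._status.update(kwargs)

    def snapshot(self):
        # Cheap: no API calls, no waiting on other actors.
        with self._lock:
            return json.dumps(self._status)

tracker = StatusTracker()
tracker.update(nodes_idle=0, nodes_busy=2)
print(tracker.snapshot())
```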

Maybe per-node detail isn't needed in status.json.

Ok. However I think we do want these new fields in status.json: "status", "timestamp", "servicetype", "version", "nodes_max" and "nodes_quota".
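If the two-URL split is adopted, the lightweight /status.json view could be derived from the full document by dropping the per-node entries, which in the example above are the keys prefixed "node_". A sketch, assuming that key layout:

```python
def summary_view(full_status):
    """Drop per-node detail, keeping counters and per-size summaries.

    Assumes the key layout of the example status.json above:
    per-node entries are prefixed "node_", per-size summaries
    are prefixed "size_", and everything else is a scalar field.
    """
    return {k: v for k, v in full_status.items()
            if not k.startswith("node_")}

full = {
    "status": "OK",
    "nodes_idle": 1,
    "size_Standard_D3": {"price": 0.56},
    "node_compute-xyz": {"state": ["idle"]},
}
print(summary_view(full))
```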

#4 Updated by Peter Amstutz over 3 years ago

  • Assigned To set to Peter Amstutz

#5 Updated by Nico César over 3 years ago

I'm looking at 1227ea2b5795e34a75c62cb9eae91d46ef7cfb6a

One thing I notice is that we set updates['status'] = "OK", but if anything goes wrong we don't have an "ERROR" or "WARNING" that we can monitor. (Of course stack traces end up in the log, but I'm talking about monitoring, not post-mortem analysis.)

So one option is to add more try:...except blocks and reflect that something went wrong with the creation of the status page (WARNING), or that there is a major error we should address (ERROR).

Any fatal() or warning() call should update this on the status page too.

Adding a started_at field with the GMT time when the daemon was started would also be a good thing.

(added this to the doc)
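The escalation described above might look like the following sketch. The helper names and severity values are illustrative, not node manager's actual code; the key points are that severity only ratchets upward and that a failure while building the page degrades the status rather than crashing the handler.

```python
import json
import logging
import time

SEVERITY = {"OK": 0, "WARNING": 1, "ERROR": 2}

status = {
    "status": "OK",
    # GMT start time, as suggested above.
    "started_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}

def escalate(level, message):
    """Record a problem, never downgrading an existing severity."""
    if SEVERITY[level] > SEVERITY[status["status"]]:
        status["status"] = level
        status["message"] = message

def render_status_page():
    try:
        # ... gather node counts, size details, etc. ...
        pass
    except Exception as e:
        logging.exception("status page generation failed")
        escalate("WARNING", "status page incomplete: %s" % e)
    return json.dumps(status)
```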

#6 Updated by Peter Amstutz over 3 years ago

  • Target version changed from 2017-06-21 sprint to Arvados Future Sprints

#7 Updated by Tom Morris almost 3 years ago

  • Status changed from In Progress to Rejected

#8 Updated by Tom Morris over 2 years ago

  • Target version deleted (Arvados Future Sprints)
