Idea #11836 (closed): [Nodemanager] Improve status.json for monitoring
Description
11836-nodemanager-status-json
Example status.json
Lists node counts, information about node sizes, and individual node details.
{ "nodes_down": 0, "status": "OK", "node_compute-ah091aeky2404it-zzzzz": null, "timestamp": "2017-06-08T14:26:19Z", "nodes_idle": 0, "nodes_wish": 4, "nodes_shutdown": 0, "nodes_booting": 0, "size_Standard_D3": { "nodes_down": 0, "disk": 200, "name": "Standard_D3", "nodes_unpaired": 2, "nodes_idle": 0, "ram": 3325, "price": 0.56, "bandwidth": 0, "nodes_shutdown": 0, "nodes_booting": 0, "nodes_busy": 0, "id": "Standard_D3" }, "size_Standard_D4": { "nodes_down": 0, "disk": 400, "name": "Standard_D4", "nodes_unpaired": 0, "nodes_idle": 0, "ram": 6650, "price": 1.12, "bandwidth": 0, "nodes_shutdown": 0, "nodes_booting": 0, "nodes_busy": 0, "id": "Standard_D4" }, "node_compute-sl0mohfw1i58uyx-zzzzz": { "state": [ "unpaired", "closed", "boot wait", "idle exceeded" ], "arvados": null, "id": "compute-sl0mohfw1i58uyx-zzzzz", "size": "Standard_D3" }, "servicetype": "arvados_nodemanager", "nodes_unpaired": 2, "nodes_max": 8, "hostname": "9294ee7bb1cf", "node_compute-a2x3cjc8fjzk8qz-zzzzz": null, "node_compute-jozp0dj520ovozp-zzzzz": { "state": [ "idle", "closed", "boot wait", "idle exceeded" ], "arvados": { "status": "running", "first_ping_at": "2017-06-08T14:26:14.962205000Z", "modified_by_client_uuid": "zzzzz-ozdt8-obw7foaks3qjyej", "domain": false, "owner_uuid": "zzzzz-tpzed-d9tiejq69daie8f", "etag": "1d7e4jeyvufmpnr6vr4inin5l", "slot_number": 1, "crunch_worker_state": "idle", "ip_address": null, "properties": { "cloud_node": { "price": 0.56, "size": "Standard_D3" } }, "info": { "ec2_instance_id": "compute-jozp0dj520ovozp-zzzzz" }, "kind": "arvados#node", "uuid": "zzzzz-7ekkf-jozp0dj520ovozp", "modified_by_user_uuid": "zzzzz-tpzed-d9tiejq69daie8f", "nameservers": [ "192.168.1.1" ], "created_at": "2017-06-08T14:26:14.926534000Z", "hostname": "compute1", "modified_at": "2017-06-08T14:26:15.021194000Z", "href": "/nodes/zzzzz-7ekkf-jozp0dj520ovozp", "last_ping_at": "2017-06-08T14:26:14.962205000Z", "job_uuid": null }, "id": "compute-jozp0dj520ovozp-zzzzz", "size": 
"Standard_D3" }, "nodes_quota": 3, "version": "0.1.20170608142355", "node_compute-9qqojgiar0ezj2j-zzzzz": { "state": [ "idle", "closed", "boot wait", "idle exceeded" ], "arvados": { "status": "running", "first_ping_at": "2017-06-08T14:26:14.993218000Z", "modified_by_client_uuid": "zzzzz-ozdt8-obw7foaks3qjyej", "domain": false, "owner_uuid": "zzzzz-tpzed-d9tiejq69daie8f", "etag": "5qi8mjmofuq5f3bjj8h7kh20x", "slot_number": 2, "crunch_worker_state": "idle", "ip_address": "127.0.0.1", "properties": { "cloud_node": { "price": 0.56, "size": "Standard_D3" } }, "info": { "ec2_instance_id": "compute-9qqojgiar0ezj2j-zzzzz" }, "kind": "arvados#node", "uuid": "zzzzz-7ekkf-9qqojgiar0ezj2j", "modified_by_user_uuid": "zzzzz-tpzed-d9tiejq69daie8f", "nameservers": [ "192.168.1.1" ], "created_at": "2017-06-08T14:26:14.966885000Z", "hostname": "compute2", "modified_at": "2017-06-08T14:26:15.062881000Z", "href": "/nodes/zzzzz-7ekkf-9qqojgiar0ezj2j", "last_ping_at": "2017-06-08T14:26:14.993218000Z", "job_uuid": null }, "id": "compute-9qqojgiar0ezj2j-zzzzz", "size": "Standard_D3" }, "nodes_busy": 0 }
Updated by Nico César over 7 years ago
Does this status.json query the API server to get the information in the "arvados" key, or is it something that is already stored in node manager?
We've talked about having two URLs (/status.json and /status-full.json): https://dev.arvados.org/projects/ops/wiki/Status_URL_for_all_services#Improvements
The idea behind that is to have minimal impact on the service for periodic status retrieval (every second or every few seconds, for example), and also to save storage on the Kibana side.
Maybe per-node detail isn't needed in status.json.
Updated by Peter Amstutz over 7 years ago
Nico César wrote:
Does this status.json query the API server to get the information in the "arvados" key, or is it something that is already stored in node manager?
It is already stored in node manager.
We've talked about having two URLs (/status.json and /status-full.json): https://dev.arvados.org/projects/ops/wiki/Status_URL_for_all_services#Improvements
That makes sense.
The idea behind that is to have minimal impact on the service for periodic status retrieval (every second or every few seconds, for example), and also to save storage on the Kibana side.
For node manager, generating status.json just turns a cached dict into a JSON string; it doesn't wait on any other part of the program.
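A minimal sketch of that pattern, with hypothetical names (the real code lives in node manager's status module): actors record updates into a shared dict as events happen, and the status handler only serializes the latest snapshot, so serving the request never blocks on the rest of the daemon:

```python
import json
import threading

class Status:
    """Cached-status pattern: writers call update(); the HTTP
    handler calls snapshot() and returns immediately."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {"status": "OK"}

    def update(self, **fields):
        # Called by the daemon's actors whenever a counter changes.
        with self._lock:
            self._status.update(fields)

    def snapshot(self):
        # Serving status.json is just dumping whatever was last
        # recorded; no calls out to other components.
        with self._lock:
            return json.dumps(self._status)
```

A status request handler would simply write `tracker.snapshot()` to the response body.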
Maybe per-node detail isn't needed in status.json.
OK. However, I think we do want these new fields in status.json: "status", "timestamp", "servicetype", "version", "nodes_max", and "nodes_quota".
Updated by Nico César over 7 years ago
I'm looking at 1227ea2b5795e34a75c62cb9eae91d46ef7cfb6a
One thing I notice is that we set updates['status'] = "OK", but if anything goes wrong we don't have an "ERROR" or "WARNING" that we can monitor. (Of course stack traces end up in the log, but I'm talking about monitoring, not post-mortem analysis.)
So one option is to add more try:...except blocks and reflect that something went wrong while building the status page (WARNING), or that there is a major error we should address (ERROR).
Any fatal() or warning() should update this on the status page too.
Adding a started_at field with the GMT time when the daemon was started would also be good.
(added this to the doc)
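This suggestion could be sketched along these lines (hypothetical names, not the actual node manager code): exceptions caught around status-page work escalate the published "status" field, and an OK or WARNING report never masks an earlier ERROR:

```python
import traceback

class StatusTracker:
    """Track a monitorable status level alongside the status page."""
    LEVELS = {"OK": 0, "WARNING": 1, "ERROR": 2}

    def __init__(self):
        self.status = "OK"
        self.last_error = None

    def report(self, level, message=None):
        # Only escalate: a later WARNING never hides an ERROR.
        if self.LEVELS[level] > self.LEVELS[self.status]:
            self.status = level
            self.last_error = message

    def guard(self, func, *args, **kwargs):
        # Wrap a step in try/except; failures become ERROR on the
        # status page instead of only a stack trace in the log.
        try:
            return func(*args, **kwargs)
        except Exception:
            self.report("ERROR", traceback.format_exc())
            return None
```

Hooking fatal() and warning() to call report("ERROR", ...) and report("WARNING", ...) would cover the logging paths Nico mentions.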
Updated by Peter Amstutz over 7 years ago
- Target version changed from 2017-06-21 sprint to Arvados Future Sprints
Updated by Tom Morris over 6 years ago
- Status changed from In Progress to Rejected
Updated by Tom Morris over 6 years ago
- Target version deleted (Arvados Future Sprints)