Project

General

Profile

Actions

Feature #19377

closed

[diagnostics] show health-check response

Added by Tom Clegg 6 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
10/05/2022
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-
Release relationship:
Auto

Description

arvados-client sudo diagnostics should perform a cluster-wide health check. This should work even if the "health aggregator" service is not running.

(Note we can't do this in the non-sudo case because arvados-client won't have the management token needed to query health-check endpoints. In general, arvados-client diagnostics doesn't even run on the cluster's internal network, so it can't necessarily connect to all internal services.)

Add text to install guide about how to use the diagnostics and health check to establish confidence that a cluster is correctly installed & working.

Do this assuming correct information from the config file. We suspect some clusters may not have usable InternalURL values, in these cases we will fix the configuration and/or the installer.


Subtasks 2 (0 open2 closed)

Task #19576: Review 19377-diagnostics-health-checkResolvedPeter Amstutz10/05/2022

Actions
Task #19621: Review 19377-verboseResolvedPeter Amstutz10/12/2022

Actions

Related issues

Related to Arvados Epics - Story #18685: Synchronize configuration on multi-node clusterNew10/01/202203/31/2023

Actions
Related to Arvados - Story #19364: Document use of diagnostics & health check to check running versions, config file matching, & overall cluster functioningResolvedTom Clegg10/12/2022

Actions
Actions #1

Updated by Peter Amstutz 6 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 6 months ago

  • Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
Actions #3

Updated by Peter Amstutz 5 months ago

  • Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Actions #4

Updated by Peter Amstutz 5 months ago

  • Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Actions #5

Updated by Peter Amstutz 5 months ago

  • Target version changed from 2022-10-12 sprint to 2022-09-28 sprint
Actions #6

Updated by Peter Amstutz 5 months ago

  • Description updated (diff)
Actions #7

Updated by Peter Amstutz 5 months ago

  • Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Actions #8

Updated by Peter Amstutz 4 months ago

  • Related to Story #18685: Synchronize configuration on multi-node cluster added
Actions #9

Updated by Peter Amstutz 4 months ago

  • Assigned To set to Tom Clegg
Actions #10

Updated by Tom Clegg 4 months ago

  • Status changed from New to In Progress
Actions #11

Updated by Tom Clegg 4 months ago

19377-diagnostics-health-check @ fe5fc284f0dd1cdd0c01922a98bcf75b4186e80a -- developer-run-tests: #3309

(wb1 failure is surely not related)

Actions #12

Updated by Tom Clegg 4 months ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved
Actions #13

Updated by Peter Amstutz 4 months ago

  • Status changed from Resolved to Feedback
Actions #14

Updated by Peter Amstutz 4 months ago

Comments about arvados-server check, which of course also applies to arvados-client diagnostics.

So arvados-server check doesn't print anything by default. This doesn't inspire confidence, it just left me wondering if anything happened or if I did something wrong.

Diagnostics is only slightly better:

INFO       5: running health check (same as `arvados-server check`) 
INFO      ... health check: reported clock skew 0s 

A message like "all nodes report status OK with version X and with config hash Y, pass -yaml for more information" would be a big improvement. Then have a "-quiet" option that really suppresses all non-error output.

Start time or uptime for each service would also be a handy piece of information to return. I don't know if that's in scope for this story.

Is there a useful distinction between "Checks" and "Services" here? Also I have literally no idea what "N" means:

Checks:
  arvados-api-server+http://localhost:8004/_health/ping:
    ClockTime: "2022-10-06T15:17:20Z" 
    ConfigSourceSHA256: 241ebaba8e8336f07978c13dbd9bce185ab4c467a7c306088fbc588131976840
    ConfigSourceTimestamp: "2022-06-07T11:10:27.69153Z" 
    HTTPStatusCode: 200
    Health: OK
    Response:
      health: OK
    ResponseTime: 0.013167
    Version: 2.5.0~dev20221005202835
Services:
  arvados-api-server:
    Health: OK
    "N": 1
Actions #15

Updated by Tom Clegg 4 months ago

  • Status changed from Feedback to In Progress

Added a "health check OK" message, and a -quiet flag to disable it.

19377-diagnostics-health-check @ c1af8a9ad9d5d8fc0e6bf49ad2bcc9119c85bfdc -- developer-run-tests: #3312 (wb1 fail)

I think we could use a human readable cluster info/status report somewhere in between "OK" and "way too much detail but not much useful information.yaml".

I think that deserves some thought, because it doesn't seem like a great fit for either health checks (which are intended for monitoring/alerting and load balancing, hence primarily machine-readable) or diagnostics (which are intended to exercise the cluster and detect specific problems).

Actions #16

Updated by Peter Amstutz 4 months ago

  • Related to Story #19364: Document use of diagnostics & health check to check running versions, config file matching, & overall cluster functioning added
Actions #17

Updated by Tom Clegg 4 months ago

We discussed & agreed "arvados-client diagnostics" should have a -v verbose mode that prints lots of additional information even when diagnostic tests are passing, so diagnostics can serve as a single report with both (a) assurance that [some] things are working, and (b) cluster details that are likely to help troubleshooting problems that aren't specifically checked by diagnostics. Possible examples:
  • Arvados version
  • Nginx version
  • config file timestamp/checksum
  • hostname of machine where arvados-client diagnostics is running
Actions #18

Updated by Peter Amstutz 4 months ago

  • Target version changed from 2022-10-12 sprint to 2022-10-26 sprint
Actions #19

Updated by Peter Amstutz 4 months ago

Tom Clegg wrote in #note-15:

Added a "health check OK" message, and a -quiet flag to disable it.

19377-diagnostics-health-check @ c1af8a9ad9d5d8fc0e6bf49ad2bcc9119c85bfdc -- developer-run-tests: #3312 (wb1 fail)

I think we could use a human readable cluster info/status report somewhere in between "OK" and "way too much detail but not much useful information.yaml".

I think that deserves some thought, because it doesn't seem like a great fit for either health checks (which are intended for monitoring/alerting and load balancing, hence primarily machine-readable) or diagnostics (which are intended to exercise the cluster and detect specific problems).

This LGTM.

Actions #20

Updated by Tom Clegg 4 months ago

19377-verbose @ 58afa4202a5ff084fe7eee1a5274e02071858ba5 -- developer-run-tests: #3318

Added version reporting, and some info that we previously showed when using "debug" log level.

(we have "loglevel=debug" and "verbose" as distinct features -- "debug" shows more detail about how the diagnostic checks are succeeding/failing, while "verbose" shows additional information about the state of the cluster -- does this seem right?)

example

api.ce8i5:~# ./arvados-client sudo diagnostics -v
INFO      ... hostname = api.ce8i5.arvadosapi.com
INFO       5: running health check (same as `arvados-server check`)
INFO      ... reported clock skew = 0s
INFO      ... arvados version = 2.5.0~dev20221013155415
INFO      ... http frontend version = nginx/1.14.0 + Phusion Passenger(R) 6.0.15
INFO      ... http frontend version = nginx/1.14.0 (Ubuntu)
INFO      ... config file sha256 = 241ebaba8e8336f07978c13dbd9bce185ab4c467a7c306088fbc588131976840
INFO      10: getting discovery document from https://ce8i5.arvadosapi.com/discovery/v1/apis/arvados/v1/rest
INFO      ... BlobSignatureTTL = 1209600
INFO      20: getting exported config from https://ce8i5.arvadosapi.com/arvados/v1/config
INFO      ... Collections.BlobSigning = true
INFO      30: getting current user record
INFO      ... user uuid = ce8i5-tpzed-000000000000000
INFO      40: connecting to service endpoint https://keep.ce8i5.arvadosapi.com/
INFO      41: connecting to service endpoint https://*.collections.ce8i5.arvadosapi.com/
...
Actions #21

Updated by Peter Amstutz 3 months ago

Ok, this is a bit of a weird case. I ran "arvados-client diagnostics" from inside an arvbox instance, but set ARVADOS_API_HOST to point at ce8i5. It turns out that healthcheck reads /etc/arvados/config.yml so it contacted the arvbox instance for the healthcheck but then ran the rest of the diagnostics against ce8i5 as intended.

Could we add a basic sanity check that /etc/arvados/config.yml and ARVADOS_API_HOST are talking about the same cluster, maybe by comparing the ExternalURL for controller?

arvbox@ad53c3218ece:/var/lib/gopath/bin$ ./arvados-client diagnostics -external-client -v
INFO      ... hostname = ad53c3218ece                
INFO       5: running health check (same as `arvados-server check`) 
INFO      ... skipping because provided token is not SystemRootToken 
ERROR   health check: keep-balance: MISSING: no InternalURLs configured 
ERROR   health check: version mismatch: arvados-api-server+http://localhost:8004 is running 2.4.2~dev20220803202846 -- expected dev (go1.17.7) 
ERROR   health check: version mismatch: arvados-workbench1+https://172.17.0.2:443 is running 2.3.3~dev20220214191519 -- expected dev (go1.17.7) 
INFO      ... reported clock skew = 0s               
INFO      ... arvados version = dev                  
INFO      ... arvados version = 2.4.2~dev20220803202846 
INFO      ... arvados version = 2.3.3~dev20220214191519 
INFO      ... http frontend version = nginx/1.15.8 + Phusion Passenger 6.0.2 
INFO      ... http frontend version = nginx/1.14.2   
INFO      ... config file sha256 = 9dd86a9f3c90086e5328f3103f34bcf8f13eee96deb529d7a1ab9dd370954427 
INFO      10: getting discovery document from https://pirca.arvadosapi.com/discovery/v1/apis/arvados/v1/rest 
INFO      ... BlobSignatureTTL = 1209600             

Also, when health check reported an error, it should mention that you can run "arvados-server check -yaml" to get the comprehensive report for each node.

Actions #22

Updated by Tom Clegg 3 months ago

Peter Amstutz wrote in #note-21:

Could we add a basic sanity check that /etc/arvados/config.yml and ARVADOS_API_HOST are talking about the same cluster, maybe by comparing the ExternalURL for controller?

Oops. I fixed the existing sanity check that nearly worked there -- it said "skipping" but then went on to do the health check anyway. Now it says "skipping" and actually skips.

(My first thought was also to check the controller URL but -- unlike the token -- host:port in a URL can be equivalent without being equal, so I thought checking the token was the easiest way to check whether we're really in the "arvados-client sudo" situation vs. an unrelated config file just happens to be readable.)

Also, when health check reported an error, it should mention that you can run "arvados-server check -yaml" to get the comprehensive report for each node.

Added.

19377-verbose @ 14f6625379992bc3ee054ad419095b476a1c4284 -- developer-run-tests: #3325

Actions #23

Updated by Peter Amstutz 3 months ago

Tom Clegg wrote in #note-22:

Oops. I fixed the existing sanity check that nearly worked there -- it said "skipping" but then went on to do the health check anyway. Now it says "skipping" and actually skips.

(My first thought was also to check the controller URL but -- unlike the token -- host:port in a URL can be equivalent without being equal, so I thought checking the token was the easiest way to check whether we're really in the "arvados-client sudo" situation vs. an unrelated config file just happens to be readable.)

So it does skip it now, but the error message still doesn't really explain what happened. It should say something like "the system root token that was read from /etc/arvados/config.yml doesn't match the token in $ARVADOS_API_TOKEN".

Actually I'm playing with it some more and I'm more confused,

Also, when health check reported an error, it should mention that you can run "arvados-server check -yaml" to get the comprehensive report for each node.

Added.

This is logged at "INFO" level, which means it doesn't show up in the error summary.

INFO      ... consider running `arvados-server check -yaml` for a comprehensive report 

...30 more lines of stuff...

--- cut here --- error summary ---

ERROR   health check: keep-balance: MISSING: no InternalURLs configured 
ERROR   health check: version mismatch: arvados-workbench1+https://172.17.0.3:443 is running 2.3.3~dev20220214191519 -- expected dev (go1.17.7) 
ERROR   health check: version mismatch: arvados-api-server+http://localhost:8004 is running 2.4.2~dev20220803202846 -- expected dev (go1.17.7) 
ERROR    160: running a container (8210 ms): container request x94bf-xvhdp-7blvxox8522h2zk is final but container x94bf-dz642-igejnw1p2dzjb9o did not complete: container state = "Cancelled" 
$ 

Also, I had a thought that -json output option would be handy to be able to run it through jq.

Actions #24

Updated by Tom Clegg 3 months ago

So it does skip it now, but the error message still doesn't really explain what happened. It should say something like "the system root token that was read from /etc/arvados/config.yml doesn't match the token in $ARVADOS_API_TOKEN".

Changed to

skipping because SystemRootToken read from /etc/arvados/config.yml does not match $ARVADOS_API_TOKEN (consider using 'arvados-client sudo diagnostics' to load endpoint and token from config file instead of environment)

Actually I'm playing with it some more and I'm more confused,

?

This is logged at "INFO" level, which means it doesn't show up in the error summary.

Right, "consider using x to get more detail" is not an error.

Repeating the errors at the end is meant to (a) ensure that, if any errors/problems are detected, the user's terminal window is clearly showing errors when the command finishes, even if the terminal is 80x25 and lots of other non-error messages have been shown and (b) offer a concise checklist of problems to fix, without by-the-way logging noise. I think repeating non-error informational messages too would tend to defeat both purposes.

19377-verbose @ bef451193ffe6c7239c545ed1c93769b50c9e0a8 -- developer-run-tests: #3336

Actions #25

Updated by Peter Amstutz 3 months ago

  • Target version changed from 2022-10-26 sprint to 2022-11-09 sprint
Actions #26

Updated by Peter Amstutz 3 months ago

Tom Clegg wrote in #note-24:

Actually I'm playing with it some more and I'm more confused,

?

I think I started writing that, then I figured out what happened and wrote the note about improving the error message, and forgot to delete the original text. Whoops.

This is logged at "INFO" level, which means it doesn't show up in the error summary.

Right, "consider using x to get more detail" is not an error.

Repeating the errors at the end is meant to (a) ensure that, if any errors/problems are detected, the user's terminal window is clearly showing errors when the command finishes, even if the terminal is 80x25 and lots of other non-error messages have been shown and (b) offer a concise checklist of problems to fix, without by-the-way logging noise. I think repeating non-error informational messages too would tend to defeat both purposes.

19377-verbose @ bef451193ffe6c7239c545ed1c93769b50c9e0a8 -- developer-run-tests: #3336

My concern is that because it isn't repeated, most people will never see it, because why would you scroll up to look at the same error messages that you already have at the bottom. So right at the point in time you most need help debugging, the option most likely to help you isn't being presented front and center.

On the same theme, it might be better to have "skipping because SystemRootToken" be logged as an error, because it is almost certainly a user error and having it say "skipping" at the info level gives a false sense that things are fine when actually it has no idea if things are fine or not.

However if you're unconvinced don't want to change it, go ahead and merge.

Actions #27

Updated by Tom Clegg 3 months ago

Peter Amstutz wrote in #note-26:

So right at the point in time you most need help debugging, the option most likely to help you isn't being presented front and center.

Hm, I guess I'm just not seeing the situation where the full health-check yaml output is especially helpful. Is there something we could add to the error messages instead of advising the user to look for clues in a yaml doc?

On the same theme, it might be better to have "skipping because SystemRootToken" be logged as an error, because it is almost certainly a user error and having it say "skipping" at the info level gives a false sense that things are fine when actually it has no idea if things are fine or not.

I like that option. I've made this an error, with advice about how to get rid of it one way or the other.

19377-verbose @ d5df19fdff62724f9faeb3bee17201363071bf9e -- developer-run-tests: #3344

Actions #28

Updated by Tom Clegg 3 months ago

  • Status changed from In Progress to Resolved
Actions #29

Updated by Peter Amstutz about 2 months ago

  • Release set to 47
Actions

Also available in: Atom PDF