Project

General

Profile

Actions

Feature #18794

closed

cluster health check fails if some services are using different configs

Added by Tom Clegg 4 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
-
Target version:
Start date:
05/06/2022
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

Overall health check returned by health aggregator returns “error” state unless the /etc/arvados/config.yml file is identical on all cluster hosts and all services are using that version
  • Error message is “some services have mismatched configuration: serviceA, serviceB config file hash does not match latest version with timestamp X”
  • Whether or not the health aggregator service is installed/configured, `arvados-server health` (when run on a cluster node) returns the same result the health aggregator would return, as text on stdout and a non-zero exit code to indicate fail.
Implementation:
  • `arvados-server check` (new subcommand) loads config file, fetches all the individual services’ health checks just as the health aggregator service would, fetches the config metrics and compares hashes, and displays the result on stdout
    • if all hashes match, config is OK, regardless of timestamps
    • if some hashes don’t match, the most recent timestamp is assumed to be the real/intended config
  • `arvados-server config-dump` output includes the content hash and timestamp of the source yaml file it loaded (timestamp is now if source file is a pipe)
  • all services gain two “config” metrics:
    • the hash and timestamp of the config file the current config was loaded from
    • the hash and timestamp of the current config file on disk
  • services also gain a "restarted at" or "config loaded at" timestamp metric with the last time the service restarted and/or reloaded its config
  • RailsAPI and Workbench1: the only practical way to know which config versions are in use by various passenger workers is to implement auto-reload:
    • immediately before loading config, and periodically in a background job, if tmp/restart.txt is older than the config file, touch restart.txt using the config file timestamp
    • health check endpoint reports an error if the tmp/restart.txt file is older than the config file
    • config version endpoint handler does not try to figure out whether all passenger workers use the same config as the one handling the request – if there’s a mismatch, assume it will be flagged by the health check endpoint

Note this does not address discrepancies in Nginx configuration in use / on disk, since it is maintained manually.

TBD: It is possible for a host to have an outdated config file on disk, even though all services on that host are using the same correct/latest config as other hosts (e.g., after an operator mistakenly copies an old config file to the host). How should this be reported so the operator can make sense of it – “some services have out-of-date configuration: serviceA (config file on disk); serviceB (config file on disk)”?


Subtasks 1 (0 open1 closed)

Task #18952: Review 18794-config-healthResolvedWard Vandewege05/06/2022

Actions

Related issues

Related to Arvados - Feature #18768: Design for ability to check what config is in use across the clusterResolvedTom Clegg

Actions
Related to Arvados Epics - Story #18727: Avoid configuration skew between different services and hostsResolved03/01/202205/31/2022

Actions
Actions #1

Updated by Tom Clegg 4 months ago

  • Related to Feature #18768: Design for ability to check what config is in use across the cluster added
Actions #2

Updated by Tom Clegg 4 months ago

  • Description updated (diff)
Actions #3

Updated by Tom Clegg 4 months ago

  • Description updated (diff)
Actions #4

Updated by Peter Amstutz 4 months ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz 3 months ago

  • Related to Story #18727: Avoid configuration skew between different services and hosts added
Actions #6

Updated by Peter Amstutz 3 months ago

  • Target version set to 2022-04-13 Sprint
  • Assigned To set to Tom Clegg
Actions #7

Updated by Tom Clegg 3 months ago

  • Status changed from New to In Progress
  • Description updated (diff)
Actions #8

Updated by Tom Clegg 3 months ago

18794-config-health @ 541c300ca6abb15449e917f648ae5ffd68087ff9 -- developer-run-tests: #3025
  • add arvados-server check
18794-config-health @ 7f439b0bfe215b3641c054a73677ba118023f596 -- developer-run-tests: #3028
  • add SourceTimestamp config field
Actions #10

Updated by Tom Clegg 3 months ago

  • Target version changed from 2022-04-13 Sprint to 2022-04-27 Sprint
Actions #11

Updated by Tom Clegg 3 months ago

Currently `arvados-server check` prints a hunk of yaml, and exits 0 for OK. Suggestions for something more readable?

There are some pre-existing checks that don't have a human-readable summary, like "you don't have any keepstores configured". We probably need to resolve that before we can recommend using `arvados-server check`.

Maybe review/merge the "check config mismatches" first, then keep issue open to bring `arvados-server check` up to snuff, document it, etc.?

Either way, still todo:
  • auto-reload RailsAPI and Workbench1 as described above
  • add metrics to Workbench1, similar to RailsAPI changes here
Actions #12

Updated by Peter Amstutz 2 months ago

Do we have a way to query the Arvados version of all the services? I think that would be in scope for this feature.

Actions #13

Updated by Tom Clegg 2 months ago

Peter Amstutz wrote:

Do we have a way to query the Arvados version of all the services? I think that would be in scope for this feature.

Easy to add that as a metric on the Go services. Not sure what we should do with the Rails apps and Workbench2.

Actions #14

Updated by Tom Clegg 2 months ago

18794-config-health @ 1a3a8b0a7cc31e9bfb63c6001bc049a831e13dd1 -- developer-run-tests: #3075
  • RailsAPI auto-reloads config when config file changes (without checking whether the updated file is loadable)
  • New version metric:
    arvados_version_running{version="d9f70593c6a460abbb93514ac202724391148d0e-dev (go1.17.7)"} 1
    (not sure about the naming -- suggestions?)
Actions #15

Updated by Peter Amstutz 2 months ago

For rails apps

  • report version that is installed (read package version number at startup time)
Actions #16

Updated by Peter Amstutz 2 months ago

  • Target version changed from 2022-04-27 Sprint to 2022-05-11 sprint
Actions #17

Updated by Tom Clegg 2 months ago

18794-config-health @ 35bb2d98244d40106c0493385f90cdd198c91447 -- developer-run-tests: #3078

The new railsRestartSuite.TestConfigReload test in lib/controller failed here, but so far works reliably on my machine. Haven't yet figured out whether it's flaky, or relying on something special about my dev box. Re-running at developer-run-tests-remainder: #3222

Actions #18

Updated by Tom Clegg about 2 months ago

18794-config-health @ fc4f842bfa62b4ab5e6a9d233eb88cc78758f376 -- developer-run-tests: #3093
retry wb1 developer-run-tests-apps-workbench-integration: #3307

example /metrics response from railsapi:

# HELP arvados_config_load_timestamp_seconds Time when config file was loaded.
# TYPE arvados_config_load_timestamp_seconds gauge
arvados_config_load_timestamp_seconds{sha256="d78120560ab0e9edcba75ff7a6e32cbc11ecce6867ddd3a0c29e2897888b5908"} 1651764701.407278
# HELP arvados_config_source_timestamp_seconds Timestamp of config file when it was loaded.
# TYPE arvados_config_source_timestamp_seconds gauge
arvados_config_source_timestamp_seconds{sha256="d78120560ab0e9edcba75ff7a6e32cbc11ecce6867ddd3a0c29e2897888b5908"} 1651764473.1509392
# HELP arvados_version_running Indicated version is running.
# TYPE arvados_version_running gauge
arvados_version_running{version="2.5.0~dev20220502191252"} 1

(Since note-11) arvados-server check now silently exits 0 if there are no errors to report. There is a -yaml option to output the detailed report (this could be useful for support requests).

Actions #19

Updated by Tom Clegg about 2 months ago

18794-config-health @ 00542ded608212e8245429001f1a0a0a736f2e71 -- developer-run-tests: #3095

Fixes a problem noticed while testing workbench-auto-reload on 9tee4 (if restart.txt exists and is owned by root, we can't change its modtime, we need to remove it and write a new one).

Actions #20

Updated by Ward Vandewege about 2 months ago

Tom Clegg wrote:

18794-config-health @ 00542ded608212e8245429001f1a0a0a736f2e71 -- developer-run-tests: #3095

Fixes a problem noticed while testing workbench-auto-reload on 9tee4 (if restart.txt exists and is owned by root, we can't change its modtime, we need to remove it and write a new one).

Just one small nit:

In sdk/go/health/aggregator_test.go, func (s *AggregatorSuite) TestIsLocalHost(c *check.C) {, this line seems a little dubious, since it introduces a dns resolution dependency on the example.com domain, and the test could also break if they ever change example.com to resolve to some form of a localhost IP. Is this check something that should be in our test suite? If you want to test for a public IP, you could just use any arbitrary non-localhost IP, right?

+       c.Check(isLocalHost("example.com"), check.Equals, false)

Otherwise, LGTM!

Actions #21

Updated by Tom Clegg about 2 months ago

Ward Vandewege wrote:

In sdk/go/health/aggregator_test.go, func (s *AggregatorSuite) TestIsLocalHost(c *check.C) {, this line seems a little dubious, since it introduces a dns resolution dependency on the example.com domain, and the test could also break if they ever change example.com to resolve to some form of a localhost IP. Is this check something that should be in our test suite? If you want to test for a public IP, you could just use any arbitrary non-localhost IP, right?

This function doesn't actually resolve names at all -- in this case it returns false because "example.com" is neither "localhost" nor a valid IP address.

In context, this means if your config has internalURLs like "http://localhost.example.com/" and that name resolves to a loopback address, the health checker will not recognize it as a loopback address, so if it can't connect, it will indicate an unhealthy/down/unreachable service instead of passing on the assumption that it refers to a loopback address on a different host.

Actions #24

Updated by Ward Vandewege about 2 months ago

Tom Clegg wrote:

Ward Vandewege wrote:

In sdk/go/health/aggregator_test.go, func (s *AggregatorSuite) TestIsLocalHost(c *check.C) {, this line seems a little dubious, since it introduces a dns resolution dependency on the example.com domain, and the test could also break if they ever change example.com to resolve to some form of a localhost IP. Is this check something that should be in our test suite? If you want to test for a public IP, you could just use any arbitrary non-localhost IP, right?

This function doesn't actually resolve names at all -- in this case it returns false because "example.com" is neither "localhost" nor a valid IP address.

In context, this means if your config has internalURLs like "http://localhost.example.com/" and that name resolves to a loopback address, the health checker will not recognize it as a loopback address, so if it can't connect, it will indicate an unhealthy/down/unreachable service instead of passing on the assumption that it refers to a loopback address on a different host.

Ah, of course, thanks for explaining.

18794-config-health @ 766d15a572d9e8a023e2877f9171422af530efa1 LGTM.

Actions #25

Updated by Tom Clegg about 2 months ago

  • Status changed from In Progress to Resolved
Actions

Also available in: Atom PDF