Feature #18794
closedcluster health check fails if some services are using different configs
Description
- Error message is “some services have mismatched configuration: serviceA, serviceB config file hash does not match latest version with timestamp X”
- Whether or not the health aggregator service is installed/configured, `arvados-server health` (when run on a cluster node) returns the same result the health aggregator would return, as text on stdout and a non-zero exit code to indicate fail.
- `arvados-server check` (new subcommand) loads config file, fetches all the individual services’ health checks just as the health aggregator service would, fetches the config metrics and compares hashes, and displays the result on stdout
- if all hashes match, config is OK, regardless of timestamps
- if some hashes don’t match, the most recent timestamp is assumed to be the real/intended config
- `arvados-server config-dump` output includes the content hash and timestamp of the source yaml file it loaded (timestamp is now if source file is a pipe)
- all services gain two “config” metrics:
- the hash and timestamp of the config file the current config was loaded from
- the hash and timestamp of the current config file on disk
- services also gain a "restarted at" or "config loaded at" timestamp metric with the last time the service restarted and/or reloaded its config
- RailsAPI and Workbench1: the only practical way to know which config versions are in use by various passenger workers is to implement auto-reload:
- immediately before loading config, and periodically in a background job, if tmp/restart.txt is older than the config file, touch restart.txt using the config file timestamp
- health check endpoint reports an error if the tmp/restart.txt file is older than the config file
- config version endpoint handler does not try to figure out whether all passenger workers use the same config as the one handling the request – if there’s a mismatch, assume it will be flagged by the health check endpoint
Note this does not address discrepancies in Nginx configuration in use / on disk, since it is maintained manually.
TBD: It is possible for a host to have an outdated config file on disk, even though all services on that host are using the same correct/latest config as other hosts (e.g., after an operator mistakenly copies an old config file to the host). How should this be reported so the operator can make sense of it – “some services have out-of-date configuration: serviceA (config file on disk); serviceB (config file on disk)”?
Related issues
Updated by Tom Clegg over 2 years ago
- Related to Feature #18768: Design for ability to check what config is in use across the cluster added
Updated by Peter Amstutz over 2 years ago
- Related to Idea #18727: Avoid configuration skew between different services and hosts added
Updated by Peter Amstutz over 2 years ago
- Target version set to 2022-04-13 Sprint
- Assigned To set to Tom Clegg
Updated by Tom Clegg over 2 years ago
- Status changed from New to In Progress
- Description updated (diff)
Updated by Tom Clegg over 2 years ago
- add arvados-server check
- add SourceTimestamp config field
Updated by Tom Clegg over 2 years ago
rebased
18794-config-health @ 90c7eb768f0550daead4a91f2381d042502487c4 -- developer-run-tests: #3030
Updated by Tom Clegg over 2 years ago
- Target version changed from 2022-04-13 Sprint to 2022-04-27 Sprint
Updated by Tom Clegg over 2 years ago
Currently `arvados-server check` prints a hunk of yaml, and exits 0 for OK. Suggestions for something more readable?
There are some pre-existing checks that don't have a human-readable summary, like "you don't have any keepstores configured". We probably need to resolve that before we can recommend using `arvados-server check`.
Maybe review/merge the "check config mismatches" first, then keep issue open to bring `arvados-server check` up to snuff, document it, etc.?
Either way, still todo:- auto-reload RailsAPI and Workbench1 as described above
- add metrics to Workbench1, similar to RailsAPI changes here
Updated by Peter Amstutz over 2 years ago
Do we have a way to query the Arvados version of all the services? I think that would be in scope for this feature.
Updated by Tom Clegg over 2 years ago
Peter Amstutz wrote:
Do we have a way to query the Arvados version of all the services? I think that would be in scope for this feature.
Easy to add that as a metric on the Go services. Not sure what we should do with the Rails apps and Workbench2.
Updated by Tom Clegg over 2 years ago
- RailsAPI auto-reloads config when config file changes (without checking whether the updated file is loadable)
- New version metric:
arvados_version_running{version="d9f70593c6a460abbb93514ac202724391148d0e-dev (go1.17.7)"} 1
(not sure about the naming -- suggestions?)
Updated by Peter Amstutz over 2 years ago
For rails apps
- report version that is installed (read package version number at startup time)
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-04-27 Sprint to 2022-05-11 sprint
Updated by Tom Clegg over 2 years ago
18794-config-health @ 35bb2d98244d40106c0493385f90cdd198c91447 -- developer-run-tests: #3078
The new railsRestartSuite.TestConfigReload test in lib/controller failed here, but so far works reliably on my machine. Haven't yet figured out whether it's flaky, or relying on something special about my dev box. Re-running at developer-run-tests-remainder: #3222
Updated by Tom Clegg over 2 years ago
18794-config-health @ fc4f842bfa62b4ab5e6a9d233eb88cc78758f376 -- developer-run-tests: #3093
retry wb1 developer-run-tests-apps-workbench-integration: #3307
example /metrics response from railsapi:
# HELP arvados_config_load_timestamp_seconds Time when config file was loaded. # TYPE arvados_config_load_timestamp_seconds gauge arvados_config_load_timestamp_seconds{sha256="d78120560ab0e9edcba75ff7a6e32cbc11ecce6867ddd3a0c29e2897888b5908"} 1651764701.407278 # HELP arvados_config_source_timestamp_seconds Timestamp of config file when it was loaded. # TYPE arvados_config_source_timestamp_seconds gauge arvados_config_source_timestamp_seconds{sha256="d78120560ab0e9edcba75ff7a6e32cbc11ecce6867ddd3a0c29e2897888b5908"} 1651764473.1509392 # HELP arvados_version_running Indicated version is running. # TYPE arvados_version_running gauge arvados_version_running{version="2.5.0~dev20220502191252"} 1
(Since note-11) arvados-server check
now silently exits 0 if there are no errors to report. There is a -yaml
option to output the detailed report (this could be useful for support requests).
Updated by Tom Clegg over 2 years ago
18794-config-health @ 00542ded608212e8245429001f1a0a0a736f2e71 -- developer-run-tests: #3095
Fixes a problem noticed while testing workbench-auto-reload on 9tee4 (if restart.txt exists and is owned by root, we can't change its modtime, we need to remove it and write a new one).
Updated by Ward Vandewege over 2 years ago
Tom Clegg wrote:
18794-config-health @ 00542ded608212e8245429001f1a0a0a736f2e71 -- developer-run-tests: #3095
Fixes a problem noticed while testing workbench-auto-reload on 9tee4 (if restart.txt exists and is owned by root, we can't change its modtime, we need to remove it and write a new one).
Just one small nit:
In sdk/go/health/aggregator_test.go
, func (s *AggregatorSuite) TestIsLocalHost(c *check.C) {
, this line seems a little dubious, since it introduces a dns resolution dependency on the example.com domain, and the test could also break if they ever change example.com
to resolve to some form of a localhost IP. Is this check something that should be in our test suite? If you want to test for a public IP, you could just use any arbitrary non-localhost IP, right?
+ c.Check(isLocalHost("example.com"), check.Equals, false)
Otherwise, LGTM!
Updated by Tom Clegg over 2 years ago
Ward Vandewege wrote:
In
sdk/go/health/aggregator_test.go
,func (s *AggregatorSuite) TestIsLocalHost(c *check.C) {
, this line seems a little dubious, since it introduces a dns resolution dependency on the example.com domain, and the test could also break if they ever changeexample.com
to resolve to some form of a localhost IP. Is this check something that should be in our test suite? If you want to test for a public IP, you could just use any arbitrary non-localhost IP, right?
This function doesn't actually resolve names at all -- in this case it returns false because "example.com" is neither "localhost" nor a valid IP address.
In context, this means if your config has internalURLs like "http://localhost.example.com/" and that name resolves to a loopback address, the health checker will not recognize it as a loopback address, so if it can't connect, it will indicate an unhealthy/down/unreachable service instead of passing on the assumption that it refers to a loopback address on a different host.
Updated by Tom Clegg over 2 years ago
18794-config-health @ 0dab89df8040f203a33bc1922df0ff893791def7 -- developer-run-tests: #3099
(merged main)
Updated by Tom Clegg over 2 years ago
18794-config-health @ 766d15a572d9e8a023e2877f9171422af530efa1 -- developer-run-tests: #3101
Updated by Ward Vandewege over 2 years ago
Tom Clegg wrote:
Ward Vandewege wrote:
In
sdk/go/health/aggregator_test.go
,func (s *AggregatorSuite) TestIsLocalHost(c *check.C) {
, this line seems a little dubious, since it introduces a dns resolution dependency on the example.com domain, and the test could also break if they ever changeexample.com
to resolve to some form of a localhost IP. Is this check something that should be in our test suite? If you want to test for a public IP, you could just use any arbitrary non-localhost IP, right?This function doesn't actually resolve names at all -- in this case it returns false because "example.com" is neither "localhost" nor a valid IP address.
In context, this means if your config has internalURLs like "http://localhost.example.com/" and that name resolves to a loopback address, the health checker will not recognize it as a loopback address, so if it can't connect, it will indicate an unhealthy/down/unreachable service instead of passing on the assumption that it refers to a loopback address on a different host.
Ah, of course, thanks for explaining.
18794-config-health @ 766d15a572d9e8a023e2877f9171422af530efa1 LGTM.
Updated by Tom Clegg over 2 years ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|9cdec66d5b0238e9d99c1e8e1d5c8571d798689d.