Feature #18794


cluster health check fails if some services are using different configs

Added by Tom Clegg over 2 years ago. Updated over 1 year ago.


The overall health check returned by the health aggregator reports an “error” state unless the /etc/arvados/config.yml file is identical on all cluster hosts and all services are using that version.
  • Error message is “some services have mismatched configuration: serviceA, serviceB config file hash does not match latest version with timestamp X”
  • Whether or not the health aggregator service is installed/configured, `arvados-server health` (when run on a cluster node) returns the same result the health aggregator would return, as text on stdout, with a non-zero exit code to indicate failure.
  • `arvados-server check` (new subcommand) loads the config file, fetches all the individual services’ health checks just as the health aggregator service would, fetches the config metrics and compares hashes, and displays the result on stdout
    • if all hashes match, config is OK, regardless of timestamps
    • if some hashes don’t match, the most recent timestamp is assumed to be the real/intended config
  • `arvados-server config-dump` output includes the content hash and timestamp of the source yaml file it loaded (the timestamp is the current time if the source is a pipe)
  • all services gain two “config” metrics:
    • the hash and timestamp of the config file the current config was loaded from
    • the hash and timestamp of the current config file on disk
  • services also gain a "restarted at" or "config loaded at" timestamp metric with the last time the service restarted and/or reloaded its config
  • RailsAPI and Workbench1: the only practical way to know which config versions are in use by various passenger workers is to implement auto-reload:
    • immediately before loading config, and periodically in a background job, if tmp/restart.txt is older than the config file, touch restart.txt using the config file timestamp
    • health check endpoint reports an error if the tmp/restart.txt file is older than the config file
    • config version endpoint handler does not try to figure out whether all passenger workers use the same config as the one handling the request – if there’s a mismatch, assume it will be flagged by the health check endpoint
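The restart.txt rule for RailsAPI and Workbench1 can be sketched as follows. This is only an illustration of the logic, written in Python rather than in the Rails code where it would actually live. The function names `sync_restart_txt` and `restart_txt_is_stale` are hypothetical, and treating a missing restart.txt as stale is an assumption the ticket does not state.

```python
import os
import tempfile


def restart_txt_is_stale(config_path, restart_path):
    """Health-check side: True if restart.txt is older than the config file.

    Assumption: a missing restart.txt is treated as stale.
    """
    try:
        restart_mtime = os.stat(restart_path).st_mtime_ns
    except FileNotFoundError:
        return True
    return restart_mtime < os.stat(config_path).st_mtime_ns


def sync_restart_txt(config_path, restart_path):
    """Touch restart.txt with the config file's timestamp if it is stale.

    Intended to run immediately before loading config and periodically
    in a background job. Returns True if restart.txt was touched.
    """
    if not restart_txt_is_stale(config_path, restart_path):
        return False
    config_mtime = os.stat(config_path).st_mtime_ns
    # Create the file if it does not exist yet, without truncating it.
    with open(restart_path, "a"):
        pass
    # Use the config file's timestamp rather than the current time.
    os.utime(restart_path, ns=(config_mtime, config_mtime))
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config = os.path.join(tmp, "config.yml")
        restart = os.path.join(tmp, "restart.txt")
        open(config, "w").close()
        print(sync_restart_txt(config, restart))  # restart.txt missing, so touched
        print(restart_txt_is_stale(config, restart))
```

Copying the config file's timestamp, instead of using the current time, means restart.txt never appears newer than a config file that is written shortly afterwards.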

Note that this does not address discrepancies between the Nginx configuration in use and the Nginx configuration on disk, since Nginx configuration is maintained manually.

TBD: It is possible for a host to have an outdated config file on disk, even though all services on that host are using the same correct/latest config as other hosts (e.g., after an operator mistakenly copies an old config file to the host). How should this be reported so the operator can make sense of it – “some services have out-of-date configuration: serviceA (config file on disk); serviceB (config file on disk)”?
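One possible way to combine the comparison rule (matching hashes are OK regardless of timestamps; otherwise the most recent timestamp wins) with the two per-service config metrics is sketched below. It is not a decision on the TBD above, only an illustration. The `ConfigReport` type, its field names, and `check_config` are hypothetical, and the message wording follows the example proposed in the TBD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigReport:
    """Hypothetical per-service summary of the two "config" metrics."""
    service: str
    loaded_hash: str    # hash of the config the service is running with
    loaded_time: float  # timestamp of that config file
    disk_hash: str      # hash of the config file currently on disk
    disk_time: float    # timestamp of the file on disk


def check_config(reports):
    """Return (ok, message) for a list of ConfigReport objects."""
    # Record the newest timestamp seen for each distinct content hash.
    newest = {}
    for r in reports:
        for h, t in ((r.loaded_hash, r.loaded_time), (r.disk_hash, r.disk_time)):
            newest[h] = max(t, newest.get(h, t))
    # If all hashes match, the config is OK regardless of timestamps.
    if len(newest) <= 1:
        return True, "OK"
    # Otherwise the most recent timestamp is assumed to be the intended config.
    intended = max(newest, key=newest.get)
    stale = []
    for r in reports:
        if r.loaded_hash != intended:
            stale.append(f"{r.service} (running config)")
        elif r.disk_hash != intended:
            stale.append(f"{r.service} (config file on disk)")
    return False, "some services have out-of-date configuration: " + "; ".join(stale)


if __name__ == "__main__":
    reports = [
        ConfigReport("serviceA", "h1", 100.0, "h1", 100.0),
        ConfigReport("serviceB", "h1", 100.0, "h0", 50.0),
    ]
    print(check_config(reports))
```

The sketch inherits the weakness described in the TBD: if an old file is copied to a host without preserving its timestamp, it looks like the newest version, so a timestamp-based rule alone cannot tell the operator which copy is the intended one.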

Subtasks 1 (0 open, 1 closed)

Task #18952: Review 18794-config-health (Resolved, Ward Vandewege, 05/06/2022)

Related issues

Related to Arvados - Feature #18768: Design for ability to check what config is in use across the cluster (Resolved, Tom Clegg)
Related to Arvados Epics - Idea #18727: Avoid configuration skew between different services and hosts (Resolved, 03/01/2022, 05/31/2022)
