Feature #18794

Updated by Tom Clegg over 2 years ago

The overall health check returned by the health aggregator reports an “error” state unless the /etc/arvados/config.yml file is identical on all cluster hosts and all services are using that version.
 * Error message is “some services have mismatched out-of-date configuration: serviceA, serviceB config file hash does not match latest version with timestamp X”
 * Whether or not the health aggregator service is installed/configured, `arvados-server health` (when run on a cluster node) returns the same result the health aggregator would return, as text on stdout, with a non-zero exit code to indicate failure.
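
A minimal sketch of how the aggregate config check might choose between OK and error. It applies the rule stated in the list that follows: matching hashes are OK regardless of timestamps, otherwise the newest timestamp is taken as the intended config. The `ConfigVersion` type, the `check_config` function, and the use of a SHA-256 hex string are assumptions for illustration; this description does not specify a hash algorithm or data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigVersion:
    """Hash and timestamp a service reports for the config it loaded."""
    service: str      # hypothetical service name, e.g. "serviceA"
    sha256: str       # assumed hash format; not specified in this ticket
    timestamp: float  # seconds since the epoch


def check_config(versions):
    """Return ("OK", "") or ("error", message) for a list of ConfigVersion."""
    hashes = {v.sha256 for v in versions}
    if len(hashes) <= 1:
        # All hashes match: config is OK, regardless of timestamps.
        return "OK", ""
    # Hashes differ: assume the most recent timestamp is the intended config.
    latest = max(versions, key=lambda v: v.timestamp)
    stale = sorted(v.service for v in versions if v.sha256 != latest.sha256)
    message = "some services have out-of-date configuration: " + ", ".join(stale)
    return "error", message
```

A command-line wrapper could print the message on stdout and exit non-zero whenever the returned state is "error".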

 * `arvados-server health` (new subcommand) loads config file, fetches all the individual services’ health checks just as the health aggregator service would, fetches the config metrics and compares hashes, and displays the result on stdout 
 ** if all hashes match, config is OK, regardless of timestamps 
 ** if some hashes don’t match, the most recent timestamp is assumed to be the real/intended config 
 * `arvados-server config-dump` output includes the content hash and timestamp of the source yaml file it loaded (if the source is a pipe, the timestamp is the current time) 
 * all services gain two “config” metrics: 
 ** the hash and timestamp of the config file the current config was loaded from 
 ** the hash and timestamp of the current config file on disk 
 * RailsAPI and Workbench1: the only practical way to know which config versions are in use by various passenger workers is to implement auto-reload: 
 ** immediately before loading config, and periodically in a background job, if tmp/restart.txt is older than the config file, touch restart.txt using the config file timestamp 
 ** health check endpoint reports an error if the tmp/restart.txt file is older than the config file 
 ** config version endpoint handler does not try to figure out whether all passenger workers use the same config as the one handling the request – if there’s a mismatch, assume it will be flagged by the health check endpoint 
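
The restart.txt handling described for RailsAPI and Workbench1 can be sketched as follows. This is illustrative Python rather than the Rails code itself. The function names are hypothetical, and creating restart.txt when it is missing is an assumption, since the description only covers the case where the file is older than the config.

```python
import os


def sync_restart_file(config_path, restart_path):
    """Touch restart_path with the config file's mtime if it is older.

    A missing restart_path is treated as older (an assumption).
    Returns True if restart_path was updated.
    """
    config_mtime = os.stat(config_path).st_mtime_ns
    try:
        restart_mtime = os.stat(restart_path).st_mtime_ns
    except FileNotFoundError:
        restart_mtime = None
    if restart_mtime is not None and restart_mtime >= config_mtime:
        return False
    # Create the file if needed, then copy the config timestamp onto it.
    with open(restart_path, "a"):
        pass
    os.utime(restart_path, ns=(config_mtime, config_mtime))
    return True


def restart_is_stale(config_path, restart_path):
    """Health-check side: True if restart.txt is missing or older than the config."""
    config_mtime = os.stat(config_path).st_mtime_ns
    try:
        return os.stat(restart_path).st_mtime_ns < config_mtime
    except FileNotFoundError:
        return True
```

Nanosecond timestamps are used so that a freshly synced restart.txt does not look older than the config file because of floating-point rounding.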

 Note this does not address discrepancies between the Nginx configuration in use and the one on disk, since Nginx configuration is maintained manually. 

 TBD: It is possible for a host to have an outdated config file on disk, even though all services on that host are using the same correct/latest config as other hosts (e.g., after an operator mistakenly copies an old config file to the host). How should this be reported so the operator can make sense of it – “some services have out-of-date configuration: serviceA (config file on disk); serviceB (config file on disk)”?
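
To make the question concrete, here is one possible shape for such a report, built from the two per-service metrics (loaded config and config on disk). It is a sketch of the wording floated above, not a decision. `file_version`, `out_of_date_report`, and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
import os


def file_version(path):
    """Return (hex digest, mtime) for the config file at path."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest, os.stat(path).st_mtime


def out_of_date_report(services, latest_hash):
    """Build a report from {name: (loaded_hash, on_disk_hash)}.

    Returns "" when every loaded and on-disk hash equals latest_hash.
    """
    problems = []
    for name in sorted(services):
        loaded, on_disk = services[name]
        if loaded != latest_hash:
            # The running service itself is using an old config.
            problems.append(name)
        elif on_disk != latest_hash:
            # The service is current, but the file on its host is not.
            problems.append(name + " (config file on disk)")
    if not problems:
        return ""
    return "some services have out-of-date configuration: " + "; ".join(problems)
```

With this shape, a service running the latest config on a host whose file was overwritten with an old copy is listed with the "(config file on disk)" suffix, while a service actually running an old config is listed by name alone.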