Idea #18727

closed

Avoid configuration skew between different services and hosts

Added by Tom Clegg about 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assigned To: -
Target version: -
Story points: -
Release:
Release relationship: Auto

Description

Background: With multiple back-end service components running on multiple hosts, it is possible to have services running with different configurations. In many cases this happens by accident, and it causes hard-to-diagnose problems for users and clients.

Examples:
  • If RailsAPI is not restarted after /etc/arvados/config.yml changes, it will continue using the old config -- except that when Passenger starts new worker threads, they use the new config.
  • If the instance types are updated, and controller is restarted but arvados-dispatch-cloud is not restarted, clients will see that the updated types are available, but scheduling decisions will be made based on the old types.
  • If a Keep volume changes from read-only to read-write, and controller/RailsAPI are restarted but the relevant keepstore processes are not restarted, clients will waste time trying to write to the volume (which keepstore will refuse to do) before falling back to different volumes/servers.
There are two main things we can do to minimize occurrences of such problems:
  1. Automatically detect when a version mismatch exists, and report this to the operator (via logs, health checks, metrics)
  2. Provide an easy mechanism for updating the configuration cluster-wide and signalling all services to restart/reload config as needed, thereby eliminating the most common causes of version mismatches (i.e., the operator fails to update config on all nodes or incorrectly identifies which services need to be restarted)
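
The detection step in item 1 can be sketched as follows. This is a minimal illustration, not the mechanism Arvados actually uses: it assumes each service can report a SHA-256 digest of the config file contents it loaded at startup, and that a health check collects those digests. The function names (config_digest, find_config_skew) and the "host:service" labels are hypothetical.

```python
import hashlib
from collections import defaultdict


def config_digest(config_bytes: bytes) -> str:
    """Return a SHA-256 hex digest of the raw config file contents."""
    return hashlib.sha256(config_bytes).hexdigest()


def find_config_skew(reported: dict[str, str]) -> dict[str, list[str]]:
    """Group services by the config digest they report.

    Returns an empty dict when all services agree; otherwise returns a
    mapping of digest -> sorted list of services running with that digest.
    """
    by_digest = defaultdict(list)
    for service, digest in reported.items():
        by_digest[digest].append(service)
    if len(by_digest) <= 1:
        return {}
    return {digest: sorted(services) for digest, services in by_digest.items()}


if __name__ == "__main__":
    # Hypothetical scenario: the config was edited, controller was restarted,
    # but arvados-dispatch-cloud still runs with the config it loaded earlier.
    old = config_digest(b"InstanceTypes: {small: {VCPUs: 1}}\n")
    new = config_digest(b"InstanceTypes: {small: {VCPUs: 2}}\n")
    reported = {
        "host1:controller": new,
        "host1:arvados-dispatch-cloud": old,
        "host2:keepstore": new,
    }
    for digest, services in find_config_skew(reported).items():
        # An operator-facing report: which services share which config.
        print(digest[:12], services)
```

Hashing the raw file bytes is the simplest choice, but it flags whitespace-only or comment-only edits as skew; hashing a normalized form of the loaded config would avoid that at the cost of more code.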

Related issues

Related to Arvados - Idea #18256: Design bottom-up configuration/discovery strategy (Resolved)
Related to Arvados Epics - Idea #18685: Synchronize configuration on multi-node cluster (New)
Related to Arvados - Feature #18768: Design for ability to check what config is in use across the cluster (Resolved, Tom Clegg)
Related to Arvados - Bug #16345: Health check checks for clock and version skew (Resolved, Tom Clegg, 05/11/2022)
Related to Arvados - Feature #18794: cluster health check fails if some services are using different configs (Resolved, Tom Clegg, 05/06/2022)