Idea #18727

closed

Avoid configuration skew between different services and hosts

Added by Tom Clegg almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assigned To: -
Target version: -
Start date: 03/01/2022
Due date: 05/31/2022
Story points: -
Release:
Release relationship: Auto

Description

Background: With multiple back-end service components running on multiple hosts, it is possible to have services running with different configurations. In many cases, this happens by accident, and ends up causing problems for users/clients that are hard to diagnose.

Examples:
  • If RailsAPI is not restarted after changing /etc/arvados/config.yml, it will continue using the old config -- except that when passenger starts new worker threads, they use the new config.
  • If the instance types are updated, and controller is restarted but arvados-dispatch-cloud is not restarted, clients will see that the updated types are available, but scheduling decisions will be made based on the old types.
  • If a Keep volume changes from read-only to read-write, and controller/RailsAPI are restarted but the relevant keepstore processes are not restarted, clients will waste time trying to write to the volume (which keepstore will refuse to do) before falling back to different volumes/servers.
There are two main things we can do to minimize occurrences of such problems:
  1. Automatically detect when a version mismatch exists, and report this to the operator (via logs, health checks, metrics)
  2. Provide an easy mechanism for updating the configuration cluster-wide and signalling all services to restart/reload config as needed, thereby eliminating the most common causes of version mismatches (i.e., the operator fails to update config on all nodes or incorrectly identifies which services need to be restarted)
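
As an illustration of point 1, here is a minimal sketch of mismatch detection. It assumes each service can report a SHA-256 digest of the config file contents it actually loaded, and that a checker collects those digests and flags any disagreement. The function names, service names, and config contents are all hypothetical, and this is not necessarily how the Arvados health check does it.

```python
import hashlib
from collections import defaultdict


def config_digest(config_bytes: bytes) -> str:
    """Return a hex SHA-256 digest of the config contents a service loaded."""
    return hashlib.sha256(config_bytes).hexdigest()


def find_config_skew(reported: dict[str, str]) -> dict[str, list[str]]:
    """Group services by reported digest.

    `reported` maps a service name to the digest of the config it is running
    with. Returns an empty dict when all services agree, otherwise a mapping
    from each distinct digest to the sorted list of services using it.
    """
    by_digest = defaultdict(list)
    for service, digest in sorted(reported.items()):
        by_digest[digest].append(service)
    if len(by_digest) <= 1:
        return {}
    return dict(by_digest)


# Hypothetical scenario: controller was restarted after a config change,
# but the dispatcher is still running with the old file contents.
old = config_digest(b"placeholder config, old instance types\n")
new = config_digest(b"placeholder config, new instance types\n")

skew = find_config_skew({
    "controller": new,
    "railsapi": new,
    "dispatch-cloud": old,
})
for digest, services in skew.items():
    print(digest[:12], services)
```

A checker like this can only say that services disagree, not which config is the intended one; an operator (or the mechanism in point 2) still has to decide which services need a restart.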

Related issues 5 (1 open, 4 closed)

Related to Arvados - Idea #18256: Design bottom-up configuration/discovery strategy (Resolved)
Related to Arvados Epics - Idea #18685: Synchronize configuration on multi-node cluster (New)
Related to Arvados - Feature #18768: Design for ability to check what config is in use across the cluster (Resolved, Tom Clegg)
Related to Arvados - Bug #16345: Health check checks for clock and version skew (Resolved, Tom Clegg, 05/11/2022)
Related to Arvados - Feature #18794: cluster health check fails if some services are using different configs (Resolved, Tom Clegg, 05/06/2022)
Actions #1

Updated by Tom Clegg almost 3 years ago

  • Related to Idea #18256: Design bottom-up configuration/discovery strategy added
Actions #2

Updated by Tom Clegg almost 3 years ago

  • Related to Idea #18685: Synchronize configuration on multi-node cluster added
Actions #3

Updated by Peter Amstutz almost 3 years ago

  • Related to Feature #18768: Design for ability to check what config is in use across the cluster added
Actions #4

Updated by Peter Amstutz almost 3 years ago

  • Start date set to 03/01/2022
  • Due date set to 05/31/2022
Actions #5

Updated by Peter Amstutz almost 3 years ago

  • Related to Bug #16345: Health check checks for clock and version skew added
Actions #6

Updated by Peter Amstutz over 2 years ago

  • Related to Feature #18794: cluster health check fails if some services are using different configs added
Actions #7

Updated by Peter Amstutz over 2 years ago

  • Status changed from New to Resolved