Story #18727

closed

Avoid configuration skew between different services and hosts

Added by Tom Clegg 5 months ago. Updated 28 days ago.

Status: Resolved
Priority: Normal
Assigned To: -
Target version: -
Start date: 03/01/2022
Due date: 05/31/2022
% Done: 0%
Estimated time:
Story points: -
Release:
Release relationship: Auto

Description

Background: With multiple back-end service components running on multiple hosts, it is possible to have services running with different configurations. In many cases this happens by accident, and it causes problems for users and clients that are hard to diagnose.

Examples:
  • If RailsAPI is not restarted after changing /etc/arvados/config.yml, it will continue using the old config -- except that when Passenger starts new worker threads, they use the new config.
  • If the instance types are updated, and controller is restarted but arvados-dispatch-cloud is not restarted, clients will see that the updated types are available, but scheduling decisions will be made based on the old types.
  • If a Keep volume changes from read-only to read-write, and controller/RailsAPI are restarted but the relevant keepstore processes are not restarted, clients will waste time trying to write to the volume (which keepstore will refuse to do) before falling back to different volumes/servers.

There are two main things we can do to minimize occurrences of such problems:
  1. Automatically detect when a version mismatch exists, and report this to the operator (via logs, health checks, metrics).
  2. Provide an easy mechanism for updating the configuration cluster-wide and signalling all services to restart/reload config as needed, thereby eliminating the most common causes of version mismatches (i.e., the operator fails to update config on all nodes or incorrectly identifies which services need to be restarted).
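
The detection idea in item 1 can be sketched roughly as follows. This is only an illustration, assuming each service is able to report a digest of the config it actually loaded; the function names (config_digest, find_config_skew) and the service@host labels are hypothetical and are not part of Arvados.

```python
import hashlib
from collections import defaultdict


def config_digest(config_bytes: bytes) -> str:
    """Return a SHA-256 hex digest identifying one exact config content."""
    return hashlib.sha256(config_bytes).hexdigest()


def find_config_skew(reported: dict[str, str]) -> dict[str, list[str]]:
    """Group services by the config digest they report.

    Returns an empty dict when all services agree (or none reported),
    otherwise a mapping of digest -> sorted list of services using it.
    """
    groups = defaultdict(list)
    for service, digest in reported.items():
        groups[digest].append(service)
    if len(groups) <= 1:
        return {}
    return {digest: sorted(services) for digest, services in groups.items()}


if __name__ == "__main__":
    # Hypothetical scenario: instance types were updated, but one service
    # was not restarted and still runs with the old config.
    old = config_digest(b"InstanceTypes: {small: {}}\n")
    new = config_digest(b"InstanceTypes: {small: {}, large: {}}\n")
    reported = {
        "controller@host1": new,
        "arvados-dispatch-cloud@host2": old,
        "keepstore@host3": new,
    }
    for digest, services in find_config_skew(reported).items():
        print(digest[:12], ", ".join(services))
```

A health check built this way would fail whenever find_config_skew returns a non-empty result, and the grouped output tells the operator which services need a restart or reload.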

Related issues

  • Related to Arvados - Story #18256: Design bottom-up configuration/discovery strategy (Resolved)
  • Related to Arvados Epics - Story #18685: Synchronize configuration on multi-node cluster (New, 07/01/2022, 10/31/2022)
  • Related to Arvados - Feature #18768: Design for ability to check what config is in use across the cluster (Resolved, Tom Clegg)
  • Related to Arvados - Bug #16345: Health check checks for clock and version skew (Resolved, Tom Clegg, 05/11/2022)
  • Related to Arvados - Feature #18794: cluster health check fails if some services are using different configs (Resolved, Tom Clegg, 05/06/2022)

#1

Updated by Tom Clegg 5 months ago

  • Related to Story #18256: Design bottom-up configuration/discovery strategy added

#2

Updated by Tom Clegg 5 months ago

  • Related to Story #18685: Synchronize configuration on multi-node cluster added

#3

Updated by Peter Amstutz 4 months ago

  • Related to Feature #18768: Design for ability to check what config is in use across the cluster added

#4

Updated by Peter Amstutz 4 months ago

  • Start date set to 03/01/2022
  • Due date set to 05/31/2022

#5

Updated by Peter Amstutz 4 months ago

  • Related to Bug #16345: Health check checks for clock and version skew added

#6

Updated by Peter Amstutz 3 months ago

  • Related to Feature #18794: cluster health check fails if some services are using different configs added

#7

Updated by Peter Amstutz 28 days ago

  • Status changed from New to Resolved