Story #18256

closed

Design bottom-up configuration/discovery strategy

Added by Peter Amstutz 11 months ago. Updated 6 months ago.

Status: Resolved
Priority: Normal
Assigned To: -
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Story points: -

Description

  • Investigate orchestration

Certain configuration steps have dependencies on configuration of other nodes, which requires orchestration rather than just top-down configuration.

e.g. need to know how to contact postgres

Javier: want bottom-up configuration, when services come up they contact the configuration server (consul?) to get the configuration & update the service entry.

Tom: "join" command (state?) to add a service to Arvados

Use pre-shared key / token for machines to identify themselves. Or nodes generate their own random ID and there's an approval step.

Tom: would be cool if nodes can self-configure which services they run

"Join" state gets a list of services that node with this unique ID should be running, can be changed on the fly.
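To make the "join" idea concrete, here is a minimal sketch of what the configuration server side might track. Everything in it is an assumption for discussion: the `JoinRegistry` class, its method names, and the in-memory storage are hypothetical, and the service names are only examples. A node either presents the pre-shared key and is admitted immediately, or presents a self-generated random ID and waits for an operator to approve it.

```python
import secrets


def new_node_id():
    """Random ID a node could generate for itself on first boot."""
    return secrets.token_hex(16)


class JoinRegistry:
    """Hypothetical in-memory registry of known nodes and their assigned services."""

    def __init__(self, preshared_key):
        self._key = preshared_key
        self._approved = set()
        self._pending = set()
        self._services = {}  # node ID -> list of service names

    def join(self, node_id, key=None):
        """Return the services this node should run, or None while awaiting approval."""
        if key is not None:
            # Constant-time comparison of the presented key with the pre-shared key.
            if not secrets.compare_digest(key.encode(), self._key.encode()):
                raise PermissionError("bad pre-shared key")
            self.approve(node_id)
        elif node_id not in self._approved:
            # Unknown node without a key: park it until an operator approves it.
            self._pending.add(node_id)
            return None
        return list(self._services.get(node_id, []))

    def approve(self, node_id):
        """Operator approval step for a node that joined without a key."""
        self._pending.discard(node_id)
        self._approved.add(node_id)

    def assign(self, node_id, services):
        """Change the service list for a node; takes effect on its next join call."""
        self._services[node_id] = list(services)

    def pending(self):
        return sorted(self._pending)
```

Because `join` returns the current assignment on every call, a node that polls it picks up changes to its service list on the fly, which is the behaviour described above.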

Stephen: could have a discovery mode

Ward: restrict to private network

Javier: Controller should hold the central configuration


Related issues

Related to Arvados Epics - Story #18727: Avoid configuration skew between different services and hosts (Resolved, start date 03/01/2022, due date 05/31/2022)

Actions #1

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz 11 months ago

  • Description updated (diff)
  • Subject changed from Better provision script to Design bottom-up configuration/discovery strategy
Actions #3

Updated by Peter Amstutz 6 months ago

  • Target version set to 2022-03-02 sprint
Actions #4

Updated by Peter Amstutz 6 months ago

Specific challenges

  • Install script works for installing
  • Next thing people need to do is tweak the configuration
  • Involves updating the config file & distributing the new copy to nodes & restarting the appropriate services
    • Doing this correctly is not so obvious for novice admins
  • Need a recommended/supported method for managing the config centrally

Goals

  • "arv edit" on the config (on saving valid config, it is distributed to all the nodes automatically)
  • Services reload the config automatically on change
  • If config is missing or broken on service start, services idle and wait for a valid config to show up instead of exiting with an error
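The last goal (idle instead of exiting) could look roughly like the sketch below. It is only an illustration under stated assumptions: `load_config` and `wait_for_config` are hypothetical names, the file is parsed as JSON purely to keep the example dependency-free, and the validity rule (a top-level mapping with a "Clusters" key) is a stand-in for real validation.

```python
import json
import time


def load_config(path):
    """Return the parsed config, or None if the file is missing or broken."""
    try:
        with open(path) as f:
            cfg = json.load(f)
    except (OSError, ValueError):
        # Missing/unreadable file, or content that does not parse.
        return None
    # Hypothetical validity rule: a top-level mapping with a "Clusters" key.
    if not isinstance(cfg, dict) or "Clusters" not in cfg:
        return None
    return cfg


def wait_for_config(path, poll_seconds=5.0, max_polls=None, sleep=time.sleep):
    """Idle until a valid config shows up at `path`, then return it.

    With max_polls=None this waits forever, which is the behaviour the goal
    asks for; a finite max_polls returns None after that many failed checks.
    """
    polls = 0
    while True:
        cfg = load_config(path)
        if cfg is not None:
            return cfg
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return None
        sleep(poll_seconds)
```

A service would call `wait_for_config(...)` before starting its listeners, so a missing or broken file delays startup rather than producing a crash loop.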

Implementation ideas

Actions #5

Updated by Tom Clegg 6 months ago

There's a lot of possible stuff to do here. For sake of discussion I'll propose one specific feature to implement first:

Operator can run an "update cluster configuration to X" command.

Something like

arvados-server config-update ./new-config-file.yml

and/or

arvados-server config-edit    # works like 'arv edit'
Constraints that will be acceptable in the first version to help us get it off the ground, although we intend to lift them in future versions:
  • Operator must be logged in to a working system node to do this. (In future versions we will want to support / recommend doing this from outside -- an orchestration machine, the operator's own computer, etc.)
  • If some of the system nodes are currently not reachable from the host where the config-update command runs, those nodes do not get updated, and the operator is warned that they need to re-run the config-update command to sync those nodes. (In future versions the nodes will automatically sync up when connectivity is restored.)
  • Assuming the update is not a no-op (i.e., the new config actually differs from the old config), all services get restarted whether or not it's strictly necessary. (In future versions, only the services that actually use the changed configs will get restarted, or they will be signalled to reload their config without restarting, etc.)
The idea behind these constraints is that we can implement a config-update command that is easy to use, and reliable in the two obvious cases, i.e.,
  • the operator is a human, who sees "couldn't update X" error messages and follows up by fixing/retrying until it works
  • the operator is orchestration software, which sees a non-zero exit code, reports the failure to a human somehow, and automatically retries until it works
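A rough sketch of the fan-out behaviour under these constraints follows. It is not the arvados-server implementation: `config_update` here is a hypothetical Python function, and `push` stands in for whatever transport would actually copy the config to a node and restart its services. The point is only the contract: unreachable nodes produce a warning, and any failure produces a non-zero exit code so both kinds of operator know to retry.

```python
import sys


def config_update(new_config, nodes, push):
    """Push new_config to every node; return (exit_code, failed_nodes).

    `push(node, config)` is a hypothetical transport callable that raises
    OSError (or a subclass such as ConnectionError) if the node is unreachable.
    """
    failed = []
    for node in nodes:
        try:
            push(node, new_config)
        except OSError as err:
            # Human operators see this message; orchestration sees the exit code.
            print(
                f"warning: couldn't update {node}: {err}; "
                "re-run config-update to sync it",
                file=sys.stderr,
            )
            failed.append(node)
    return (1 if failed else 0), failed
```

Re-running the same command once the node is back is safe because pushing an identical config twice leaves every node in the same state.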

The "config-edit" command is slightly more complicated in that (with the basic implementation described above) we can't guarantee that we can access the most recent config (e.g., on a 2-node cluster, config is edited while server B is offline, then edited again while server A is offline → config-edit would prompt the operator to edit an old config, losing the first round of changes).

Actions #6

Updated by Peter Amstutz 6 months ago

Responding to config changes

  • on start, service checks for config. if missing or broken, service idles and continues checking until it gets a good config file.
  • during operation, service checks for config changes. if the config changes, the service loads the config file and validates it. if it is valid, the service restarts. if the config is not valid, the service does not restart, but it adds a health check warning that the config file is bad.
  • prometheus metric reports 0 or 1 whether the config on disk matches the config in memory
  • health check reports md5sum and timestamp of the config file on disk
    • health check aggregator can check if the sums don't match
  • add a command line tool to arvados-client which reports the health check results of all the services
  • the public config published by controller should include a timestamp for config last modified time
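The health check and aggregator bullets above might be sketched as follows. The function names and the report field names (`config_md5`, `config_mtime`, `config_matches_memory`) are made up for illustration and are not an existing Arvados API; the 0/1 field mirrors the proposed prometheus metric.

```python
import hashlib
import os


def config_health(path, loaded_md5):
    """Hypothetical per-service health report for the config file at `path`.

    `loaded_md5` is the md5 hex digest of the config the service currently
    has in memory.
    """
    with open(path, "rb") as f:
        disk_md5 = hashlib.md5(f.read()).hexdigest()
    return {
        "config_md5": disk_md5,
        "config_mtime": os.path.getmtime(path),
        # 1 if the config on disk matches the config in memory, else 0.
        "config_matches_memory": int(disk_md5 == loaded_md5),
    }


def find_skew(reports):
    """Aggregator side: group services by the md5 they report.

    Returns {} when every service reports the same sum, otherwise a mapping
    of md5 -> list of services reporting it.
    """
    by_sum = {}
    for service, report in reports.items():
        by_sum.setdefault(report["config_md5"], []).append(service)
    return by_sum if len(by_sum) > 1 else {}
```

A command line tool that reports the health of all services could call `find_skew` on the collected reports and print the groups when the result is non-empty.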
Actions #7

Updated by Tom Clegg 6 months ago

  • Related to Story #18727: Avoid configuration skew between different services and hosts added
Actions #8

Updated by Peter Amstutz 6 months ago

  • Status changed from New to Resolved
Actions #9

Updated by Peter Amstutz 6 months ago

  • Target version changed from 2022-03-02 sprint to 2022-02-16 sprint