Idea #18256
Closed
Design bottom-up configuration/discovery strategy
Description
- Investigate orchestration
Certain configuration steps have dependencies on configuration of other nodes, which requires orchestration rather than just top-down configuration.
e.g. need to know how to contact postgres
Javier: wants bottom-up configuration: when services come up, they contact the configuration server (consul?) to get their configuration and update the service entry.
Tom: "join" command (state?) to add a service to Arvados
Use pre-shared key / token for machines to identify themselves. Or nodes generate their own random ID and there's an approval step.
Tom: would be cool if nodes can self-configure which services they run
"Join" state gets a list of services that node with this unique ID should be running, can be changed on the fly.
Stephen: could have a discovery mode
Ward: restrict to private network
Javier: Controller should hold the central configuration
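To make the "join" idea above concrete, here is a minimal in-memory sketch, not a proposed implementation. It assumes a central registry that either trusts a pre-shared token or parks a self-generated node ID until an operator approves it, and that hands back the list of services the node should run. All names (`JoinRegistry`, `join`, `approve`, `assign`, `new_node_id`) are hypothetical and do not correspond to any existing Arvados or Consul API.

```python
import secrets


def new_node_id():
    """Generate a random ID for a node that has no pre-shared token."""
    return secrets.token_hex(8)


class JoinRegistry:
    """Hypothetical stand-in for the central configuration holder."""

    def __init__(self, preshared_token):
        self.preshared_token = preshared_token
        self.pending = set()     # node IDs waiting for operator approval
        self.assignments = {}    # approved node ID -> list of service names

    def join(self, node_id, token=None):
        """Return the services this node should run, or None if not yet approved."""
        if token is not None and secrets.compare_digest(
            token.encode(), self.preshared_token.encode()
        ):
            # A valid pre-shared token approves the node immediately.
            self.assignments.setdefault(node_id, [])
        if node_id not in self.assignments:
            self.pending.add(node_id)
            return None
        return list(self.assignments[node_id])

    def approve(self, node_id):
        """Operator approval step for a node that joined with a random ID."""
        self.pending.discard(node_id)
        self.assignments.setdefault(node_id, [])

    def assign(self, node_id, services):
        """Change the service list for an approved node on the fly."""
        if node_id not in self.assignments:
            raise KeyError(node_id)
        self.assignments[node_id] = list(services)
```

A node would call `join()` again whenever it wants to pick up a changed service list; restricting who can reach the registry (e.g. to a private network) is outside this sketch.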
Updated by Peter Amstutz over 3 years ago
- Description updated (diff)
- Subject changed from Better provision script to Design bottom-up configuration/discovery strategy
Updated by Peter Amstutz about 3 years ago
- Target version set to 2022-03-02 sprint
Updated by Peter Amstutz about 3 years ago
Specific challenges
- Install script works for installing
- Next thing people need to do is tweak the configuration
- Involves updating the config file & distributing the new copy to nodes & restarting the appropriate services
- Doing this correctly is not so obvious for novice admins
- Need a recommended/supported method for managing the config centrally
Goals
- "arv edit" on the config (on saving valid config, it is distributed to all the nodes automatically)
- Services reload the config automatically on change
- If config is missing or broken on service start, services idle and wait for a valid config to show up instead of exiting with an error
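The last goal could look roughly like the sketch below: instead of exiting, the service polls until a parseable, valid config appears. This is only an illustration under stated assumptions: it reads JSON to stay dependency-free (the real config file is YAML), and the validity rule (a top-level `Clusters` key) and the function names `load_config` and `wait_for_config` are hypothetical.

```python
import json
import time


def load_config(path):
    """Return the parsed config, or None if the file is missing or invalid."""
    try:
        with open(path, encoding="utf-8") as f:
            cfg = json.load(f)
    except (OSError, ValueError):
        # Missing/unreadable file, or content that does not parse.
        return None
    # Hypothetical validity rule, for illustration only.
    if not isinstance(cfg, dict) or "Clusters" not in cfg:
        return None
    return cfg


def wait_for_config(path, poll_seconds=1.0, max_attempts=None):
    """Idle until a valid config shows up instead of exiting with an error.

    max_attempts=None polls forever; a finite value is useful for testing.
    Returns the config, or None if max_attempts was exhausted.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        cfg = load_config(path)
        if cfg is not None:
            return cfg
        attempts += 1
        time.sleep(poll_seconds)
    return None
```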
Implementation ideas
Updated by Tom Clegg about 3 years ago
There's a lot of possible stuff to do here. For sake of discussion I'll propose one specific feature to implement first:
Operator can run an "update cluster configuration to X" command.
Something like
arvados-server config-update ./new-config-file.yml
and/or
arvados-server config-edit # works like 'arv edit'
Constraints that are acceptable in the first version to help us get it off the ground, although we intend to fix them in future versions:
- Operator must be logged in to a working system node to do this. (In future versions we will want to support / recommend doing this from outside -- an orchestration machine, the operator's own computer, etc.)
- If some of the system nodes are currently not reachable from the host where the config-update command runs, those nodes do not get updated, and the operator is warned that they need to re-run the config-update command to sync those nodes. (In future versions the nodes will automatically sync up when connectivity is restored.)
- Assuming the update is not a no-op (i.e., the new config actually differs from the old config), all services get restarted whether or not it's strictly necessary. (In future only the services that actually use the changed configs will get restarted, signalled to reload their config without restarting, etc.)
- the operator is a human, who sees "couldn't update X" error messages and follows up by fixing/retrying until it works
- the operator is orchestration software, which sees a non-zero exit code, reports the failure to a human somehow, and automatically retries until it works
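A rough sketch of the first-version behaviour described above, assuming a hypothetical `send(node, config)` transport callback that raises `OSError` for an unreachable node. It pushes to every node, warns about the ones that were missed, and returns a non-zero exit code so that either a human or orchestration software can retry. None of these function names exist in `arvados-server`; they are for illustration only.

```python
import sys


def push_config(nodes, new_config, send):
    """Try to deliver new_config to every node; return the nodes that failed."""
    failed = []
    for node in nodes:
        try:
            send(node, new_config)
        except OSError as err:
            print(f"couldn't update {node}: {err}", file=sys.stderr)
            failed.append(node)
    return failed


def config_update(nodes, new_config, send):
    """Return a process-style exit code: 0 if all nodes are in sync, else 1."""
    failed = push_config(nodes, new_config, send)
    if failed:
        print(
            "re-run config-update to sync: " + ", ".join(failed),
            file=sys.stderr,
        )
        return 1
    return 0


def retry_until_synced(nodes, new_config, send, max_tries=5):
    """What an orchestration tool might do: retry on a non-zero exit code.

    Returns the number of attempts used, or None if it gave up.
    """
    for attempt in range(1, max_tries + 1):
        if config_update(nodes, new_config, send) == 0:
            return attempt
    return None
```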
The "config-edit" command is slightly more complicated in that (with the basic implementation described above) we can't guarantee that we can access the most recent config (e.g., on a 2-node cluster, config is edited while server B is offline, then edited again while server A is offline → config-edit would prompt the operator to edit an old config, losing the first round of changes).
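The lost-update scenario can be reproduced with a toy model. The sketch below assumes each node simply stores its own copy of the config as a dictionary and that an edit starts from whatever copy a reachable node holds; `edit_on` and the keys `x`, `y`, `z` are hypothetical.

```python
def edit_on(reachable, all_nodes, change):
    """Edit the config as seen from the reachable nodes and push it to them only."""
    # The edit can only start from a copy held by a node it can reach.
    base = dict(all_nodes[reachable[0]])
    base.update(change)
    for name in reachable:
        all_nodes[name] = dict(base)
    return base


# Two-node cluster, both nodes start with the same config.
nodes = {"A": {"x": 1}, "B": {"x": 1}}

edit_on(["A"], nodes, {"y": 2})  # first edit, server B is offline
edit_on(["B"], nodes, {"z": 3})  # second edit, server A is offline

# B's copy never saw the first edit, so the second edit started from old data.
first_edit_missing_on_b = "y" not in nodes["B"]

# If B's newer copy is later pushed to A, the first round of changes is lost.
after_sync = dict(nodes["B"])
```

Detecting this would need some notion of which copy is newest (for example a version or timestamp), which the basic implementation described above does not have.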
Updated by Peter Amstutz about 3 years ago
Responding to config changes
- On start, the service checks for config. If it is missing or broken, the service idles and continues checking until it gets a good config file.
- During operation, the service checks for config changes. If the config changes, the service loads the config file and validates it. If it is valid, the service restarts. If it is not valid, the service does not restart, but it adds a health check warning that the config file is bad.
- prometheus metric reports 0 or 1 whether the config on disk matches the config in memory
- health check reports md5sum and timestamp of the config file on disk
- health check aggregator can check if the sums don't match
- add a command line tool to arvados-client which reports the health check results of all the services
- the public config published by controller should include a timestamp for config last modified time
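The checksum-related items in this list could be sketched as follows. This is an illustration only: `config_health`, `config_matches_metric` and `find_skew` are hypothetical names, and the report layout (a dictionary with `md5sum` and `mtime`) is an assumption, not the actual health check format.

```python
import hashlib
import os


def config_health(path):
    """Report the md5sum and modification time of the config file on disk."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return {"md5sum": digest, "mtime": os.path.getmtime(path)}


def config_matches_metric(disk_md5, loaded_md5):
    """Value for a 0/1 metric: does the config on disk match the one in memory?"""
    return 1 if disk_md5 == loaded_md5 else 0


def find_skew(reports):
    """Group services by reported md5sum.

    `reports` maps a service name to its health dict. Returns an empty dict
    when every service reports the same sum, otherwise md5sum -> services.
    """
    by_sum = {}
    for service, report in reports.items():
        by_sum.setdefault(report["md5sum"], []).append(service)
    return by_sum if len(by_sum) > 1 else {}
```

An aggregator or command line tool would collect one report per service and flag the cluster as skewed whenever `find_skew` returns a non-empty result.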
Updated by Tom Clegg about 3 years ago
- Related to Idea #18727: Avoid configuration skew between different services and hosts added
Updated by Peter Amstutz about 3 years ago
- Target version changed from 2022-03-02 sprint to 2022-02-16 sprint