Idea #20610 (Closed)
Installer supports deploying a load-balancer to horizontally scale the controller node
Description
As a first implementation we'll use nginx's load-balancing features, as it's already part of our stack and we know it pretty well.
Terraform code updates allow the optional creation of the load balancer node and updates Route53 accordingly.
When a balancer is requested, the installer takes the number of controller nodes to deploy (> 1) and also deploys a dedicated database node.
Installer script & Salt code also accept a balancer role that will install & configure nginx, node-exporter and not much else on the node, pointing nginx to the controller nodes.
The installer script may need a new sub-command to do rolling upgrades by disabling traffic to & upgrading one controller node at a time.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Future to Development 2023-06-21 sprint
Updated by Lucas Di Pentima over 1 year ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-06-21 sprint to Development 2023-07-05 sprint
Updated by Lucas Di Pentima over 1 year ago
Updates at 4ac2b36 - branch: 20610-installer-load-balancer
Changes
- Adds a balancer role with its nginx & letsencrypt confs.
- Tweaks controller nginx config when a balancer is enabled.
- Tweaks PG config to add multiple controller nodes' (and the dispatcher's) IP addresses to its ACL.
- Fixes salt bootstrapping, as version 3004 is no longer considered stable.
- Fixes ssh invocation to avoid forwarding environment variables that might confuse salt commands.
Pending
- Set up firewall rules on controller nodes so that they accept HTTP traffic only from the balancer.
- Add feature to temporarily disable controller nodes for rolling updates.
- Tweak prometheus/grafana configs to get controller nodes data.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-07-05 sprint to Development 2023-07-19 sprint
Updated by Peter Amstutz over 1 year ago
From discussion: not needed to fully automate rolling upgrade, but do need to document how we would do a manual rolling upgrade.
Updated by Peter Amstutz over 1 year ago
- Related to Feature #20733: installer supports automated rolling upgrade of load-balanced nodes added
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-07-19 sprint to Development 2023-08-02 sprint
Updated by Lucas Di Pentima over 1 year ago
Updates at c68b24086
Test run: developer-run-tests-doc-and-sdk-R: #1903
- Changes how arvados.sls needs to be configured by getting all values from local.params.
- Adds a ROLES variable that holds the role->nodes map from the configuration that NODES has. This allows us to DRY and auto-configure things without explicit user intervention.
- Adds a NODELIST variable and fixes the prometheus configuration to automatically set up every node's "node exporter" config, instead of having a hard-coded config that requires manual intervention when deploying a "non-standard" cluster.
- Further fixes the prometheus config to include controller backend nodes on load-balanced scenarios.
- Restricts HTTP access to controller backends by adding nginx ACLs allowing only the load balancer and prometheus nodes.
- Adds a DISABLED_CONTROLLER variable that the user can populate with a controller backend node name so that it gets kicked out of the backend pool, for rolling upgrades.
- Adds & fixes documentation reflecting all the above changes.
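The nginx ACL restriction described above could look roughly like this; the addresses and include path are hypothetical, not taken from the branch:

```nginx
# Included from the controller backend's nginx server block (hypothetical).
# Only the load balancer and the prometheus node may reach the backend.
allow 10.1.1.10;   # balancer private IP (example)
allow 10.1.1.20;   # prometheus private IP (example)
deny  all;
```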
Updated by Peter Amstutz over 1 year ago
Reviewing 20610-installer-load-balancer @ c68b2408668c3bc2092bc7bc372a04154216c52c
In the documentation update, the load balancing section: in the example terraform.tfvars it would be helpful to include comments explaining why you need to set internal_service_hosts, private_ip and dns_aliases.
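For example, commented terraform.tfvars entries along these lines would do; the values and explanations here are illustrative, not copied from the branch:

```terraform
# Backends must not be reachable directly from the internet; list them as
# internal so their DNS records only resolve on the private network.
internal_service_hosts = ["controller1", "controller2"]

# Fixed private IPs keep the nginx upstream list and the PostgreSQL ACL
# stable across instance recreation.
private_ip = {
  controller1 = "10.1.1.11"
  controller2 = "10.1.1.12"
}

# Extra Route53 records: the public cluster hostname should point at the
# balancer, which proxies to the controller backends.
dns_aliases = {
  balancer = ["xarv1.example.com"]
}
```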
I'm wondering if it would make sense to move websocket, dispatcher, and keepbalance over to the workbench in the default configuration, then the example load balanced configuration would have fewer differences?
In "Rolling upgrades procedure" could we walk through the entire process? In other words, list all this out:
- set DISABLED_CONTROLLER=controller1
- redeploy the balancer
- deploy controller1
- set DISABLED_CONTROLLER=controller2
- redeploy the balancer
- deploy controller2
- clear DISABLED_CONTROLLER
- redeploy the balancer
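Sketched as a loop, the manual procedure above amounts to the following; the echoed "deploy" lines are stand-ins for editing local.params and running the real installer commands, which this sketch does not assume:

```shell
#!/bin/sh
# Hypothetical sketch of the manual rolling-upgrade loop described above.
# Each echo stands in for a local.params edit or an installer invocation.
set -eu

rolling_upgrade() {
  for node in "$@"; do
    echo "DISABLED_CONTROLLER=$node"  # drain this backend
    echo "deploy balancer"            # balancer stops routing to $node
    echo "deploy $node"               # upgrade the drained backend
  done
  echo "DISABLED_CONTROLLER="         # re-enable all backends
  echo "deploy balancer"
}

rolling_upgrade controller1 controller2
```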
Should also try to confirm somehow that redeploying the balancer doesn't interfere with active requests.
As a matter of salt template style, moving the substitutions to the top with {% set foobar = "__FOOBAR__" %} feels like a good change that we should do everywhere.
Setting NODELIST, ROLES and ENABLE_BALANCER -- this is generic logic, could this go into installer.sh loadconfig()?
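A sketch of what that generic loadconfig()-style logic might look like; the NODES layout (host name followed by a comma-separated role list) is an assumption for illustration, not the real local.params format:

```shell
#!/bin/sh
# Hypothetical sketch: derive NODELIST and a role->nodes lookup from the
# NODES configuration, as a shared loadconfig() helper could do.
set -eu

NODES="controller1:controller,websocket
controller2:controller
workbench:workbench,keepbalance"

# nodes_for_role ROLE -> comma-separated list of nodes carrying that role
nodes_for_role() {
  echo "$NODES" | while IFS=: read -r node roles; do
    case ",$roles," in *",$1,"*) echo "$node" ;; esac
  done | paste -sd, -
}

# every configured host, comma-separated
NODELIST=$(echo "$NODES" | cut -d: -f1 | paste -sd, -)
```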
- The mapping of roles to nodes. This is used to dinamically adjust
"dynamically" is misspelled.
In provision.sh, could we put the huge sed invocation over SOURCE_PILLARS_DIR and SOURCE_STATES_DIR into a function so that all the substitutions only need to be listed once?
This is a pre-existing style, but I see ${S_DIR}/top.sls and ${P_DIR}/top.sls repeated dozens of times. It would be an easy search and replace to change those to something like ${STATES_TOP} and ${PILLARS_TOP}, which would be more readable.
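For example, the sed invocation could be factored into a single function that every pillar and state file goes through; the variable names here are illustrative, not the full real substitution list:

```shell
#!/bin/sh
# Hypothetical sketch: one place that lists all template substitutions,
# applied uniformly to every pillar and state file.
set -eu

CLUSTER=xarv1
DOMAIN=example.com

substitute() {
  # $1: template file; substituted result goes to stdout
  sed -e "s#__CLUSTER__#${CLUSTER}#g" \
      -e "s#__DOMAIN__#${DOMAIN}#g" \
      "$1"
}

# usage sketch:
#   substitute "${SOURCE_PILLARS_DIR}/example.sls" > "${P_DIR}/example.sls"
```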
Do we really need separate api and controller roles? I think we always have them glued together? Can we drop "api" and just have "controller" that always sets up both? -- To put it another way, the only reason to have controller and api split is if you wanted them on separate nodes for scalability reasons, but we're specifically handling scalability by horizontally scaling controller+api as a unit, so there's no situation in the foreseeable future where splitting them would make more sense than adding more controller+api servers horizontally.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-08-02 sprint to Development 2023-08-16
Updated by Lucas Di Pentima over 1 year ago
Peter Amstutz wrote in #note-11:
As a matter of salt template style, moving the substitutions to the top with {% set foobar = "__FOOBAR__" %} feels like a good change that we should do everywhere.

I was thinking of having one pillar file with all the replacements, and having all the rest import them, as we do in nginx_*_configuration.sls files. This way we would really reduce the number of duplicated variable settings. Not sure if we should block this story on that, though.
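Sketched out, that shared-substitutions approach might look like this; the file name and variables are hypothetical:

```jinja
{# params.sls -- the only file the installer runs substitutions on #}
{% set cluster = "__CLUSTER__" %}
{% set domain = "__DOMAIN__" %}

{# every other .sls file then imports instead of redeclaring: #}
{% import "params.sls" as params %}
{# ...and uses params.cluster / params.domain #}
```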
Updated by Peter Amstutz over 1 year ago
Lucas Di Pentima wrote in #note-13:
Peter Amstutz wrote in #note-11:
As a matter of salt template style, moving the substitutions to the top with {% set foobar = "__FOOBAR__" %} feels like a good change that we should do everywhere.

I was thinking of having one pillar file with all the replacements, and all the rest imports them as we do in nginx_*_configuration.sls files. This way we would really reduce the number of duplicated variable settings. Not sure if we should block this story on that, though.
I had a similar thought, but I didn't include it in the comments because I didn't want this to turn into a general "refactor the installer" branch. So I agree we should do it but not block this story.
Updated by Lucas Di Pentima over 1 year ago
- File siege_report.png siege_report.png added
- File concurrent_reqs.png concurrent_reqs.png added
Updates at f1d43dafa
- Moves code from local.params to common.sh, and uses that to load all settings from provision.sh and installer.sh.
- Renames the ROLES map to ROLE2NODES to avoid clashing with another preexisting ROLES variable.
- Sets provision.sh to exit on error and unbound variables (set -eu). This makes failing deploys return earlier and more obviously.
- Deduplicates variable substitution code on pillars and custom state files.
- Search & replace on top.sls file references for readability.
- Unifies the api role with controller, as both roles should always be part of the same node.
- Fixes the balancer's nginx request queue config to depend on the number of backends currently active.
I've been doing tests with siege to check if there's some service interruption, and it doesn't seem to happen. Here's an example of a 100 concurrent request test against backends configured with that same amount of MaxRequests:
The test started with one controller disabled, then disabled the other one, and then left both backends working. siege's report was successful:
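For reference, a test like the one described can be driven with an invocation along these lines; the URL and counts are placeholders:

```shell
# 100 concurrent users, 10 repetitions each, against the balanced endpoint;
# run it, disable one backend via DISABLED_CONTROLLER, redeploy, and repeat.
siege -c 100 -r 10 "https://xarv1.example.com/arvados/v1/config"
```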
Pending
- Documentation fixes
Updated by Lucas Di Pentima over 1 year ago
Updates at d9152312a
- Documentation fixes
Updated by Peter Amstutz over 1 year ago
I'm wondering if it would make sense to move websocket, dispatcher, and keepbalance over to the workbench in the default configuration, then the example load balanced configuration would have fewer differences?
I think you overlooked this one (or didn't respond with a reason not to do it).
Updated by Lucas Di Pentima over 1 year ago
Peter Amstutz wrote in #note-17:
I'm wondering if it would make sense to move websocket, dispatcher, and keepbalance over to the workbench in the default configuration, then the example load balanced configuration would have fewer differences?
I think you overlooked this one (or didn't respond with a reason not to do it).
Sorry for not clarifying my inaction earlier. Here's why: I think simplifying changes for different uses could mean we're always tweaking things when we add new stuff. Testing the new default would've taken a lot of work, and I didn't see it as worth the simplification in the instructions.
But, if you still think it's the way to go, I won't argue. I do see the value in showing how to tinker with configurations in our doc.
Updated by Peter Amstutz over 1 year ago
Lucas Di Pentima wrote in #note-18:
Peter Amstutz wrote in #note-17:
I'm wondering if it would make sense to move websocket, dispatcher, and keepbalance over to the workbench in the default configuration, then the example load balanced configuration would have fewer differences?
I think you overlooked this one (or didn't respond with a reason not to do it).
Sorry for not clarifying my inaction earlier. Here's why: I think simplifying changes for different uses could mean we're always tweaking things when we add new stuff. Testing the new default would've taken a lot of work, and I didn't see it as worth the simplification in the instructions.
But, if you still think it's the way to go, I won't argue. I do see the value in showing how to tinker with configurations in our doc.
Well, the other reason is that bad behavior of both keep-balance and arvados-dispatch-cloud have interfered with controller in the past, which is another reason to isolate them.
Updated by Lucas Di Pentima over 1 year ago
Peter Amstutz wrote in #note-19:
Lucas Di Pentima wrote in #note-18:
Peter Amstutz wrote in #note-17:
I'm wondering if it would make sense to move websocket, dispatcher, and keepbalance over to the workbench in the default configuration, then the example load balanced configuration would have fewer differences?
I think you overlooked this one (or didn't respond with a reason not to do it).
Sorry for not clarifying my inaction earlier. Here's why: I think simplifying changes for different uses could mean we're always tweaking things when we add new stuff. Testing the new default would've taken a lot of work, and I didn't see it as worth the simplification in the instructions.
But, if you still think it's the way to go, I won't argue. I do see the value in showing how to tinker with configurations in our doc.
Well, the other reason is that bad behavior of both keep-balance and arvados-dispatch-cloud have interfered with controller in the past, which is another reason to isolate them.
Ok, will get to that. Thanks!
Updated by Lucas Di Pentima over 1 year ago
The arvados-ws & PostgreSQL permission issue that I encountered seems to be related to a reported bug in the PostgreSQL formula: we're getting postgresql-15 installed alongside the requested version postgresql-12.
So, if v15 starts first, it will take port 5432, won't have the correct conf files, and won't allow connections. This doesn't always happen.
https://github.com/saltstack-formulas/postgres-formula/issues/324
I think the correct thing to do is to fix the issue on our formula fork, at least, instead of writing a workaround in our installer salt code. Do you agree?
Updated by Peter Amstutz over 1 year ago
Lucas Di Pentima wrote in #note-21:
The arvados-ws & PostgreSQL permission issue that I encountered seems to be related to a reported bug in the PostgreSQL formula: we're getting postgresql-15 installed alongside the requested version postgresql-12.
So, if v15 starts first, it will take port 5432, won't have the correct conf files, and won't allow connections. This doesn't always happen.
https://github.com/saltstack-formulas/postgres-formula/issues/324
I think the correct thing to do is to fix the issue on our formula fork at least, instead of writing some workaround on our installer salt code. Do you agree?
Yes.
Updated by Lucas Di Pentima over 1 year ago
Update on the above comment: Although a 2nd PG is getting installed, it really wasn't causing the permission issue; it was just a configuration error on my part. The fix is needed, but it's not urgent.
Updates at ea3ab8451
- Changes the default installer cluster shape so that the controller node only gets the database and controller roles, and moves keepbalance, websocket and dispatcher to the workbench node.
- Fixes the PG config file template to include the websocket IP in its ACL instead of the dispatcher's.
- Updates the documentation, simplifying the load-balancer instructions accordingly, and also fixes some other issues related to a previous default cluster simplification change.
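In pg_hba.conf terms, the resulting ACL would contain entries along these lines; addresses, database and user names are illustrative:

```
# hypothetical pg_hba.conf entries generated from the node IPs
host  arvados  arvados  10.1.1.11/32  md5   # controller1
host  arvados  arvados  10.1.1.12/32  md5   # controller2
host  arvados  arvados  10.1.1.30/32  md5   # websocket (workbench node)
```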
Updated by Lucas Di Pentima over 1 year ago
I've migrated the scale cluster with the current installer version: 2 load-balanced controller nodes, the rest is the same.
It was a pretty painless experience: just destroyed the controller node and recreated it with a smaller filesystem, created the new controller{1,2} nodes via terraform, and ran the deploy.
Had to merge a last-minute commit that fixes the instance profile assignment for the dispatcher, now that it's on the workbench node.
One interesting detail: when a cloud instance gets a new instance profile, it needs to be restarted for the cloud to notice the change.
Updated by Lucas Di Pentima over 1 year ago
- Status changed from In Progress to Resolved
Updated by Brett Smith over 1 year ago
- Related to Bug #20934: test-provision-centos7 is failing added