Idea #20610


Installer supports deploying a load-balancer to horizontally scale the controller node

Added by Lucas Di Pentima 11 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assigned To: Lucas Di Pentima
Category: Deployment
Target version: Development 2023-08-16
Start date: 07/28/2023
Due date:
Story points: -
Release relationship: Auto

Description

As a first implementation we'll use nginx's load balancing features as it's already part of our stack and we know it pretty well.

Terraform code updates allow the optional creation of the load-balancer node and update Route53 accordingly.
When a balancer is requested, the configuration takes the number of controller nodes to deploy (> 1) and also deploys a dedicated database node.

Installer script & Salt code also accept a balancer role that will install & configure nginx, node-exporter and not much else on the node, pointing nginx to the controller nodes.
The installer script may need a new sub-command to do rolling upgrades by disabling traffic to & upgrading one controller node at a time.
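
A minimal sketch of what the balancer role's nginx configuration could look like, assuming two controller backends (hostnames, ports and certificate paths are illustrative, not the actual installer output):

# Render a hypothetical nginx load-balancer config the way the installer's
# shell templating might; every name below is a placeholder.
cat > /etc/nginx/conf.d/arvados-balancer.conf <<'EOF'
upstream controllers {
  server controller1.xxxxx.example.com:443;
  server controller2.xxxxx.example.com:443;
}
server {
  listen 443 ssl;
  server_name xxxxx.example.com;
  ssl_certificate     /etc/ssl/certs/balancer.pem;
  ssl_certificate_key /etc/ssl/private/balancer.key;
  location / {
    proxy_pass https://controllers;
  }
}
EOF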


Files

siege_report.png (89.8 KB) Lucas Di Pentima, 08/03/2023 08:11 PM
concurrent_reqs.png (86.8 KB) Lucas Di Pentima, 08/03/2023 08:11 PM

Subtasks 1 (0 open, 1 closed)

Task #20676: Review 20610-installer-load-balancer (Resolved, Peter Amstutz, 07/28/2023)

Related issues

Related to Arvados - Feature #20733: installer supports automated rolling upgrade of load-balanced nodes (New)
Related to Arvados - Bug #20934: test-provision-centos7 is failing (Closed, Brett Smith, 09/08/2023)
#1

Updated by Peter Amstutz 11 months ago

  • Target version changed from Future to Development 2023-06-21 sprint
#2

Updated by Peter Amstutz 11 months ago

  • Assigned To set to Lucas Di Pentima
#3

Updated by Lucas Di Pentima 11 months ago

  • Status changed from New to In Progress
#4

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2023-06-21 sprint to Development 2023-07-05 sprint
#5

Updated by Lucas Di Pentima 10 months ago

Updates at 4ac2b36 - branch: 20610-installer-load-balancer

Changes

  • Adds a balancer role with its nginx & letsencrypt confs.
  • Tweaks controller nginx config when a balancer is enabled.
  • Tweaks the PG config to add the IP addresses of multiple controller nodes (and the dispatcher) to its ACL.
  • Fixes Salt bootstrapping, as version 3004 is no longer considered stable.
  • Fixes the ssh invocation to avoid forwarding environment variables that might confuse salt commands.

Pending

  • Set up firewall rules on controller nodes so that they accept HTTP traffic only from the balancer.
  • Add feature to temporarily disable controller nodes for rolling updates.
  • Tweak prometheus/grafana configs to get controller nodes data.
#6

Updated by Peter Amstutz 10 months ago

  • Target version changed from Development 2023-07-05 sprint to Development 2023-07-19 sprint
#7

Updated by Peter Amstutz 10 months ago

From discussion: we don't need to fully automate the rolling upgrade, but we do need to document how to do a manual rolling upgrade.

#8

Updated by Peter Amstutz 10 months ago

  • Related to Feature #20733: installer supports automated rolling upgrade of load-balanced nodes added
#9

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2023-07-19 sprint to Development 2023-08-02 sprint
#10

Updated by Lucas Di Pentima 9 months ago

Updates at c68b24086
Test run: developer-run-tests-doc-and-sdk-R: #1903

  • Changes how arvados.sls needs to be configured by getting all values from local.params.
  • Adds a ROLES variable that holds the role->nodes map derived from the node->roles configuration in NODES. This allows us to stay DRY and auto-configure things without explicit user intervention (see the sketch after this list).
  • Adds a NODELIST variable and fixes the prometheus configuration to automatically set up every node's "node exporter" scrape config, instead of a hard-coded config that requires manual intervention when deploying a "non-standard" cluster.
  • Further fixes prometheus config to include controller backend nodes on load-balanced scenarios.
  • Restricts HTTP access to controller backends by adding nginx ACLs allowing only the load balancer and prometheus nodes.
  • Adds a DISABLED_CONTROLLER variable that the user can populate with a controller backend node name so that it gets kicked out of the backend pool, for rolling upgrades.
  • Adds & fixes documentation reflecting all the above changes.
#11

Updated by Peter Amstutz 9 months ago

Reviewing 20610-installer-load-balancer @ c68b2408668c3bc2092bc7bc372a04154216c52c

In the documentation update's load balancing section, it would be helpful to include comments in the example terraform.tfvars explaining why you need to set internal_service_hosts, private_ip and dns_aliases.

I'm wondering if it would make sense to move websocket, dispatcher, and keepbalance over to the workbench in the default configuration, then the example load balanced configuration would have fewer differences?

In "Rolling upgrades procedure" could we walk through the entire process? In other words, list all this out:

  1. set DISABLED_CONTROLLER=controller1
  2. redeploy the balancer
  3. deploy controller1
  4. set DISABLED_CONTROLLER=controller2
  5. redeploy the balancer
  6. deploy controller2
  7. clear DISABLED_CONTROLLER
  8. redeploy the balancer
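
A shell sketch of that same procedure, assuming DISABLED_CONTROLLER lives in local.params and that "installer.sh deploy <host>" redeploys a single node (hostnames and exact CLI are illustrative):

# Drain, upgrade and re-enable each controller backend in turn.
for host in controller1 controller2; do
  sed -i "s/^DISABLED_CONTROLLER=.*/DISABLED_CONTROLLER=\"${host}\"/" local.params
  ./installer.sh deploy balancer.xxxxx.example.com   # take $host out of the pool
  ./installer.sh deploy "${host}.xxxxx.example.com"  # upgrade it while drained
done
# Put every backend back in the pool.
sed -i 's/^DISABLED_CONTROLLER=.*/DISABLED_CONTROLLER=""/' local.params
./installer.sh deploy balancer.xxxxx.example.com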

Should also try to confirm somehow that redeploying the balancer doesn't interfere with active requests.

As a matter of salt template style, moving the substitutions to the top with {% set foobar = "__FOOBAR__" %} feels like a good change that we should do everywhere.

Setting NODELIST, ROLES and ENABLE_BALANCER -- this is generic logic, could this go into installer.sh loadconfig() ?

  1. The mapping of roles to nodes. This is used to dinamically adjust

"dynamically" is misspelled.

In provision.sh, could we put the huge sed invocation over SOURCE_PILLARS_DIR and SOURCE_STATES_DIR into a function so that all the substitutions only need to be listed once?
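
For instance, a hypothetical helper along these lines, so the __TOKEN__ list is written only once (the token names are illustrative):

substitute() {
  # Apply every template substitution to one file and print the result.
  sed -e "s#__CLUSTER__#${CLUSTER}#g" \
      -e "s#__DOMAIN__#${DOMAIN}#g" \
      -e "s#__CONTROLLER_NODES__#${CONTROLLER_NODES}#g" \
      "$1"
}
for f in "${SOURCE_PILLARS_DIR}"/*.sls; do
  substitute "$f" > "${P_DIR}/$(basename "$f")"
done
for f in "${SOURCE_STATES_DIR}"/*.sls; do
  substitute "$f" > "${S_DIR}/$(basename "$f")"
done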

This is a pre-existing style, but I see ${S_DIR}/top.sls and ${P_DIR}/top.sls repeated dozens of times. It would be an easy search-and-replace to change those to something like ${STATES_TOP} and ${PILLARS_TOP}, which would be more readable.

Do we really need separate api and controller roles? I think we always have them glued together? Can we drop "api" and just have "controller" always set up both? To put it another way, the only reason to keep controller and api split is if you wanted them on separate nodes for scalability reasons, but we're specifically handling scalability by horizontally scaling controller+api as a unit, so there's no foreseeable situation where splitting them would make more sense than adding more controller+api servers horizontally.

#12

Updated by Peter Amstutz 9 months ago

  • Target version changed from Development 2023-08-02 sprint to Development 2023-08-16
#13

Updated by Lucas Di Pentima 9 months ago

Peter Amstutz wrote in #note-11:

As a matter of salt template style, moving the substitutions to the top with {% set foobar = "__FOOBAR__" %} feels like a good change that we should do everywhere.

I was thinking of having one pillar file with all the replacements, and all the rest imports them as we do in nginx_*_configuration.sls files. This way we would really reduce the number of duplicated variable settings. Not sure if we should block this story on that, though.

#14

Updated by Peter Amstutz 9 months ago

Lucas Di Pentima wrote in #note-13:

Peter Amstutz wrote in #note-11:

As a matter of salt template style, moving the substitutions to the top with {% set foobar = "__FOOBAR__" %} feels like a good change that we should do everywhere.

I was thinking of having one pillar file with all the replacements, and all the rest imports them as we do in nginx_*_configuration.sls files. This way we would really reduce the number of duplicated variable settings. Not sure if we should block this story on that, though.

I had a similar thought, but I didn't include it in the comments because I didn't want this to turn into a general "refactor the installer" branch. So I agree we should do it but not block this story.

#15

Updated by Lucas Di Pentima 9 months ago

Updates at f1d43dafa

  • Moves code from local.params to common.sh, and uses that to load all settings from provision.sh and installer.sh.
  • Renames the ROLES map to ROLE2NODES to avoid clashing with another preexisting ROLES variable.
  • Sets provision.sh to exit on error and on unbound variables (set -eu), so failing deploys abort earlier and more obviously.
  • Deduplicates the variable substitution code for pillars and custom state files.
  • Search & replace on top.sls file references for readability.
  • Unifies the api role with controller, as both roles should always be on the same node.
  • Fixes the balancer's nginx request queue config to depend on the number of backends currently active.

I've been running tests with siege to check whether there's any service interruption, and it doesn't seem to happen. Here's an example of a 100-concurrent-request test against backends configured with MaxRequests set to that same value:

The test started with one controller disabled, then the other one was disabled, and finally both backends were left working. siege's report was successful (see siege_report.png and concurrent_reqs.png above):
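
For reference, the general shape of such a siege run (the URL and duration are illustrative; -c sets the concurrency, -t the test length):

# 100 concurrent simulated users for 5 minutes against the cluster endpoint.
siege -c 100 -t 5M https://xxxxx.example.com/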

Pending

  • Documentation fixes
#16

Updated by Lucas Di Pentima 9 months ago

Updates at d9152312a

  • Documentation fixes
#17

Updated by Peter Amstutz 9 months ago

I'm wondering if it would make sense to move websocket, dispatcher, and keepbalance over to the workbench in the default configuration, then the example load balanced configuration would have fewer differences?

I think you overlooked this one (or didn't respond with a reason not to do it).

#18

Updated by Lucas Di Pentima 9 months ago

Peter Amstutz wrote in #note-17:

I'm wondering if it would make sense to move websocket, dispatcher, and keepbalance over to the workbench in the default configuration, then the example load balanced configuration would have fewer differences?

I think you overlooked this one (or didn't respond with a reason not to do it).

Sorry for not clarifying my inaction earlier. Here's why: I think simplifying changes for different uses could mean we're always tweaking things when we add new stuff. Testing the new default would've taken a lot of work, and I didn't see it as worth the simplification in the instructions.

But, if you still think it's the way to go, I won't argue. I do see the value in showing how to tinker with configurations in our doc.

#19

Updated by Peter Amstutz 9 months ago

Lucas Di Pentima wrote in #note-18:

Peter Amstutz wrote in #note-17:

I'm wondering if it would make sense to move websocket, dispatcher, and keepbalance over to the workbench in the default configuration, then the example load balanced configuration would have fewer differences?

I think you overlooked this one (or didn't respond with a reason not to do it).

Sorry for not clarifying my inaction earlier. Here's why: I think simplifying changes for different uses could mean we're always tweaking things when we add new stuff. Testing the new default would've taken a lot of work, and I didn't see it as worth the simplification in the instructions.

But, if you still think it's the way to go, I won't argue. I do see the value in showing how to tinker with configurations in our doc.

Well, the other reason is that bad behavior of both keep-balance and arvados-dispatch-cloud have interfered with controller in the past, which is another reason to isolate them.

#20

Updated by Lucas Di Pentima 9 months ago

Peter Amstutz wrote in #note-19:

Lucas Di Pentima wrote in #note-18:

Peter Amstutz wrote in #note-17:

I'm wondering if it would make sense to move websocket, dispatcher, and keepbalance over to the workbench in the default configuration, then the example load balanced configuration would have fewer differences?

I think you overlooked this one (or didn't respond with a reason not to do it).

Sorry for not clarifying my inaction earlier. Here's why: I think simplifying changes for different uses could mean we're always tweaking things when we add new stuff. Testing the new default would've taken a lot of work, and I didn't see it as worth the simplification in the instructions.

But, if you still think it's the way to go, I won't argue. I do see the value in showing how to tinker with configurations in our doc.

Well, the other reason is that bad behavior of both keep-balance and arvados-dispatch-cloud have interfered with controller in the past, which is another reason to isolate them.

Ok, will get to that. Thanks!

#21

Updated by Lucas Di Pentima 9 months ago

The arvados-ws & PostgreSQL permission issue that I encountered seems to be related to a reported bug in the PostgreSQL formula: we're getting the postgresql-15 installed along the requested version postgresql-12.
So, if v15 starts first, it will take the 5432 port and won't have the correct conf files and won't allow connections. This doesn't always happen.

https://github.com/saltstack-formulas/postgres-formula/issues/324

I think the correct thing to do is to fix the issue in our formula fork, at least, instead of writing a workaround in our installer salt code. Do you agree?
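
A quick way to check whether a node has hit this condition (package query shown for rpm-based systems; use dpkg -l on Debian/Ubuntu):

# List every installed PostgreSQL server package.
rpm -qa | grep -i postgresql
# See which postgres binary is actually bound to port 5432.
ss -tlnp | grep ':5432'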

#22

Updated by Peter Amstutz 9 months ago

Lucas Di Pentima wrote in #note-21:

The arvados-ws & PostgreSQL permission issue that I encountered seems to be related to a reported bug in the PostgreSQL formula: we're getting the postgresql-15 installed along the requested version postgresql-12.
So, if v15 starts first, it will take the 5432 port and won't have the correct conf files and won't allow connections. This doesn't always happen.

https://github.com/saltstack-formulas/postgres-formula/issues/324

I think the correct thing to do is to fix the issue in our formula fork, at least, instead of writing a workaround in our installer salt code. Do you agree?

Yes.

#23

Updated by Lucas Di Pentima 9 months ago

Update on the above comment: although a second PG version is getting installed, it really wasn't causing the permission issue; that was just a configuration error on my part. The formula fix is still needed, but it's not urgent.

Updates at ea3ab8451

  • Changes the default installer cluster shape so that the controller node only runs the database and controller roles, and moves keepbalance, websocket and dispatcher to the workbench node.
  • Fixes the PG config file template to include the websocket IP in its ACL instead of the dispatcher's (see the sketch after this list).
  • Updates the documentation, simplifying the load-balancer instructions accordingly, and also fixes some other issues related to a previous default cluster simplification change.
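
For illustration, the kind of pg_hba.conf entries that template renders, assuming two controller backends plus a websocket node (the file path, IPs and database/user names are all placeholders):

# Append hypothetical ACL entries for the load-balanced cluster shape.
cat >> /etc/postgresql/12/main/pg_hba.conf <<'EOF'
host  arvados  arvados  10.1.1.11/32  md5
host  arvados  arvados  10.1.1.12/32  md5
host  arvados  arvados  10.1.1.20/32  md5
EOF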
#24

Updated by Lucas Di Pentima 9 months ago

I've migrated the scale cluster with the current installer version: 2 load-balanced controller nodes, the rest is the same.

It was a pretty painless experience: I just destroyed the controller node and recreated it with a smaller filesystem, created the new controller{1,2} nodes via terraform, and ran the deploy.

Had to merge a last-minute commit that fixes the instance profile assignment for the dispatcher, now that it's on the workbench node.
One interesting detail: when a cloud instance gets a new instance profile, it needs to be restarted for the cloud to notice the change.

#25

Updated by Lucas Di Pentima 9 months ago

  • Status changed from In Progress to Resolved
#26

Updated by Peter Amstutz 8 months ago

  • Release set to 66
#27

Updated by Brett Smith 8 months ago

  • Related to Bug #20934: test-provision-centos7 is failing added