Feature #21832


Installer's Terraform code allows setting up an RDS instance on AWS

Added by Lucas Di Pentima about 2 months ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Deployment
Story points: -

Description

Instead of fully managing a database service node, allow the admin to create an RDS instance on AWS.

Research on how to get alerting on a DB almost full: https://repost.aws/knowledge-center/storage-full-rds-cloudwatch-alarm
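The approach in that article boils down to a CloudWatch alarm on the RDS FreeStorageSpace metric. A minimal Terraform sketch of that idea (resource names, the threshold, and the SNS topic are illustrative assumptions, not part of this branch):

```hcl
# Hypothetical sketch: alarm when RDS free storage drops below ~2 GiB,
# following the linked AWS article. Names and thresholds are illustrative.
resource "aws_cloudwatch_metric_alarm" "rds_storage_low" {
  alarm_name          = "rds-free-storage-low"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 1
  metric_name         = "FreeStorageSpace"
  namespace           = "AWS/RDS"
  period              = 300
  statistic           = "Average"
  threshold           = 2 * 1024 * 1024 * 1024 # bytes

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.postgresql.identifier
  }

  # Assumes an SNS topic for notifications already exists.
  alarm_actions = [aws_sns_topic.alerts.arn]
}
```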


Subtasks 1 (0 open, 1 closed)

Task #21867: Review 21832-installer-rds-support (Resolved, Lucas Di Pentima, 06/14/2024)

Related issues

Related to Arvados - Feature #21896: Installer supports RDS alerting (New)
Related to Arvados - Feature #21897: Installer supports Multi Availability Zone RDS (New)
Actions #1

Updated by Peter Amstutz about 2 months ago

  • Target version changed from Development 2024-06-05 sprint to Development 2024-06-19 sprint
Actions #2

Updated by Lucas Di Pentima about 2 months ago

  • Status changed from New to In Progress
Actions #3

Updated by Lucas Di Pentima about 2 months ago

  • Description updated (diff)
Actions #4

Updated by Lucas Di Pentima about 1 month ago

21832-installer-rds-support @ 8205cd5

test-provision: #894

  • All agreed upon points are implemented / addressed.
    • Everything except the CloudWatch alerts.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • Not yet, haven't talked with the team if this should be an official installer feature.
  • Code is tested and passing, both automated and manual, what manual testing was done is described
    • Manual tests on the multi-node scenario performed on the sandbox account.
  • Documentation has been updated.
    • New Terraform input params documented in terraform.tfvars, which gets picked up by the docs building scripts.
  • Behaves appropriately at the intended scale (describe intended scale).
    • No change in scale.
  • Considered backwards and forwards compatibility issues between client and server.
    • Yes: This feature is compatible with the already existing DATABASE_EXTERNAL_SERVICE_HOST_OR_IP parameter at local.params file.
  • Follows our coding standards and GUI style guidelines.
    • N/A

This branch adds a use_rds boolean parameter in both the vpc/ and services/ Terraform directories, each one controlling a related part of the cloud infrastructure needed for RDS to work. There are also other optional parameters to further customize the deployment, but enabling both use_rds params is enough to deploy an RDS instance with a randomly generated password.

Terraform exposes the database's name, address, username, and password as outputs so that they can be used in the local.params and local.params.secrets files.

When DATABASE_EXTERNAL_SERVICE_HOST_OR_IP is set, a new state is installed on the controller node that makes sure the trigram (pg_trgm) extension is enabled. It also skips installing a local PostgreSQL server if there's a node set with the database role.
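Tying the pieces above together, a rough sketch of the kind of Terraform this implies (resource and output names here are assumptions, not necessarily the identifiers used in the branch):

```hcl
# Illustrative sketch only: names are assumptions, not the branch's code.
resource "random_password" "rds" {
  length  = 32
  special = false
}

resource "aws_db_instance" "postgresql" {
  identifier            = "arvados-db"
  engine                = "postgres"
  instance_class        = var.rds_instance_type
  allocated_storage     = 20
  max_allocated_storage = var.rds_max_allocated_storage
  username              = var.rds_username
  # Fall back to the randomly generated password when none is provided.
  password              = coalesce(var.rds_password, random_password.rds.result)
  db_subnet_group_name  = aws_db_subnet_group.rds.name
}

# Outputs feed the values into local.params / local.params.secrets.
output "database_address" {
  value = aws_db_instance.postgresql.address
}

output "database_password" {
  value     = aws_db_instance.postgresql.password
  sensitive = true
}
```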

Actions #5

Updated by Brett Smith about 1 month ago

I have two questions mostly because of my unfamiliarity with Salt.

One, am I following right that our Terraform code chooses the AZ for everything? And it picks the first AZ (the -a one) for most resources, and then the second (-b) for RDS? This is not a parameter the admin can control at all?

Second, was there any particular rationale for setting the initial size at 20GB? My information might be out of date, but last I checked, I believe RDS instances grow at a static rate of 1-2GB when needed, and there's a little performance hit every time this happens. That's probably fine for an instance that's truly starting from scratch, but might get annoying if you plan to arv-copy a bunch of data over from somewhere else. I wonder if it would make more sense to default the initial allocation as some percentage of the admin's total allocation? Say 20% or something?

In general I wonder if we should give the admin more control over the RDS creation parameters—in the existing deploy where PostgreSQL lives on the controller node, they implicitly have a lot of control just by choosing the controller instance type. But I'm totally fine with that being a follow-up story/ies.

  • All agreed upon points are implemented / addressed.
    • Everything except the CloudWatch alerts.
  • Anything not implemented (discovered or discussed during work) has a follow-up story.
    • Not yet, haven't talked with the team if this should be an official installer feature.

If we decide it should be then I think the CloudWatch alerts need to have a follow-up story too.

Thanks.

Actions #6

Updated by Lucas Di Pentima about 1 month ago

Brett Smith wrote in #note-5:

One, am I following right that our Terraform code chooses the AZ for everything? And it picks the first AZ (the -a one) for most resources, and then the second (-b) for RDS? This is not a parameter the admin can control at all?

The Terraform code never allowed specifying a custom AZ; it just used the first one in the configured region. Now we also use the second one, but only because the RDS subnet group requires more than one AZ (even if the RDS instance itself is configured as single-AZ, like in our case -- see the 2867dc0 commit message).

This ticket might be under-specified: I can add the possibility of configuring a custom AZ, but I wanted to clarify that we never did; it's not a regression of this branch.
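For context, the two-AZ requirement shows up in the subnet group: a DB subnet group must span at least two Availability Zones even for a single-AZ instance. A minimal sketch (names and CIDR math are illustrative, not the branch's code):

```hcl
# Hypothetical sketch of a two-AZ DB subnet group for a single-AZ RDS instance.
data "aws_availability_zones" "available" {}

resource "aws_subnet" "rds" {
  count             = 2
  vpc_id            = aws_vpc.arvados.id
  cidr_block        = cidrsubnet(aws_vpc.arvados.cidr_block, 8, 10 + count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

resource "aws_db_subnet_group" "rds" {
  name       = "rds-subnet-group"
  subnet_ids = aws_subnet.rds[*].id
}
```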

Second, was there any particular rationale for setting the initial size at 20GB? My information might be out of date, but last I checked, I belive RDS instances grow at a static rate of 1-2GB when needed, and there's a little performance hit every time this happens. That's probably fine for an instance that's truly starting from scratch, but might get annoying if you plan to arv-copy a bunch of data over from somewhere else. I wonder if it would make more sense to default the initial allocation as some percentage of the admin's total allocation? Say 20% or something?

Can you elaborate on what "admin's total allocation" is? If I understood the documentation correctly, the storage auto-scales in increments of at least 10GiB (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIOPS.StorageTypes.html#USER_PIOPS.Autoscaling), and I'm also reading that for the autoscaling to happen, at least 6 hours have to pass since the last autoscaling event, so maybe 20 GiB is too small. Do you agree?

In general I wonder if we should give the admin more control over the RDS creation parameters—in the existing deploy where PostgreSQL lives on the controller node, they implicitly have a lot of control just by choosing the controller instance type. But I'm totally fine with that being a follow-up story/ies.

These are the customizable parameters the admin can currently tweak:

# Provide custom values if needed.
# rds_username = "" 
# rds_password = "" 
# rds_instance_type = "db.m5.xlarge" 
# rds_max_allocated_storage = 1000

Note that you can specify the instance type, and max db size. We can also add a customizable allocated_storage variable, and change its default from 20 to 500 GiB or even 1 TiB as this multi-node variant of the installer is already designed for production clusters. WDYT?

  • Not yet, haven't talked with the team if this should be an official installer feature.

If we decide it should be then I think the CloudWatch alerts need to have a follow-up story too.

Yeah, we have historically tried to avoid locking ourselves into AWS too much. That's why I didn't go ahead with CloudWatch from the start, as we might want to instead use something like this: https://github.com/qonto/prometheus-rds-exporter

Actions #7

Updated by Brett Smith about 1 month ago

Lucas Di Pentima wrote in #note-6:

This ticket might be under-specified: I can add the possibility of configuring a custom AZ, but wanted to clarify that we never did, it's not a regression of this branch.

I understand, and that's fine, I agree it doesn't need to be in scope for this ticket. I started down this train of thought because I asked myself "how do we make sure the secondary RDS AZ is different from the first?" Having it all hardcoded certainly does the job, and that's fine.

Can you elaborate on what "admin's total allocation" is?

I meant the value they set for rds_max_allocated_storage.

If I understood the documentation correctly, the storage auto scales in incerements of at least 10GiB (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIOPS.StorageTypes.html#USER_PIOPS.Autoscaling), and also I'm reading that for the autoscale to happen, it has to pass at least 6 hours from the last autoscale event, so maybe 20 GiB is too small, do you agree?

Yeah, that's what I mean. I see the docs say:

The additional storage is in increments of whichever of the following is greater:
  • 10 GiB
  • 10 percent of currently allocated storage
  • Predicted storage growth exceeding the current allocated storage size in the next 7 hours based on the FreeStorageSpace metrics from the past hour. For more information on metrics, see Monitoring with Amazon CloudWatch.

With our default settings, that second bullet means that it will grow by at least 30GiB each time. On the one hand, that's good, it means performance hits won't happen as often as I thought. On the other hand, it makes me think that starting with less storage than that is probably not a good default. So yeah, I think having allocated_storage default to 20% of rds_max_allocated_storage, and maybe more, would be a nice improvement.

Note that you can specify the instance type, and max db size. We can also add a customizable allocated_storage variable, and change its default from 20 to 500 GiB or even 1 TiB as this multi-node variant of the installer is already designed for production clusters. WDYT?

I can imagine admins might be interested in controlling that initial allocated_storage setting as you said, as well as potentially the values of engine_version (they might be standardized on a particular version for all deployments), skip_final_snapshot, backup_retention_period, and multi_az if they would like more redundancy. Again, I am fine with any or all of these being follow-up stories.

Thanks.

Actions #8

Updated by Brett Smith about 1 month ago

Brett Smith wrote in #note-7:

With our default settings, that second bullet means that it will grow by at least 30GiB each time.

I see now I misread the docs and it's 10% of the current allocated storage, so 10GiB would be greater than our default (2GiB). Sorry, I'm very tired. I do still think 20+% of rds_max_allocated_storage would be a nicer default though.
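The percentage-based default discussed here could be expressed directly in Terraform; a hypothetical sketch (variable names are assumptions) using the built-in max() and ceil() functions:

```hcl
# Hypothetical: default the initial allocation to ~20% of the maximum,
# with a floor at RDS's 20 GiB minimum. Variable names are illustrative.
locals {
  rds_initial_storage = max(20, ceil(var.rds_max_allocated_storage * 0.2))
}
```

With rds_max_allocated_storage = 1000, this would start the instance at 200 GiB instead of 20.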

Actions #9

Updated by Lucas Di Pentima about 1 month ago

Rebased and added updates at 4112f0d830 - test-provision: #899

Adds customization for:

  • Allocated storage size. Defaults to 60 GiB (while keeping the max storage default at 300 GiB).
  • Backup retention period, with an option to disable it entirely.
  • PostgreSQL version. Defaults to 15 (the same as the Salt installer).
  • Performing a final backup before deletion. Defaults to true.

Current set of RDS related knobs:

# Provide custom values if needed.
# rds_username = "" 
# rds_password = "" 
# rds_instance_type = "db.m5.xlarge" 
# rds_postgresql_version = "16.3" 
# rds_allocated_storage = 200
# rds_max_allocated_storage = 1000
# rds_backup_retention_period = 30
# rds_backup_before_deletion = false
# rds_final_backup_name = "" 

I left the Multi-AZ customization for another story because it may require additional work on other services.

Actions #10

Updated by Lucas Di Pentima about 1 month ago

  • Related to Feature #21896: Installer supports RDS alerting added
Actions #12

Updated by Lucas Di Pentima about 1 month ago

  • Related to Feature #21897: Installer supports Multi Availability Zone RDS added
Actions #13

Updated by Brett Smith about 1 month ago

Lucas Di Pentima wrote in #note-9:

Rebased and added updates at 4112f0d830 - test-provision: #899

Looks good to me, thanks.

Actions #14

Updated by Lucas Di Pentima about 1 month ago

  • Status changed from In Progress to Resolved