Project

General

Profile

Actions

Bug #19215

closed

improve multi-node installer experience

Added by Peter Amstutz about 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Category:
Deployment
Target version:
Story points:
-
Release relationship:
Auto

Description

Finally got around to attempting to use the installer and it wasn't a great experience.

Improvements:

  1. The first step using the installer is to run a script and supply the desired configuration template.
    1. This will create a git repository, copy the starting files, and check them in
  2. Anything that needs attention has a FIXME with a long comment
  3. The provision script checks for FIXMEs
  4. The provision script checks that the config is checked in
  5. The provision script runs salt in masterless mode to deploy config changes

Also: ideally we would eliminate local.params and sed and use salt only, but according to Javier you can't use pillarstack (which allows us to declare top level variables and then refer to them later) in masterless mode. We should investigate if that's really true but that's the reasoning behind the current approach.


Subtasks 5 (0 open5 closed)

Task #19221: Review 18870-installerResolvedLucas Di Pentima06/28/2022Actions
Task #19233: Update salt install docsResolvedPeter Amstutz06/28/2022Actions
Task #19402: Review 19215-installerResolvedPeter Amstutz06/28/2022Actions
Task #19718: Test Terraform + install processResolvedLucas Di Pentima06/28/2022Actions
Task #19730: Review of the Terraform stuffResolvedPeter Amstutz06/28/2022Actions

Related issues

Related to Arvados Epics - Idea #18685: Synchronize configuration on multi-node clusterNewActions
Related to Arvados - Bug #19214: Mention updating /etc/hostsResolvedPeter Amstutz12/06/2022Actions
Related to Arvados - Idea #19364: Document use of diagnostics & health check to check running versions, config file matching, & overall cluster functioningResolvedTom Clegg10/12/2022Actions
Actions #1

Updated by Peter Amstutz about 2 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz about 2 years ago

  • Status changed from In Progress to New
  • Description updated (diff)
Actions #3

Updated by Peter Amstutz about 2 years ago

  • Target version set to 2022-07-06
  • Category set to Deployment
  • Subject changed from installer experience to improve multi-node installer experience
Actions #4

Updated by Peter Amstutz about 2 years ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz about 2 years ago

18870-installer @ 2cc1772745c19e438778c5832daf75f59112b9ab

  • Adds arvados/tools/salt-install/installer.sh

Enables you to assign roles to nodes and then copy the configuration and run the provision script on each node with a single 'deploy' command.

The 'initialize' command creates a git repository and copies the selected files into it.

The 'deploy' command pushes the git repository containing the configuration to each node and then runs the provision script from it.

The 'diagnostics' command runs 'arvados-client diagnostics'.

This branch also contains a small improvement to arvados-login-sync to ensure that errors from running 'usermod' and 'gpasswd' get logged, and a fix to the shell node 'cron' configuration that runs login-sync.

developer-run-tests: #3204

test-provision: #401

Actions #6

Updated by Lucas Di Pentima about 2 years ago

(Please take into account that I almost don't know anything about the salt installer, as I did a couple of reviews on its infancy and then didn't check on it ever since.)

  • It seems that the script needs some documentation.
  • I'm not sure which kind of problems is trying to solve and why those cannot be solved in the provision script itself. Does it tries to simplify what's explained at https://doc.arvados.org/v2.4/install/salt-multi-host.html ?
  • Do we need to do some automated testing on this new script? Do you have a recommended way of manually testing it?
  • File provision.sh:L261: The error message mentions ${CONFIG_FILE} but the grep was done recursively over ${CONFIG_DIR} so it'll be confusing when presented to the user.
  • The extensive use of globals on installer.sh makes following it somewhat tricky for me.
  • The script seems to rely on being run on a Debian-based distro, because it tries to install arvados-client with apt-get.
Actions #7

Updated by Peter Amstutz about 2 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Peter Amstutz about 2 years ago

Lucas Di Pentima wrote in #note-6:

(Please take into account that I almost don't know anything about the salt installer, as I did a couple of reviews on its infancy and then didn't check on it ever since.)

  • It seems that the script needs some documentation.

Agreed. I can share the install guide with you, and as I mentioned I need to update our install docs to describe it. And the code itself needs some comments.

The provision script only operates on the current node. The provision script installs salt, configures it for masterless mode, sets up the salt pillar, and then runs salt which performs the actual installation/configuration using arvados-formula.

In the multi-node install, the section "Run the provision.sh script" lists twelve commands that are supposed to be entered by the user, to run the provision script on each node. This part is tedious, error prone, and if you discover you made a mistake in the configuration and need to re-deploy, you have to do it all again. So the purpose of installer.sh is to automate the process of going through each node one-by-one and running the provision script.

It also encourages good habits by automatically putting your configuration in git, and automatically committing the configuration every time you run the deploy command.

  • Do we need to do some automated testing on this new script? Do you have a recommended way of manually testing it?

All the jenkins provision test jobs could use it. This one in particular:

https://ci.arvados.org/job/test-provision-multinode-debian11/

  • File provision.sh:L261: The error message mentions ${CONFIG_FILE} but the grep was done recursively over ${CONFIG_DIR} so it'll be confusing when presented to the user.

Sure. Actually I could take the filename out. Previously, it used "grep -q" so there's no output, but I made it so it would print out the actual lines that have "fixme" on them, so it's obvious to the user, and the error message itself doesn't have to say as much.

  • The extensive use of globals on installer.sh makes following it somewhat tricky for me.

That's bash. The functions could maybe be tweaked to take parameters.

A lot of variables get pulled in from local.params. The ones that are used by the script could be explicitly declared and mentioned in comments at the top.

  • The script seems to rely on being run on a Debian-based distro, because it tries to install arvados-client with apt-get.

I don't really want to write a bunch of code to detect the distro, so either we can leave it, or not try to automate it and just tell the user they need to install that package.

Actions #9

Updated by Lucas Di Pentima about 2 years ago

Peter Amstutz wrote in #note-8:

  • It seems that the script needs some documentation.

Agreed. I can share the install guide with you, and as I mentioned I need to update our install docs to describe it. And the code itself needs some comments.

Great, not sure if you were aiming for merging this regardless and complete the docs in a followup stage.

It also encourages good habits by automatically putting your configuration in git, and automatically committing the configuration every time you run the deploy command.

Do you think it would be nice to do an early check to see if ssh is set up correctly and git is installed on all nodes?

That's bash. The functions could maybe be tweaked to take parameters.
A lot of variables get pulled in from local.params. The ones that are used by the script could be explicitly declared and mentioned in comments at the top.

Yes, that would help a lot, thanks!

  • The script seems to rely on being run on a Debian-based distro, because it tries to install arvados-client with apt-get.

I don't really want to write a bunch of code to detect the distro, so either we can leave it, or not try to automate it and just tell the user they need to install that package.

Makes sense, I think the best option would be to just do an early check and notify the user to install the command if missing.

Actions #10

Updated by Peter Amstutz about 2 years ago

18870-installer @ 51f6ce76c7b1e0a297f8265122c2deef436a4065

  • added comments
  • reduce use of global variables & check for undefined variables
  • do the "sync" steps first so any ssh or git problems can be addressed up front

I intend to update our own salt install documentation in another branch, so we can keep the ticket open, but I'd like to

  1. merge this into main and 2.4-release
  2. adjust the jenkins jobs to use this
  3. send the site specific install guide back to the customer so we can do a fresh install whenever they are ready
Actions #11

Updated by Peter Amstutz about 2 years ago

  • Assigned To set to Peter Amstutz
Actions #12

Updated by Lucas Di Pentima about 2 years ago

The script is now much more readable, thank you. LGTM

Actions #13

Updated by Peter Amstutz about 2 years ago

  • Target version changed from 2022-07-06 to 2022-07-20
Actions #14

Updated by Peter Amstutz about 2 years ago

  • Target version changed from 2022-07-20 to 2022-08-03 Sprint
Actions #15

Updated by Peter Amstutz about 2 years ago

  • Target version changed from 2022-08-03 Sprint to 2022-08-17 sprint
Actions #16

Updated by Peter Amstutz about 2 years ago

  • Target version changed from 2022-08-17 sprint to 2022-08-31 sprint
Actions #17

Updated by Peter Amstutz about 2 years ago

19215-installer @ aa9d34fcd23b035000021615c0261174eb92c54c

Update the multi-node salt install to use installer.sh and include more detail about configuring on AWS.

Actions #18

Updated by Lucas Di Pentima about 2 years ago

Some comments:

  • Last sentence truncated at file doc/_includes/_multi_host_install_custom_certificates.liquid
  • At file doc/install/salt-multi-host.html.textile.liquid
    • Typo : "...These instructions include speciic details..."
    • Line 122-123: I would resist the temptation of giving too specific AWS instructions and instead maybe give an example of an IPv4 CIDR, wdyt?
    • I think section "Using a Let’s Encrypt certificate" needs the (AWS specific) mark
    • Line 139 (Configure Keep on S3): it mention an image that was created, which image is that? the following instructions are just a numbered list of just one item, do you whink it would make better sense to just not use a list? or maybe there're other edits that didn't got included yet? (or maybe it's a copy&paste error?)
    • Line 235: Mentions a cluster hostname outside our example hostname scheme.
    • Line 260: The test section could be eliminated in favor of the automated diagnostics.
Actions #19

Updated by Peter Amstutz about 2 years ago

  • Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
Actions #20

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Actions #21

Updated by Peter Amstutz almost 2 years ago

  • Related to Idea #18685: Synchronize configuration on multi-node cluster added
Actions #22

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Actions #23

Updated by Peter Amstutz almost 2 years ago

  • Related to Bug #19214: Mention updating /etc/hosts added
Actions #24

Updated by Peter Amstutz almost 2 years ago

19215-installer @ 73d236aafb4bc811f7a88cc53d7b15a4718c32bd

  • big rewrite of the single and multi host installer pages inspired by customer experiences
  • added page for the diagnostics tool, and added links on the other pages to recommend using it
  • hid pages for things we don't/can't actually support right now -- sections about Kubernetes, Arvados Composer
  • removed some unnecessary information, such as legacy linux distros we don't support any more
  • I made a couple of small tweaks to the installer.sh script to pipe to "tee" to capture run logs by default. I haven't tested it, we really ought to.
  • I made a tweak to the single host installs, because it was putting keep blocks in /tmp (!?) I adjusted it to put them in /var/lib/arvados/keep, and then added a salt config item to create that directory -- but that needs to be tested

I think for due diligence we definitely need to go through at least the single host install process.

Actions #25

Updated by Peter Amstutz almost 2 years ago

19215-installer @ 28be1aa52f405278fc41b2b1b67b19d5d7ef5fef

Went through the process inside a Docker container. This was helpful to debug some things but I ended up adding a note specifically advising against it (in particular, lightweight containers don't have an init system, and all our packages are designed to be managed by systemd).

We should spin up a VM somewhere and go through it and make sure it works.

Actions #26

Updated by Lucas Di Pentima almost 2 years ago

This is my first review pass, tested with a local Debian 11 VM:

Note: these comments are related to the single-node single-hostname method

  • Limitations: mentions the "/tmp" dir as being used for storing keep blocks, and now its on a different directory.
  • Cluster ID and base domain: I think references to variables like ${CLUSTER} and ${DOMAIN} would be more visible by enclosing them between '@'
  • Additional prerequisites: I got syntax errors on the passwordless sudo config, and it worked by setting the file to: %sudo ALL = (ALL) NOPASSWD: ALL (note the all-caps difference)
  • Download the installer: I would recommend changing the instructions to point to Github's servers, as the checkout from our own is super slow.
  • Initialize the installer: Mentions that he installer.sh script will synchronize the config between all nodes, but the doc page is about a single node install.
  • Edit local.params: the DATABASE_PASSWORD instructions could be merged with the KEY/TOKEN random string generation comments.
  • Install the CA root certificate (SSL_MODE=self-signed only): I think this section should be better placed just between "Begin installation" and "Confirm the cluster is working" so that the diagnostics run succeeds at first try
  • Confirm the cluster is working: probably will want to add a comment on installing arvados-client (or making the installer install it by default on the node), to avoid surprises on people following the guide in a "step-by-step/copy&pasting" fashion.
  • I'm getting a keepstore failure on the logs: keepstore: {"error":"error initializing volume xarv1-nyw5e-000000000000000: stat /var/lib/arvados/keep: no such file or directory","level":"error","msg":"exiting","time":"2022-10-06T11:34:02.552842224-03:00"}
  • I'm also getting a diagnostics error, and the odd thing about it is that the nginx service logs don't report anything:
    lucas@debian-gnu-linux-11:~/setup-arvados-xarv1$ ./installer.sh diagnostics -internal-client
    /usr/bin/arvados-client
    INFO      10: getting discovery document from https://xarv1.example.com/discovery/v1/apis/arvados/v1/rest
    ERROR     10: getting discovery document from https://xarv1.example.com/discovery/v1/apis/arvados/v1/rest (22 ms): request failed: https://xarv1.example.com/discovery/v1/apis/arvados/v1/rest: 500 Internal Server Error
    INFO      20: getting exported config from https://xarv1.example.com/arvados/v1/config
    ERROR     20: getting exported config from https://xarv1.example.com/arvados/v1/config (0 ms): request failed: https://xarv1.example.com/arvados/v1/config: 500 Internal Server Error
    ...
    

General suggestions

  • If we add --state-output=mixed to the salt-call execution in provision.sh Line 834, the output of the command will be a LOT less noisier, only reporting details when something fails. This is super helpful for debugging and probably the admin running the script won't care of successful config changes (or even non-changes).
Actions #27

Updated by Peter Amstutz almost 2 years ago

  • Related to Idea #19364: Document use of diagnostics & health check to check running versions, config file matching, & overall cluster functioning added
Actions #28

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from 2022-10-12 sprint to 2022-10-26 sprint
Actions #29

Updated by Peter Amstutz almost 2 years ago

Lucas Di Pentima wrote in #note-26:

This is my first review pass, tested with a local Debian 11 VM:

Note: these comments are related to the single-node single-hostname method

  • Limitations: mentions the "/tmp" dir as being used for storing keep blocks, and now its on a different directory.

Fixed.

  • Cluster ID and base domain: I think references to variables like ${CLUSTER} and ${DOMAIN} would be more visible by enclosing them between '@'

I adjusted it in a few places, let me know if that's what you had in mind?

  • Additional prerequisites: I got syntax errors on the passwordless sudo config, and it worked by setting the file to: %sudo ALL = (ALL) NOPASSWD: ALL (note the all-caps difference)

Fixed.

  • Download the installer: I would recommend changing the instructions to point to Github's servers, as the checkout from our own is super slow.

Fixed.

  • Initialize the installer: Mentions that he installer.sh script will synchronize the config between all nodes, but the doc page is about a single node install.

The text is shared between them, I tweaked it.

  • Edit local.params: the DATABASE_PASSWORD instructions could be merged with the KEY/TOKEN random string generation comments.

I agree but I left it as is for now. There's extra notes about quoting the database password.

  • Install the CA root certificate (SSL_MODE=self-signed only): I think this section should be better placed just between "Begin installation" and "Confirm the cluster is working" so that the diagnostics run succeeds at first try

Fixed.

  • Confirm the cluster is working: probably will want to add a comment on installing arvados-client (or making the installer install it by default on the node), to avoid surprises on people following the guide in a "step-by-step/copy&pasting" fashion.

Fixed.

  • I'm getting a keepstore failure on the logs: keepstore: {"error":"error initializing volume xarv1-nyw5e-000000000000000: stat /var/lib/arvados/keep: no such file or directory","level":"error","msg":"exiting","time":"2022-10-06T11:34:02.552842224-03:00"}

That should be fixed now, the thing I wanted to do needed to be in the "states" directory not "pillars".

  • I'm also getting a diagnostics error, and the odd thing about it is that the nginx service logs don't report anything:
    [...]

I don't know anything about this, if you go through the install again and still get this again maybe you can diagnose it?

General suggestions

  • If we add --state-output=mixed to the salt-call execution in provision.sh Line 834, the output of the command will be a LOT less noisier, only reporting details when something fails. This is super helpful for debugging and probably the admin running the script won't care of successful config changes (or even non-changes).

I added this, I agree the default output is too verbose.

Updated 19215-installer @ feb45077eb3b766df63d048d492aaf7aca3d2be2

Actions #30

Updated by Lucas Di Pentima almost 2 years ago

Review update:

  • There's a 'l' char introduced in the tools/salt-install/config_examples/multi_host/aws/states/custom_certs.sls template (line 15)
  • When running ./installer.sh diagnostics I'm getting a failure because the installer script doesn't set the correct port on ARVADOS_API_HOST, making the diagnostics to attempt running against workbench2's endpoint.
  • When running arvados-client diagnostics with the correct envvars, webshell is failing (when using curl with the configured WebShell service URL, I'm getting a nginx welcome page.
    • I've tracked the problem at the provision.sh script, line 651: there's no nginx_shell_configuration.sls pillar template, I'm working on fixing it.
Actions #31

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from 2022-10-26 sprint to 2022-11-09 sprint
Actions #32

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from 2022-11-09 sprint to 2022-11-23 sprint
Actions #33

Updated by Peter Amstutz almost 2 years ago

  • Target version changed from 2022-11-23 sprint to 2022-12-07 Sprint
Actions #34

Updated by Lucas Di Pentima almost 2 years ago

Updates at 134a34e7b

  • Fixed use of controller's port when running diagnostics.
  • Fixes webshell nginx config.
  • Fixes various typos on the documentation.
  • Fixes the installer script so that it honors the ${DEPLOY_USER} configuration.
  • Fixes IAM policy example documentation.
  • Adds Terraform code for multi-host infrastructure automated deployment on AWS.
  • Adds documentation on the multi-host install page.
  • Fixes the provision.sh script so that it uses our own postgres-formula fork while we wait for the proper fix to get merged. See: https://github.com/saltstack-formulas/postgres-formula/issues/327

Peter: It would be convenient for you to follow the new instructions to see if there's some missing indication or edge case on the terraform code. As a tip: don't use t2.micro instances because they take 3X the amount of time when you run the salt-installer on them versus the m5a.large (the current default).

Possible improvements

  • Ability to set up different instance types per service host
  • Ability to set up RDS as an external database
  • Compute node isolation on a separate private subnet
  • Ability to set up different filesystem size per service host
Actions #35

Updated by Lucas Di Pentima almost 2 years ago

Update at d6cccd7cd

  • Adds deploy_user output so that the admin can use it to set up the local.params file correctly.
Actions #36

Updated by Peter Amstutz almost 2 years ago

arvados/tools/salt-install/terraform/aws/data-storage has a file called .terraform.lock.hcl is that supposed to be there?

# Set to true if the database server won't be running in any service instance.
# Default: false
# use_external_db = true

The formatting is slightly confusing, the line # Default: false looks like it could be something that is supposed to be uncommented, but I think you're just saying that since use_external_db = true it is commented out, the default value is false.

Actions #37

Updated by Peter Amstutz almost 2 years ago

I've gone through the whole process. I incorporated terraform into the installer.sh script and I made a bunch of additional documentation edits to try and clarify things help the flow.

I think we're almost done, if you want to review my most recent edits?

19215-installer @ afe3ee7d4dc3c5820e3af561f81c8267c671180b

Actions #38

Updated by Lucas Di Pentima almost 2 years ago

Just a couple of minor details

  • At tools/salt-install/local.params.example.multiple_hosts - Lines 49-51: I believe those *_INT_IP vars should also be set to 10.1.1.x, right?
  • At doc/_includes/_download_installer.liquid - Line 31: I believe the 'installed' word could be removed from the sentence.

One thing that I realized I forgot to set up is a NAT gateway so that the compute nodes have internet access, I left it for later because for that I needed an additional public IP and I couldn't get one from my personal account. Would you like me to add that?

Actions #39

Updated by Peter Amstutz almost 2 years ago

Lucas Di Pentima wrote in #note-38:

Just a couple of minor details

  • At tools/salt-install/local.params.example.multiple_hosts - Lines 49-51: I believe those *_INT_IP vars should also be set to 10.1.1.x, right?
  • At doc/_includes/_download_installer.liquid - Line 31: I believe the 'installed' word could be removed from the sentence.

Fixed.

One thing that I realized I forgot to set up is a NAT gateway so that the compute nodes have internet access, I left it for later because for that I needed an additional public IP and I couldn't get one from my personal account. Would you like me to add that?

Yes, I think that would be a good idea.

Actions #40

Updated by Lucas Di Pentima almost 2 years ago

Updates at 92aa40f64

Additional review comments:

  • On local.params, I think CLUSTER_INT_CIDR var should be corrected to: 10.1.0.0/16 (already fixed in 2f69ea48b)
  • Do you think it would be a good idea to add (cd terraform/services && terraform output letsencrypt_iam_secret_access_key) after the last terraform apply so that the operator doesn't need to do it themselves? If so, we might need to update the docs removing the instructions for this.
Actions #41

Updated by Lucas Di Pentima almost 2 years ago

Updates at 370014043

  • Edits installer.sh to show letsencrypt_iam_secret_access_key's value.
  • Updates documentation reflecting latest changes.
Actions #42

Updated by Peter Amstutz almost 2 years ago

  • Status changed from In Progress to Resolved
Actions #43

Updated by Peter Amstutz over 1 year ago

  • Release set to 47
Actions

Also available in: Atom PDF