Bug #19215
closedimprove multi-node installer experience
Added by Peter Amstutz over 2 years ago. Updated about 2 years ago.
Description
Finally got around to attempting to use the installer and it wasn't a great experience.
Improvements:
- The first step using the installer is to run a script and supply the desired configuration template.
- This will create a git repository, copy the starting files, and check them in
- Anything that needs attention has a FIXME with a long comment
- The provision script checks for FIXMEs
- The provision script checks that the config is checked in
- The provision script runs salt in masterless mode to deploy config changes
Also: ideally we would eliminate local.params
and sed
and use salt only, but according to Javier you can't use pillarstack (which allows us to declare top level variables and then refer to them later) in masterless mode. We should investigate if that's really true but that's the reasoning behind the current approach.
Updated by Peter Amstutz over 2 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 2 years ago
- Status changed from In Progress to New
- Description updated (diff)
Updated by Peter Amstutz over 2 years ago
- Target version set to 2022-07-06
- Category set to Deployment
- Subject changed from installer experience to improve multi-node installer experience
Updated by Peter Amstutz over 2 years ago
18870-installer @ 2cc1772745c19e438778c5832daf75f59112b9ab
- Adds arvados/tools/salt-install/installer.sh
Enables you to assign roles to nodes and then copy the configuration and run the provision script on each node with a single 'deploy' command.
The 'initialize' command creates a git repository and copies the selected files into it.
The 'deploy' command pushes the git repository containing the configuration to each node and then runs the provision script from it.
The 'diagnostics' command runs 'arvados-client diagnostics'.
This branch also contains a small improvement to arvados-login-sync to ensure that errors from running 'usermod' and 'gpasswd' get logged, and a fix to the shell node 'cron' configuration that runs login-sync.
Updated by Lucas Di Pentima over 2 years ago
(Please take into account that I almost don't know anything about the salt installer, as I did a couple of reviews on its infancy and then didn't check on it ever since.)
- It seems that the script needs some documentation.
- I'm not sure which kind of problems is trying to solve and why those cannot be solved in the provision script itself. Does it tries to simplify what's explained at https://doc.arvados.org/v2.4/install/salt-multi-host.html ?
- Do we need to do some automated testing on this new script? Do you have a recommended way of manually testing it?
- File
provision.sh:L261
: The error message mentions${CONFIG_FILE}
but thegrep
was done recursively over${CONFIG_DIR}
so it'll be confusing when presented to the user. - The extensive use of globals on
installer.sh
makes following it somewhat tricky for me. - The script seems to rely on being run on a Debian-based distro, because it tries to install
arvados-client
withapt-get
.
Updated by Peter Amstutz over 2 years ago
- Status changed from New to In Progress
Updated by Peter Amstutz over 2 years ago
Lucas Di Pentima wrote in #note-6:
(Please take into account that I almost don't know anything about the salt installer, as I did a couple of reviews on its infancy and then didn't check on it ever since.)
- It seems that the script needs some documentation.
Agreed. I can share the install guide with you, and as I mentioned I need to update our install docs to describe it. And the code itself needs some comments.
- I'm not sure which kind of problems is trying to solve and why those cannot be solved in the provision script itself. Does it tries to simplify what's explained at https://doc.arvados.org/v2.4/install/salt-multi-host.html ?
The provision script only operates on the current node. The provision script installs salt, configures it for masterless mode, sets up the salt pillar, and then runs salt which performs the actual installation/configuration using arvados-formula.
In the multi-node install, the section "Run the provision.sh script" lists twelve commands that are supposed to be entered by the user, to run the provision script on each node. This part is tedious, error prone, and if you discover you made a mistake in the configuration and need to re-deploy, you have to do it all again. So the purpose of installer.sh is to automate the process of going through each node one-by-one and running the provision script.
It also encourages good habits by automatically putting your configuration in git, and automatically committing the configuration every time you run the deploy command.
- Do we need to do some automated testing on this new script? Do you have a recommended way of manually testing it?
All the jenkins provision test jobs could use it. This one in particular:
https://ci.arvados.org/job/test-provision-multinode-debian11/
- File
provision.sh:L261
: The error message mentions${CONFIG_FILE}
but thegrep
was done recursively over${CONFIG_DIR}
so it'll be confusing when presented to the user.
Sure. Actually I could take the filename out. Previously, it used "grep -q" so there's no output, but I made it so it would print out the actual lines that have "fixme" on them, so it's obvious to the user, and the error message itself doesn't have to say as much.
- The extensive use of globals on
installer.sh
makes following it somewhat tricky for me.
That's bash. The functions could maybe be tweaked to take parameters.
A lot of variables get pulled in from local.params. The ones that are used by the script could be explicitly declared and mentioned in comments at the top.
- The script seems to rely on being run on a Debian-based distro, because it tries to install
arvados-client
withapt-get
.
I don't really want to write a bunch of code to detect the distro, so either we can leave it, or not try to automate it and just tell the user they need to install that package.
Updated by Lucas Di Pentima over 2 years ago
Peter Amstutz wrote in #note-8:
- It seems that the script needs some documentation.
Agreed. I can share the install guide with you, and as I mentioned I need to update our install docs to describe it. And the code itself needs some comments.
Great, not sure if you were aiming for merging this regardless and complete the docs in a followup stage.
It also encourages good habits by automatically putting your configuration in git, and automatically committing the configuration every time you run the deploy command.
Do you think it would be nice to do an early check to see if ssh is set up correctly and git is installed on all nodes?
That's bash. The functions could maybe be tweaked to take parameters.
A lot of variables get pulled in from local.params. The ones that are used by the script could be explicitly declared and mentioned in comments at the top.
Yes, that would help a lot, thanks!
- The script seems to rely on being run on a Debian-based distro, because it tries to install
arvados-client
withapt-get
.I don't really want to write a bunch of code to detect the distro, so either we can leave it, or not try to automate it and just tell the user they need to install that package.
Makes sense, I think the best option would be to just do an early check and notify the user to install the command if missing.
Updated by Peter Amstutz over 2 years ago
18870-installer @ 51f6ce76c7b1e0a297f8265122c2deef436a4065
- added comments
- reduce use of global variables & check for undefined variables
- do the "sync" steps first so any ssh or git problems can be addressed up front
I intend to update our own salt install documentation in another branch, so we can keep the ticket open, but I'd like to
- merge this into main and 2.4-release
- adjust the jenkins jobs to use this
- send the site specific install guide back to the customer so we can do a fresh install whenever they are ready
Updated by Lucas Di Pentima over 2 years ago
The script is now much more readable, thank you. LGTM
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-07-06 to 2022-07-20
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-07-20 to 2022-08-03 Sprint
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-08-03 Sprint to 2022-08-17 sprint
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-08-17 sprint to 2022-08-31 sprint
Updated by Peter Amstutz over 2 years ago
19215-installer @ aa9d34fcd23b035000021615c0261174eb92c54c
Update the multi-node salt install to use installer.sh and include more detail about configuring on AWS.
Updated by Lucas Di Pentima over 2 years ago
Some comments:
- Last sentence truncated at file
doc/_includes/_multi_host_install_custom_certificates.liquid
- At file
doc/install/salt-multi-host.html.textile.liquid
- Typo : "...These instructions include speciic details..."
- Line 122-123: I would resist the temptation of giving too specific AWS instructions and instead maybe give an example of an IPv4 CIDR, wdyt?
- I think section "Using a Let’s Encrypt certificate" needs the (AWS specific) mark
- Line 139 (Configure Keep on S3): it mention an image that was created, which image is that? the following instructions are just a numbered list of just one item, do you whink it would make better sense to just not use a list? or maybe there're other edits that didn't got included yet? (or maybe it's a copy&paste error?)
- Line 235: Mentions a cluster hostname outside our example hostname scheme.
- Line 260: The test section could be eliminated in favor of the automated diagnostics.
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-08-31 sprint to 2022-09-14 sprint
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Updated by Peter Amstutz over 2 years ago
- Related to Idea #18685: Synchronize configuration on multi-node cluster added
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Updated by Peter Amstutz over 2 years ago
- Related to Bug #19214: Mention updating /etc/hosts added
Updated by Peter Amstutz over 2 years ago
19215-installer @ 73d236aafb4bc811f7a88cc53d7b15a4718c32bd
- big rewrite of the single and multi host installer pages inspired by customer experiences
- added page for the diagnostics tool, and added links on the other pages to recommend using it
- hid pages for things we don't/can't actually support right now -- sections about Kubernetes, Arvados Composer
- removed some unnecessary information, such as legacy linux distros we don't support any more
- I made a couple of small tweaks to the installer.sh script to pipe to "tee" to capture run logs by default. I haven't tested it, we really ought to.
- I made a tweak to the single host installs, because it was putting keep blocks in /tmp (!?) I adjusted it to put them in /var/lib/arvados/keep, and then added a salt config item to create that directory -- but that needs to be tested
I think for due diligence we definitely need to go through at least the single host install process.
Updated by Peter Amstutz over 2 years ago
19215-installer @ 28be1aa52f405278fc41b2b1b67b19d5d7ef5fef
Went through the process inside a Docker container. This was helpful to debug some things but I ended up adding a note specifically advising against it (in particular, lightweight containers don't have an init system, and all our packages are designed to be managed by systemd).
We should spin up a VM somewhere and go through it and make sure it works.
Updated by Lucas Di Pentima about 2 years ago
This is my first review pass, tested with a local Debian 11 VM:
Note: these comments are related to the single-node single-hostname method¶
- Limitations: mentions the "/tmp" dir as being used for storing keep blocks, and now its on a different directory.
- Cluster ID and base domain: I think references to variables like ${CLUSTER} and ${DOMAIN} would be more visible by enclosing them between '@'
- Additional prerequisites: I got syntax errors on the passwordless sudo config, and it worked by setting the file to: %sudo ALL = (ALL) NOPASSWD: ALL (note the all-caps difference)
- Download the installer: I would recommend changing the instructions to point to Github's servers, as the checkout from our own is super slow.
- Initialize the installer: Mentions that he installer.sh script will synchronize the config between all nodes, but the doc page is about a single node install.
- Edit local.params: the DATABASE_PASSWORD instructions could be merged with the KEY/TOKEN random string generation comments.
- Install the CA root certificate (SSL_MODE=self-signed only): I think this section should be better placed just between "Begin installation" and "Confirm the cluster is working" so that the diagnostics run succeeds at first try
- Confirm the cluster is working: probably will want to add a comment on installing arvados-client (or making the installer install it by default on the node), to avoid surprises on people following the guide in a "step-by-step/copy&pasting" fashion.
- I'm getting a keepstore failure on the logs: keepstore: {"error":"error initializing volume xarv1-nyw5e-000000000000000: stat /var/lib/arvados/keep: no such file or directory","level":"error","msg":"exiting","time":"2022-10-06T11:34:02.552842224-03:00"}
- I'm also getting a diagnostics error, and the odd thing about it is that the
nginx
service logs don't report anything:lucas@debian-gnu-linux-11:~/setup-arvados-xarv1$ ./installer.sh diagnostics -internal-client /usr/bin/arvados-client INFO 10: getting discovery document from https://xarv1.example.com/discovery/v1/apis/arvados/v1/rest ERROR 10: getting discovery document from https://xarv1.example.com/discovery/v1/apis/arvados/v1/rest (22 ms): request failed: https://xarv1.example.com/discovery/v1/apis/arvados/v1/rest: 500 Internal Server Error INFO 20: getting exported config from https://xarv1.example.com/arvados/v1/config ERROR 20: getting exported config from https://xarv1.example.com/arvados/v1/config (0 ms): request failed: https://xarv1.example.com/arvados/v1/config: 500 Internal Server Error ...
General suggestions¶
- If we add
--state-output=mixed
to thesalt-call
execution inprovision.sh
Line 834, the output of the command will be a LOT less noisier, only reporting details when something fails. This is super helpful for debugging and probably the admin running the script won't care of successful config changes (or even non-changes).
Updated by Peter Amstutz about 2 years ago
- Related to Idea #19364: Document use of diagnostics & health check to check running versions, config file matching, & overall cluster functioning added
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-10-12 sprint to 2022-10-26 sprint
Updated by Peter Amstutz about 2 years ago
Lucas Di Pentima wrote in #note-26:
This is my first review pass, tested with a local Debian 11 VM:
Note: these comments are related to the single-node single-hostname method¶
- Limitations: mentions the "/tmp" dir as being used for storing keep blocks, and now its on a different directory.
Fixed.
- Cluster ID and base domain: I think references to variables like ${CLUSTER} and ${DOMAIN} would be more visible by enclosing them between '@'
I adjusted it in a few places, let me know if that's what you had in mind?
- Additional prerequisites: I got syntax errors on the passwordless sudo config, and it worked by setting the file to: %sudo ALL = (ALL) NOPASSWD: ALL (note the all-caps difference)
Fixed.
- Download the installer: I would recommend changing the instructions to point to Github's servers, as the checkout from our own is super slow.
Fixed.
- Initialize the installer: Mentions that he installer.sh script will synchronize the config between all nodes, but the doc page is about a single node install.
The text is shared between them, I tweaked it.
- Edit local.params: the DATABASE_PASSWORD instructions could be merged with the KEY/TOKEN random string generation comments.
I agree but I left it as is for now. There's extra notes about quoting the database password.
- Install the CA root certificate (SSL_MODE=self-signed only): I think this section should be better placed just between "Begin installation" and "Confirm the cluster is working" so that the diagnostics run succeeds at first try
Fixed.
- Confirm the cluster is working: probably will want to add a comment on installing arvados-client (or making the installer install it by default on the node), to avoid surprises on people following the guide in a "step-by-step/copy&pasting" fashion.
Fixed.
- I'm getting a keepstore failure on the logs: keepstore: {"error":"error initializing volume xarv1-nyw5e-000000000000000: stat /var/lib/arvados/keep: no such file or directory","level":"error","msg":"exiting","time":"2022-10-06T11:34:02.552842224-03:00"}
That should be fixed now, the thing I wanted to do needed to be in the "states" directory not "pillars".
- I'm also getting a diagnostics error, and the odd thing about it is that the
nginx
service logs don't report anything:
[...]
I don't know anything about this, if you go through the install again and still get this again maybe you can diagnose it?
General suggestions¶
- If we add
--state-output=mixed
to thesalt-call
execution inprovision.sh
Line 834, the output of the command will be a LOT less noisier, only reporting details when something fails. This is super helpful for debugging and probably the admin running the script won't care of successful config changes (or even non-changes).
I added this, I agree the default output is too verbose.
Updated 19215-installer @ feb45077eb3b766df63d048d492aaf7aca3d2be2
Updated by Lucas Di Pentima about 2 years ago
Review update:
- There's a 'l' char introduced in the
tools/salt-install/config_examples/multi_host/aws/states/custom_certs.sls
template (line 15) - When running
./installer.sh diagnostics
I'm getting a failure because the installer script doesn't set the correct port onARVADOS_API_HOST
, making the diagnostics to attempt running against workbench2's endpoint.- Already fixed at ca44f32
- When running
arvados-client diagnostics
with the correct envvars, webshell is failing (when using curl with the configured WebShell service URL, I'm getting a nginx welcome page.- I've tracked the problem at the
provision.sh
script, line 651: there's nonginx_shell_configuration.sls
pillar template, I'm working on fixing it.
- I've tracked the problem at the
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-10-26 sprint to 2022-11-09 sprint
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-11-09 sprint to 2022-11-23 sprint
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-11-23 sprint to 2022-12-07 Sprint
Updated by Lucas Di Pentima about 2 years ago
Updates at 134a34e7b
- Fixed use of controller's port when running diagnostics.
- Fixes webshell nginx config.
- Fixes various typos on the documentation.
- Fixes the installer script so that it honors the
${DEPLOY_USER}
configuration. - Fixes IAM policy example documentation.
- Adds Terraform code for multi-host infrastructure automated deployment on AWS.
- Adds documentation on the multi-host install page.
- Fixes the
provision.sh
script so that it uses our ownpostgres-formula
fork while we wait for the proper fix to get merged. See: https://github.com/saltstack-formulas/postgres-formula/issues/327
Peter: It would be convenient for you to follow the new instructions to see if there's some missing indication or edge case on the terraform code. As a tip: don't use t2.micro
instances because they take 3X the amount of time when you run the salt-installer on them versus the m5a.large
(the current default).
Possible improvements¶
- Ability to set up different instance types per service host
- Ability to set up RDS as an external database
- Compute node isolation on a separate private subnet
- Ability to set up different filesystem size per service host
Updated by Lucas Di Pentima about 2 years ago
Update at d6cccd7cd
- Adds
deploy_user
output so that the admin can use it to set up thelocal.params
file correctly.
Updated by Peter Amstutz about 2 years ago
arvados/tools/salt-install/terraform/aws/data-storage
has a file called .terraform.lock.hcl
is that supposed to be there?
# Set to true if the database server won't be running in any service instance. # Default: false # use_external_db = true
The formatting is slightly confusing, the line # Default: false
looks like it could be something that is supposed to be uncommented, but I think you're just saying that since use_external_db = true
it is commented out, the default value is false.
Updated by Peter Amstutz about 2 years ago
I've gone through the whole process. I incorporated terraform into the installer.sh script and I made a bunch of additional documentation edits to try and clarify things help the flow.
I think we're almost done, if you want to review my most recent edits?
19215-installer @ afe3ee7d4dc3c5820e3af561f81c8267c671180b
Updated by Lucas Di Pentima about 2 years ago
Just a couple of minor details
- At
tools/salt-install/local.params.example.multiple_hosts
- Lines 49-51: I believe those*_INT_IP
vars should also be set to 10.1.1.x, right? - At
doc/_includes/_download_installer.liquid
- Line 31: I believe the 'installed' word could be removed from the sentence.
One thing that I realized I forgot to set up is a NAT gateway so that the compute nodes have internet access, I left it for later because for that I needed an additional public IP and I couldn't get one from my personal account. Would you like me to add that?
Updated by Peter Amstutz about 2 years ago
Lucas Di Pentima wrote in #note-38:
Just a couple of minor details
- At
tools/salt-install/local.params.example.multiple_hosts
- Lines 49-51: I believe those*_INT_IP
vars should also be set to 10.1.1.x, right?- At
doc/_includes/_download_installer.liquid
- Line 31: I believe the 'installed' word could be removed from the sentence.
Fixed.
One thing that I realized I forgot to set up is a NAT gateway so that the compute nodes have internet access, I left it for later because for that I needed an additional public IP and I couldn't get one from my personal account. Would you like me to add that?
Yes, I think that would be a good idea.
Updated by Lucas Di Pentima about 2 years ago
Updates at 92aa40f64
- Adds private subnet + NAT gatway for compute nodes, so they can access the Internet
- Manually confirmed it's working using the following workflow: https://github.com/common-workflow-lab/cwl_rna_seq
Additional review comments:¶
- On local.params, I think
CLUSTER_INT_CIDR
var should be corrected to: 10.1.0.0/16 (already fixed in 2f69ea48b) - Do you think it would be a good idea to add
(cd terraform/services && terraform output letsencrypt_iam_secret_access_key)
after the last terraformapply
so that the operator doesn't need to do it themselves? If so, we might need to update the docs removing the instructions for this.
Updated by Lucas Di Pentima about 2 years ago
Updates at 370014043
- Edits
installer.sh
to showletsencrypt_iam_secret_access_key
's value. - Updates documentation reflecting latest changes.
Updated by Peter Amstutz about 2 years ago
- Status changed from In Progress to Resolved