Project

General

Profile

Actions

Bug #19215

open

improve multi-node installer experience

Added by Peter Amstutz about 2 months ago. Updated about 16 hours ago.

Status:
In Progress
Priority:
Normal
Assigned To:
Category:
Deployment
Target version:
Start date:
06/28/2022
Due date:
% Done:

33%

Estimated time:
(Total: 0.00 h)
Story points:
-

Description

Finally got around to attempting to use the installer and it wasn't a great experience.

Improvements:

  1. The first step using the installer is to run a script and supply the desired configuration template.
    1. This will create a git repository, copy the starting files, and check them in
  2. Anything that needs attention has a FIXME with a long comment
  3. The provision script checks for FIXMEs
  4. The provision script checks that the config is checked in
  5. The provision script runs salt in masterless mode to deploy config changes

Also: ideally we would eliminate local.params and sed and use salt only, but according to Javier you can't use pillarstack (which allows us to declare top level variables and then refer to them later) in masterless mode. We should investigate if that's really true but that's the reasoning behind the current approach.


Subtasks 3 (2 open1 closed)

Task #19221: Review 18870-installerResolvedLucas Di Pentima06/28/2022

Actions
Task #19233: Update salt install docsNewPeter Amstutz06/28/2022

Actions
Task #19402: Review doc updateNewLucas Di Pentima06/28/2022

Actions
Actions #1

Updated by Peter Amstutz about 2 months ago

  • Status changed from New to In Progress
Actions #2

Updated by Peter Amstutz about 2 months ago

  • Status changed from In Progress to New
  • Description updated (diff)
Actions #3

Updated by Peter Amstutz about 2 months ago

  • Target version set to 2022-07-06
  • Category set to Deployment
  • Subject changed from installer experience to improve multi-node installer experience
Actions #4

Updated by Peter Amstutz about 2 months ago

  • Description updated (diff)
Actions #5

Updated by Peter Amstutz about 2 months ago

18870-installer @ 2cc1772745c19e438778c5832daf75f59112b9ab

  • Adds arvados/tools/salt-install/installer.sh

Enables you to assign roles to nodes and then copy the configuration and run the provision script on each node with a single 'deploy' command.

The 'initialize' command creates a git repository and copies the selected files into it.

The 'deploy' command pushes the git repository containing the configuration to each node and then runs the provision script from it.

The 'diagnostics' command runs 'arvados-client diagnostics'.

This branch also contains a small improvement to arvados-login-sync to ensure that errors from running 'usermod' and 'gpasswd' get logged, and a fix to the shell node 'cron' configuration that runs login-sync.

developer-run-tests: #3204

test-provision: #401

Actions #6

Updated by Lucas Di Pentima about 2 months ago

(Please take into account that I almost don't know anything about the salt installer, as I did a couple of reviews on its infancy and then didn't check on it ever since.)

  • It seems that the script needs some documentation.
  • I'm not sure which kind of problems is trying to solve and why those cannot be solved in the provision script itself. Does it tries to simplify what's explained at https://doc.arvados.org/v2.4/install/salt-multi-host.html ?
  • Do we need to do some automated testing on this new script? Do you have a recommended way of manually testing it?
  • File provision.sh:L261: The error message mentions ${CONFIG_FILE} but the grep was done recursively over ${CONFIG_DIR} so it'll be confusing when presented to the user.
  • The extensive use of globals on installer.sh makes following it somewhat tricky for me.
  • The script seems to rely on being run on a Debian-based distro, because it tries to install arvados-client with apt-get.
Actions #7

Updated by Peter Amstutz about 2 months ago

  • Status changed from New to In Progress
Actions #8

Updated by Peter Amstutz about 2 months ago

Lucas Di Pentima wrote in #note-6:

(Please take into account that I almost don't know anything about the salt installer, as I did a couple of reviews on its infancy and then didn't check on it ever since.)

  • It seems that the script needs some documentation.

Agreed. I can share the install guide with you, and as I mentioned I need to update our install docs to describe it. And the code itself needs some comments.

The provision script only operates on the current node. The provision script installs salt, configures it for masterless mode, sets up the salt pillar, and then runs salt which performs the actual installation/configuration using arvados-formula.

In the multi-node install, the section "Run the provision.sh script" lists twelve commands that are supposed to be entered by the user, to run the provision script on each node. This part is tedious, error prone, and if you discover you made a mistake in the configuration and need to re-deploy, you have to do it all again. So the purpose of installer.sh is to automate the process of going through each node one-by-one and running the provision script.

It also encourages good habits by automatically putting your configuration in git, and automatically committing the configuration every time you run the deploy command.

  • Do we need to do some automated testing on this new script? Do you have a recommended way of manually testing it?

All the jenkins provision test jobs could use it. This one in particular:

https://ci.arvados.org/job/test-provision-multinode-debian11/

  • File provision.sh:L261: The error message mentions ${CONFIG_FILE} but the grep was done recursively over ${CONFIG_DIR} so it'll be confusing when presented to the user.

Sure. Actually I could take the filename out. Previously, it used "grep -q" so there's no output, but I made it so it would print out the actual lines that have "fixme" on them, so it's obvious to the user, and the error message itself doesn't have to say as much.

  • The extensive use of globals on installer.sh makes following it somewhat tricky for me.

That's bash. The functions could maybe be tweaked to take parameters.

A lot of variables get pulled in from local.params. The ones that are used by the script could be explicitly declared and mentioned in comments at the top.

  • The script seems to rely on being run on a Debian-based distro, because it tries to install arvados-client with apt-get.

I don't really want to write a bunch of code to detect the distro, so either we can leave it, or not try to automate it and just tell the user they need to install that package.

Actions #9

Updated by Lucas Di Pentima about 2 months ago

Peter Amstutz wrote in #note-8:

  • It seems that the script needs some documentation.

Agreed. I can share the install guide with you, and as I mentioned I need to update our install docs to describe it. And the code itself needs some comments.

Great, not sure if you were aiming for merging this regardless and complete the docs in a followup stage.

It also encourages good habits by automatically putting your configuration in git, and automatically committing the configuration every time you run the deploy command.

Do you think it would be nice to do an early check to see if ssh is set up correctly and git is installed on all nodes?

That's bash. The functions could maybe be tweaked to take parameters.
A lot of variables get pulled in from local.params. The ones that are used by the script could be explicitly declared and mentioned in comments at the top.

Yes, that would help a lot, thanks!

  • The script seems to rely on being run on a Debian-based distro, because it tries to install arvados-client with apt-get.

I don't really want to write a bunch of code to detect the distro, so either we can leave it, or not try to automate it and just tell the user they need to install that package.

Makes sense, I think the best option would be to just do an early check and notify the user to install the command if missing.

Actions #10

Updated by Peter Amstutz about 2 months ago

18870-installer @ 51f6ce76c7b1e0a297f8265122c2deef436a4065

  • added comments
  • reduce use of global variables & check for undefined variables
  • do the "sync" steps first so any ssh or git problems can be addressed up front

I intend to update our own salt install documentation in another branch, so we can keep the ticket open, but I'd like to

  1. merge this into main and 2.4-release
  2. adjust the jenkins jobs to use this
  3. send the site specific install guide back to the customer so we can do a fresh install whenever they are ready
Actions #11

Updated by Peter Amstutz about 2 months ago

  • Assigned To set to Peter Amstutz
Actions #12

Updated by Lucas Di Pentima about 2 months ago

The script is now much more readable, thank you. LGTM

Actions #13

Updated by Peter Amstutz about 1 month ago

  • Target version changed from 2022-07-06 to 2022-07-20
Actions #14

Updated by Peter Amstutz 28 days ago

  • Target version changed from 2022-07-20 to 2022-08-03 Sprint
Actions #15

Updated by Peter Amstutz 15 days ago

  • Target version changed from 2022-08-03 Sprint to 2022-08-17 sprint
Actions #16

Updated by Peter Amstutz about 16 hours ago

  • Target version changed from 2022-08-17 sprint to 2022-08-31 sprint
Actions

Also available in: Atom PDF