Support #21890

Document everything terraform and salt installs do

Added by Peter Amstutz about 1 month ago. Updated 11 days ago.

Status: Resolved
Priority: Normal
Assigned To: Lucas Di Pentima
Category: Deployment
Due date:
Story points: -

Description

We are considering migrating from our current stack (terraform + installer.sh + provision.sh + salt) to an Ansible playbook that handles everything (cloud provisioning + software installation + configuration). The reasons for this are:

  • fewer tools to learn / less complexity / eliminating shell script glue
  • more robust ecosystem and long term support (salt and terraform are both in stormy waters due to churn at their main sponsors)

As a first step, we should make a detailed list of all the things that the current stack is responsible for, so that we can ensure that a new installer covers them all.


Subtasks (0 open, 1 closed)

Task #21922: Review https://dev.arvados.org/projects/arvados/wiki/Salt_Installer_Features (Resolved, Lucas Di Pentima, 07/10/2024)
#1

Updated by Peter Amstutz about 1 month ago

  • Description updated (diff)
#2

Updated by Peter Amstutz about 1 month ago

  • Assigned To set to Lucas Di Pentima
#3

Updated by Lucas Di Pentima 27 days ago

  • Status changed from New to In Progress
#4

Updated by Peter Amstutz 27 days ago

We should evaluate the pros and cons of rewriting the parts that set up cloud resources in Ansible versus migrating to OpenTofu and integrating the two (probably by having Ansible invoke Terraform/OpenTofu as a step). I'm concerned that rewriting that part might be a lot of work for not much benefit.
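If the OpenTofu route is taken, the integration step could be as thin as a wrapper like the sketch below (all directory, inventory, and playbook names are hypothetical; Ansible's `community.general.terraform` module could drive the same thing from inside a playbook):

```shell
# Sketch: keep cloud provisioning in OpenTofu and have the installer
# invoke it before configuration management runs. Paths are illustrative.
tofu -chdir=terraform init
tofu -chdir=terraform apply -auto-approve

# Export provisioning outputs (e.g. host addresses) for the inventory,
# then run the configuration playbook against the new hosts:
tofu -chdir=terraform output -json > inventory/tf-outputs.json
ansible-playbook -i inventory/hosts.yml site.yml
```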

#5

Updated by Peter Amstutz 19 days ago

  • Target version changed from Development 2024-07-03 sprint to Development 2024-07-24 sprint
#6

Updated by Lucas Di Pentima 18 days ago

I've written the following wiki page with hopefully a complete set of features we currently have: https://dev.arvados.org/projects/arvados/wiki/Salt_Installer_Features

#7

Updated by Brett Smith 13 days ago

The wiki page says that the arvados-formula repository includes the provision script, but as best I can tell that's mistaken, the provision script lives under arvados/tools/salt-install. The formula README even specifically calls this out.

As we plan a new installer, I hope we can separate what the installer does from how it's implemented. In my opinion, the important thing is that a new installer does all the same tasks. It does not need to do them in exactly the same way. In my opinion, using the same front-end or configuration formats are non-goals. To that end, I think the list of AWS resources created by Terraform, and the mapping of roles to tasks in Salt, are the most valuable things here.

To give an example of why I think the method is less important, the wiki page spends time talking about all the functionality of the provision script. Ansible has a lot of this functionality built-in, so we can rely on that to get the same results without reimplementing it.

  1. Selective deployment: When you run an Ansible playbook, you can limit the run to specific hosts in your inventory using the --limit option. You can name those hosts by DNS name and/or group.
  2. Deployment ordering: Because Ansible playbooks run linearly, you get this for free. Just order your playbook(s) so it works on nodes in the order you want.
  3. Optional use of a jump host: Ansible connects to hosts using OpenSSH by default, so you can just add configuration to ~/.ssh/config directly to achieve this.
  4. Secret vs. non-secret configuration: You can extend your Ansible configuration with encrypted vaults, and there are various storage options for the encryption key. You can safely commit the encrypted configuration to version control while keeping the encryption key out of band.
  5. Pre-run checks: Because Ansible playbooks run linearly, and by default stop execution at the first failure, you can place pre-checks early in a playbook and have them end the run. That said, I don't think you would need a dedicated check for the specific ones we currently do: if SSH connectivity is bad, an Ansible playbook wouldn't even be able to start; and if an SSL certificate was missing from configuration, that would end the playbook run right there.
  6. Cluster diagnostics and test launching: We could provide a separate playbook to do this if desired. I'm not sure how critical it is. Once you've configured OpenSSH to connect to the cluster directly, it's basically just constructing a command for you. It's nice but doesn't seem critical?
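For concreteness, two of those built-ins at the command line (inventory, playbook, host, and group names here are hypothetical):

```shell
# 1. Selective deployment: restrict a playbook run to one host or group.
ansible-playbook -i inventory.yml site.yml --limit controller.example.com
ansible-playbook -i inventory.yml site.yml --limit keepstore_nodes

# 4. Secret vs. non-secret configuration: encrypt the secrets file in
# place; it can then be committed safely, with the key kept out of band.
ansible-vault encrypt group_vars/all/secrets.yml
ansible-playbook -i inventory.yml site.yml --ask-vault-pass
```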
#8

Updated by Lucas Di Pentima 12 days ago

Brett Smith wrote in #note-7:

The wiki page says that the arvados-formula repository includes the provision script, but as best I can tell that's mistaken, the provision script lives under arvados/tools/salt-install. The formula README even specifically calls this out.

Whoops, major brain fart right there, sorry! I've fixed the provision.sh introduction, thanks.

As we plan a new installer, I hope we can separate what the installer does from how it's implemented. In my opinion, the important thing is that a new installer does all the same tasks. It does not need to do them in exactly the same way. In my opinion, using the same front-end or configuration formats are non-goals. To that end, I think the list of AWS resources created by Terraform, and the mapping of roles to tasks in Salt, are the most valuable things here.

To give an example of why I think the method is less important, the wiki page spends time talking about all the functionality of the provision script. Ansible has a lot of this functionality built-in, so we can rely on that to get the same results without reimplementing it.

This is surprising to me. My intention was to document the "what" and the "why", but not the "how". I've given the page a quick re-read, and I'm still failing to see where I did explain the "how".
Maybe I sometimes mention where things come from, and those parts are being interpreted as explaining how things are done? While reviewing everything we've done on the current installer, I realized how dispersed the code is, so I thought it was important to mention which component did which task, but it wasn't my objective to imply that this is the way the new installer should work, of course.

  1. Selective deployment: When you run an Ansible playbook, you can limit the run to specific hosts in your inventory using the --limit option. You can name those hosts by DNS name and/or group.
    [...]
  5. Pre-run checks: Because Ansible playbooks run linearly, and by default stop execution at the first failure, you can place pre-checks early in a playbook and have them end the run. That said, I don't think you would need a dedicated check for the specific ones we currently do: if SSH connectivity is bad, an Ansible playbook wouldn't even be able to start; and if an SSL certificate was missing from configuration, that would end the playbook run right there.

Yes, I've started reading about Ansible and already knew some of its features would be useful for replicating those behaviors. Still, I wanted to list them as feature requirements for completeness' sake. I now see that I explained "how" we do rolling upgrades when using multiple controller nodes; I can remove that if it's unnecessary, but I think listing that we do rolling upgrades is important.

  6. Cluster diagnostics and test launching: We could provide a separate playbook to do this if desired. I'm not sure how critical it is. Once you've configured OpenSSH to connect to the cluster directly, it's basically just constructing a command for you. It's nice but doesn't seem critical?

I agree that doing the diagnostic checks in the installer is scope creep. OTOH, this ticket is about documenting everything we're currently doing. If we decide to leave that part out, that's OK, and it's probably a decision for an implementation ticket.

In summary, I don't see where I over-documented how things are done, but please feel free to trim the wiki page to a simpler version as you see fit. Do you think it's missing any other features?

#9

Updated by Brett Smith 12 days ago

Lucas Di Pentima wrote in #note-8:

This is surprising to me. My intention was to document the "what" and the "why", but not the "how". I've given the page a quick re-read, and I'm still failing to see where I did explain the "how".

Let me start by saying, I'm fine with the wiki page as it is. I'm more making a comment that I hope feeds into the future planning of a possible Ansible installer.

I think sometimes, especially on the ops side of things, as a team we fall into a habit of thinking "version 1 of our code implements feature X, therefore version 2 of our code must also implement feature X." And while I agree the system as a whole needs to support that feature, it doesn't need to happen in our code if we're building on top of other systems that provide it.

Let's take the optional jumphost support as an example. In today's Salt installer, that work is done in provision.sh: it reads your configuration and uses that to decide whether or not to pass additional options to ssh when connecting to a node.

If we write an Ansible installer, I don't expect us to have any analogous code. I think Ansible install documentation just needs to tell users, if you need to use a jumphost, here's how you configure that in SSH/Ansible. And that should be fine.
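A minimal sketch of that SSH-side configuration (host names are hypothetical; Ansible's default OpenSSH connection honors `~/.ssh/config`, so no installer code is involved):

```shell
# Route all connections to cluster hosts through a jumphost by
# appending a rule to the local SSH client configuration:
cat >> ~/.ssh/config <<'EOF'
Host *.xyzzy.example.com
    ProxyJump admin@jump.xyzzy.example.com
EOF
```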

The wiki page spends time talking about which jobs happen where. And that's really helpful for being able to follow the code and get a quick tour. But as we plan a potential replacement, I hope we're all on the same page that mirroring a similar structure is not a requirement by itself. I worry that isn't always the case, and that's where I feel like we get too wound up in the "how" rather than the "what."

#10

Updated by Lucas Di Pentima 12 days ago

Brett Smith wrote in #note-9:

[...]
The wiki page spends time talking about which jobs happen where. And that's really helpful for being able to follow the code and get a quick tour. But as we plan a potential replacement, I hope we're all on the same page that mirroring a similar structure is not a requirement by itself. I worry that isn't always the case, and that's where I feel like we get too wound up in the "how" rather than the "what."

We're on the same page: using the same structure was never my intention, just documenting what exists right now.

Given that the wiki is in good shape, I'll mark this as resolved. Thanks!

#11

Updated by Lucas Di Pentima 12 days ago

  • Status changed from In Progress to Resolved