Salt Installer Features » History » Revision 11
Lucas Di Pentima, 07/03/2024 08:59 PM
Introduction¶
To be able to plan for a new Arvados deployment tool, we need to list all the features our current "salt installer" supports. In broad terms what we call the "salt installer" consists of the following parts:
The "arvados-formula" salt formula¶
Hosted at https://github.com/arvados/arvados-formula, this code is a group of salt states & pillars that takes care of installing Arvados packages and setting up the services needed to run a cluster. In this repo there's also the "provision script", meant to enable anyone to use the arvados-formula without needing a full-fledged master+minions salt installation. The provision script installs salt in "masterless mode", and it's mostly useful for the single-host use case, where someone needs a complete Arvados cluster running on a single system, for testing purposes.
The Terraform code¶
For multi-host deployments in the cloud (AWS only at the moment), we wrote a set of Terraform files that manage networking, access control, data storage and service node resources. They speed up the initial setup and make it possible to quickly modify the infrastructure once it's deployed. This code outputs a set of useful data that needs to be fed as input to the installer script described below.
The "installer.sh" script¶
To make the above easy to use in a multi-host (e.g. production) setting, the installer script takes care of setting up a local git repository that holds the installer files, distributing those files to the hosts that will take part in a deployment, and orchestrating the execution of the provision script on each host, each with its own configuration. This script relies heavily on search & replace operations using sed that modify templates which are in turn applied to salt, so adding features gets complicated when we need to manage two levels of templating.
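The two levels of templating can be illustrated with a minimal Python sketch. It assumes the first level is a literal token replacement (what a sed substitution does) and that the second level is salt's default Jinja rendering. The token names, their values and the template text are hypothetical and are not taken from the installer.

```python
import re

# Hypothetical first-level placeholders and values (assumptions for this sketch).
FIRST_LEVEL = {"__CLUSTER__": "xarv1", "__DOMAIN__": "example.com"}


def apply_first_level(template: str, values: dict[str, str]) -> str:
    """Replace literal placeholder tokens, like a sed 's/TOKEN/value/g' pass."""
    pattern = re.compile("|".join(re.escape(token) for token in values))
    return pattern.sub(lambda match: values[match.group(0)], template)


# A pillar-like template. The {{ ... }} expression belongs to the second
# (salt/Jinja) level and has to survive the first pass untouched.
TEMPLATE = "url: https://__CLUSTER__.__DOMAIN__/\nhost: {{ grains['fqdn'] }}\n"

rendered = apply_first_level(TEMPLATE, FIRST_LEVEL)
print(rendered)
```

Any new feature has to keep both syntaxes from colliding in the same file, which is the complication described above.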
Detailed list of features¶
Below is the list of functionality that each part of the installer provides. We aim to list everything that will likely need to be implemented in the new version of the tool. The features are listed in the order in which an operator currently handles them.
Terraform deployment¶
As suggested in the book Terraform: Up & Running, the Terraform code is explicitly split into several sections to limit the "blast radius" of a potential mistake. The sections below are applied in the described order to build the complete cloud infrastructure needed to install Arvados.
Networking layer¶
- Allows the operator to deploy new or use existing network resources, like VPC, security group & subnets.
- Creates an S3 endpoint and route so that keepstore nodes have direct access.
- Sets up Internet and NAT gateways to give nodes outbound network access.
- Sets up the security group that allows communication between nodes in the VPC, and also inbound SSH & HTTP access.
- Manages Route53 domain names from a customizable list of hosts, with an optional split-horizon configuration.
- Creates credentials for Let's Encrypt to be able to work with Route53 from the service nodes.
- Optionally creates Elastic IP resources for user-facing hosts (controller, workbench).
Input parameters¶
These are optional if not explicitly stated as required.
- AWS region (required)
- Cluster prefix (required)
- Domain name (required)
- "Private only" flag
- VPC, security group, public and private subnet IDs
- "Use RDS" flag
- RDS additional subnet ID
- List of user facing service node names
- List of internal service node names
- Node name to private IP address map
- DNS alias records to node name map
Data layer¶
- Creates the S3 bucket needed for Keep blocks storage.
- Creates keepstore & compute node roles with policies that grant S3 access to the created bucket.
Input parameters¶
- "Use external DB" flag -- Not really used by anything, but including it for completeness' sake.
Service layer¶
- Optionally creates an RDS instance as the database service with a sensible set of default values that can be customized.
- Creates an AWS secret to hold the TLS certificate private key's decrypting password (for cases where the TLS certificate is provided by the user).
- Creates policy and instance profiles so that every service node has access to the above secret.
- Creates a policy that gives permissions to compute nodes so that EBS-autoscale filesystems work.
- Creates policy, role & instance profile so that the dispatcher node can do its work (launching EC2 instances, listing them, etc.)
- Creates the service nodes from the list of host names defined in the networking layer, assigning public IP addresses to the nodes that need them.
Input parameters¶
These are optional if not explicitly stated as required.
- SSH public key file path: so that the installer script can log into the nodes without password.
- Node name to Instance type map
- Node name to volume size map
- "Use RDS" flag
- RDS username & password, instance type, version, allocated and max storage size, backup retention period, backup before deletion and final backup name parameters.
- TLS certificate private key decryption password secret name prefix
- Username for deployment
- Instance AMI
Installer script¶
The installer.sh script provides a handful of useful features. Some of them will be needed in some form in the new tool, as they are not workarounds for salt shortcomings but are necessary in some or all styles of deployment.
- Selective deployment: Sometimes doing a quick update on a single node is enough.
- Deployment ordering: When doing a full deploy run, some nodes need to be updated before others. The current ordering scheme is:
  - Database node
  - Controller node(s): To be able to perform rolling updates on deployments with balanced controllers, it removes the controller node about to be updated from the balancer's pool on each iteration.
  - Balancer node (if present)
  - Everything else
- Optional use of a jump host: In some situations, using a reachable jump host is needed for the installer to be able to connect to internal cluster nodes like the database, shell or even keepstore. This will depend on whether the installer is run from the same network as the cluster or from the outside.
- Secret vs Non-secret configuration handling: Secret config data includes the cluster's default admin account password, database credentials, the dispatcher's private SSH key, etc. These need to be kept separate from the rest of the configuration parameters so that they can be placed on secure storage if needed.
- General sanity checks: The installer script does some checks before a deploy run, like:
  - Node connectivity and SSH access.
  - TLS certificate existence when not using Let's Encrypt.
- Cluster Diagnostics test launching: To confirm everything is working correctly, it runs arvados-client diagnostics from the local host or the shell node.
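The deployment ordering and the rolling controller update listed above can be sketched in a few lines of Python. This is only an illustration, not the logic of installer.sh: the node names are hypothetical, and ranking a node by the earliest-deploying role it holds is an assumption made for this sketch.

```python
# Role priority taken from the ordering listed above; lower values deploy first.
ROLE_PRIORITY = {"database": 0, "controller": 1, "balancer": 2}
EVERYTHING_ELSE = 3


def deploy_order(node_roles):
    """Return node names sorted by the earliest-deploying role they hold."""
    def priority(node):
        return min(ROLE_PRIORITY.get(role, EVERYTHING_ELSE) for role in node_roles[node])
    # sorted() is stable, so nodes with equal priority keep their input order.
    return sorted(node_roles, key=priority)


def rolling_pools(controllers):
    """Yield (node being updated, controllers left in the balancer's pool)."""
    for node in controllers:
        yield node, [other for other in controllers if other != node]


# Hypothetical node names and role assignments, for illustration only.
NODES = {
    "shell": ["shell"],
    "balancer": ["balancer"],
    "controller1": ["controller"],
    "controller2": ["controller"],
    "database": ["database"],
    "keep0": ["keepstore"],
}

order = deploy_order(NODES)
pools = list(rolling_pools(["controller1", "controller2"]))
```

A selective deployment would skip deploy_order entirely and act on a single node name.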
Salt installer¶
Terraform's output data (VPC and subnet IDs, various credentials, Route53 domain name servers, etc.) gets used by the installer and provision scripts to install & configure the necessary software on each host.
A "node-to-roles" mapping is declared as part of the provision script's configuration. Each role is described below.
'database' role¶
Can be overridden to use an external database service (like AWS RDS).
- Installs a PostgreSQL database server.
- Configures PG user & database for Arvados, enabling the pg_trgm extension.
- Configures PG server ACLs to allow access from localhost, websocket, keepbalance and controller nodes.
- Installs Prometheus node and PG exporters.
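As an illustration of the ACL step, PostgreSQL host-based access rules live in pg_hba.conf as lines of the form "host DATABASE USER ADDRESS METHOD". The sketch below builds such lines for a list of client addresses. The database and user names, the IP addresses and the md5 method are assumptions for this example; how the formula actually writes the ACLs is not shown on this page.

```python
import ipaddress


def pg_hba_lines(database, user, client_ips, method="md5"):
    """Build one pg_hba.conf 'host' record per client address."""
    lines = []
    for ip in client_ips:
        # A bare address becomes a single-host network (/32 for IPv4).
        network = ipaddress.ip_network(ip)
        lines.append(f"host {database} {user} {network} {method}")
    return lines


# Hypothetical addresses for localhost and the controller node.
acl = pg_hba_lines("arvados", "arvados", ["127.0.0.1", "10.1.1.11"])
print("\n".join(acl))
```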
'controller' role¶
- Installs nginx, passenger and PG client libraries.
  - If in "balanced mode", only sets up HTTP nginx, as the balancer will act as the TLS termination proxy.
- From the arvados.controller & arvados.api formula states
  - Installs rvm if required -- this won't be necessary anymore as we'll be using the distro's provided ruby packages.
  - Installs arvados-api-server, arvados-controller
  - Runs the services and waits up to 2 minutes for the controller service to answer requests, so that Arvados resource creation works in later stages.
  - If using an external database service, it makes sure the pg_trgm extension is enabled.
- Sets up logrotate to rotate the RailsAPI's logs daily, keeping the last year of logs. This is because these files are not inside /var/log/
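The "waits up to 2 minutes" step is a poll-with-deadline pattern. A minimal sketch follows. The 120-second timeout comes from the text above, while the 5-second polling interval and the check callable (for instance, an HTTP request to the controller) are assumptions. The clock and sleep parameters exist only to make the function testable without real waiting.

```python
import time


def wait_until_ready(check, timeout=120.0, interval=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False  # the service never answered within the time budget
        sleep(interval)
```

A caller would pass a function that returns True once the controller answers a request, and abort the deploy run if wait_until_ready returns False.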
'monitoring' role¶
- Installs & configures Nginx, Prometheus, Node exporter, Blackbox exporter and Grafana.
- Nginx configuration details
  - Sets up basic authentication for the Prometheus website (as it doesn't seem to provide its own access controls)
  - Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
- Prometheus configuration details
  - Sets configurable data retention period
  - Correctly configures multiple controller nodes in balanced configurations.
- Grafana configuration details
  - Sets up admin user & password with grafana-cli
  - Installs custom dashboards
'balancer' role¶
- Installs Nginx with a round-robin balanced upstream configuration.
- Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
'workbench/workbench2' role¶
- From arvados.workbench2 formula state
  - Installs arvados-workbench2 package
- Installs & configures nginx
- Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
- Uninstalls workbench1 -- this might not be needed in future versions.
'webshell' role¶
- Installs an nginx virtualhost that uses the shell node's shellinabox service as the upstream.
- Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
'keepproxy' role¶
- From arvados.keepproxy formula state
  - Installs arvados-keepproxy and runs the service
- Installs & configures nginx
- Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
'keepweb' role¶
- From arvados.keepweb formula state
  - Installs keep-web and runs the service
- Installs & configures nginx
- Sets up nginx's "download" and "collections" virtualhosts
- Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
'websocket' role¶
- From arvados.websocket formula state
  - Installs arvados-ws and runs the service
- Installs & configures nginx
- Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
'dispatcher' role¶
- From arvados.dispatcher formula state
  - Installs arvados-dispatch-cloud and runs the service
'keepbalance' role¶
- From arvados.keepbalance formula state
  - Installs the keep-balance package and runs the service
'keepstore' role¶
- From arvados.keepstore formula state
  - Installs keepstore and runs the service
'shell' role¶
- Installs docker
- Installs sudo, configures it to allow password-less access to "sudo" group members.
- From arvados.shell formula state
  - Installs jq, arvados-login-sync, arvados-client, arvados-src, libpam-arvados-go, python3-arvados-fuse, python3-arvados-python-client, python3-arvados-cwl-runner, python3-crunchstat-summary and shellinabox
  - Installs gems: arvados-cli, arvados-login-sync
  - Creates a Virtual Machine record for the shell node and sets a scoped 'login' token for it.
- Queries the API server for the created virtual machine with the same name as its hostname, and configures cron to run arvados-login-sync with the necessary credentials.
Default role mapping¶
By default the installer deploys a 4-node cluster with only 2 of them needing public IP addresses (in case of a publicly accessible cluster):
- Controller node: database & controller roles
- Workbench node: monitoring, workbench, workbench2, webshell, keepproxy, keepweb, websocket, dispatcher and keepbalance roles
- Keep0 node: keepstore role
- Shell node: shell role
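The default mapping above can be written down as a plain data structure, which is roughly what a new tool would need as input. In this sketch the role names come from the list above, while the lowercase node keys and the helper function are hypothetical and do not reflect the provision script's actual configuration syntax.

```python
# Default node-to-roles mapping, transcribed from the list above.
# The node keys are illustrative names, not real host names.
DEFAULT_ROLES = {
    "controller": ["database", "controller"],
    "workbench": [
        "monitoring", "workbench", "workbench2", "webshell", "keepproxy",
        "keepweb", "websocket", "dispatcher", "keepbalance",
    ],
    "keep0": ["keepstore"],
    "shell": ["shell"],
}


def nodes_with_role(mapping, role):
    """Return the names of every node that carries the given role."""
    return [node for node, roles in mapping.items() if role in roles]


print(nodes_with_role(DEFAULT_ROLES, "keepweb"))
```

Inverting the mapping this way is useful for steps that need to know where a role runs, such as the database ACLs that reference the controller, websocket and keepbalance nodes.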