Salt Installer Features » History » Version 2

Lucas Di Pentima, 07/02/2024 09:25 PM

h2. Introduction

To be able to plan for a new Arvados deployment tool, we need to list all the features our current "salt installer" supports. In broad terms, what we call the "salt installer" consists of the following parts:

h4. The "arvados-formula" salt formula

Hosted at https://github.com/arvados/arvados-formula, this code is a group of "salt":https://saltproject.io states & pillars that takes care of installing Arvados packages and setting up the services needed to run a cluster. The repo also contains the "provision script", meant to enable anyone to use the @arvados-formula@ without needing a full-fledged master+minions salt installation. The provision script installs salt in "masterless mode", and is mostly useful for the single-host use case, where someone needs a complete Arvados cluster running on a single system for testing purposes.

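In masterless mode the picture is roughly the following: a local top file lists which formula states apply to the host, and @salt-call --local state.apply@ runs them without a master. The state names below are real formula states mentioned in this document; the file layout and host matcher are illustrative, not the provision script's actual generated files.

```yaml
# /srv/salt/top.sls -- illustrative masterless layout.
# Applied with: salt-call --local state.apply
base:
  '*':
    - arvados.repo
    - arvados.api
    - arvados.controller
    - arvados.keepstore
    - arvados.workbench2
```
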
h4. The Terraform code

For multi-host deployments in the cloud (AWS only at the moment), we wrote a set of Terraform files that manage everything from networking, access control, data storage and service node resources, to speed up the initial setup and allow it to be quickly modified once deployed. This code outputs a set of useful data that needs to be fed as input to the installer script described below.

h4. The "installer.sh" script

In order to easily use the above in a multi-host (e.g. production) setting, the installer script takes care of setting up a local git repository that holds the installer files, distributing those files to the hosts that will take part in a deployment, and orchestrating the execution of the provision script on each host, each with its particular configuration. This script relies heavily on search & replace operations using @sed@ that modify templates which will in turn get applied to salt, so adding features gets complicated when we need to manage two levels of templating.

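The two-level templating problem can be pictured with a minimal sketch. The token and file names here are illustrative, not the installer's actual ones: the first level is a @sed@ substitution done by the installer script, the second is a jinja placeholder that salt expands later.

```shell
#!/bin/sh
# Two-level templating sketch; names are illustrative.
# Level 1: an installer.sh-style token (__CLUSTER__), expanded with sed.
# Level 2: a jinja placeholder ({{ nginx_port }}) left for salt to expand.
cat > nginx.conf.tpl <<'EOF'
server_name __CLUSTER__.example.com;
listen {{ nginx_port }};
EOF

# Installer pass: expand only its own tokens; the jinja braces survive
# untouched for the salt pass that runs later on the target host.
sed 's/__CLUSTER__/xyzzy/g' nginx.conf.tpl > nginx.conf.j2
cat nginx.conf.j2
# -> server_name xyzzy.example.com;
# -> listen {{ nginx_port }};
```

Because neither layer knows about the other, every new feature has to be threaded through both substitution passes, which is the maintenance burden described above.
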
h2. Detailed list of features

Below is the list of functionality that each part of the installer provides. We aim to list everything that will likely need to be implemented in the new version of the tool. The features are listed in the order an operator currently handles them.

h3. Terraform deployment

As suggested in the book "Terraform: Up & Running":https://www.oreilly.com/library/view/terraform-up-and/9781098116736/, the terraform code is explicitly split into several sections to limit the "blast radius" of a potential mistake. The sections below are applied in the described order to build the complete cloud infrastructure needed to install Arvados.

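Splitting the code this way means a later section has to read an earlier one's outputs. A common way to wire that up (the section and output names below are assumptions for the sketch, not the actual repo layout) is a @terraform_remote_state@ data source:

```hcl
# Illustrative only: directory, state and output names are assumptions.
data "terraform_remote_state" "networking" {
  backend = "local"
  config = {
    path = "../networking/terraform.tfstate"
  }
}

resource "aws_instance" "service_node" {
  # (other required arguments like ami/instance_type omitted)
  subnet_id = data.terraform_remote_state.networking.outputs.subnet_id
}
```
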
h4. Networking layer

# Allows the operator to deploy new or use existing network resources, like the VPC, security group & subnets.
# Creates an S3 endpoint and route so that keepstore nodes have direct access.
# Sets up Internet and NAT gateways to give nodes outbound network access.
# Sets up the security group that allows communication between nodes in the VPC, as well as inbound SSH & HTTP(S) access.
# Manages Route53 domain names from a customizable list of hosts, with an optional split-horizon configuration.
# Creates credentials so that Let's Encrypt can work with Route53 from the service nodes.
# Optionally creates Elastic IP resources for user-facing hosts (controller, workbench).

h4. Data layer

# Creates the S3 bucket needed for Keep block storage.
# Creates keepstore & compute node roles with policies that grant S3 access to the created bucket.

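The kind of role policy involved can be sketched as follows (resource names, actions and the bucket reference are illustrative, not the actual code):

```hcl
# Sketch of a role policy granting S3 access to the Keep bucket.
# Names and the exact action list are assumptions.
resource "aws_iam_role_policy" "keepstore_s3" {
  name = "keepstore-s3-access"
  role = aws_iam_role.keepstore.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
      Resource = [aws_s3_bucket.keep.arn, "${aws_s3_bucket.keep.arn}/*"]
    }]
  })
}
```
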
h4. Service layer

# Optionally creates an RDS instance as the database service, with a sensible set of default values that can be customized.
# Creates an AWS secret to hold the TLS certificate private key's decryption password (for cases where the TLS certificate is provided by the user).
# Creates a policy and instance profiles so that every service node has access to the above secret.
# Creates a policy that gives compute nodes the permissions needed for EBS-autoscale filesystems to work.
# Creates the policy, role & instance profile the dispatcher node needs to do its work (launching EC2 instances, listing them, etc.)
# Creates the service nodes from the list of host names defined in the networking layer, assigning public IP addresses to the nodes that need them.

h3. Salt installer

The Terraform output data (VPC and subnet IDs, various credentials, Route53 domain name servers, etc.) gets used by the installer and provision scripts to install & configure the necessary software on each host.

_TODO: explain node role ordering here_

There's a "node-to-roles" mapping declared as part of the provision script's configuration; each role is described below.

h4. 'database' role

Can be overridden to use an external database service (like AWS RDS).

* Installs a PostgreSQL database server.
* Configures the PG user & database for Arvados, enabling the @pg_trgm@ extension.
* Configures PG server ACLs to allow access from localhost and the websocket, keepbalance and controller nodes.
* Installs Prometheus node and PG exporters.

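The ACL setup amounts to @pg_hba.conf@ entries along these lines (database/user names and addresses are placeholders, not the generated values):

```
# Illustrative pg_hba.conf entries; names and CIDRs are placeholders.
host    arvados    arvados    127.0.0.1/32    md5
host    arvados    arvados    10.1.1.11/32    md5   # websocket node
host    arvados    arvados    10.1.1.12/32    md5   # keepbalance node
host    arvados    arvados    10.1.1.13/32    md5   # controller node
```
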
h4. 'controller' role

* Installs @nginx@, @passenger@ and the PG client libraries.
** If in "balanced mode", only sets up HTTP nginx, as the balancer will act as the TLS termination proxy.
* From the @arvados.controller@ & @arvados.api@ formula states
** Installs rvm if required -- this won't be necessary anymore as we'll be using the distro-provided ruby packages.
** Installs @arvados-api-server@ & @arvados-controller@
** Runs the services and waits up to 2 minutes for the controller service to answer requests, so that Arvados resource creation works in later stages.
* If using an external database service, makes sure the @pg_trgm@ extension is enabled.
* Sets up @logrotate@ to rotate the RailsAPI's logs daily, keeping the last year of logs. This is needed because these files are not inside @/var/log/@.

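The logrotate rule described is of this general shape (the log path and option set are illustrative, not necessarily what the installer writes):

```
# Sketch of a daily rotation keeping one year of RailsAPI logs.
# Path and options are assumptions for illustration.
/var/www/arvados-api/current/log/*.log {
    daily
    rotate 365
    missingok
    compress
    copytruncate
}
```
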
h4. 'monitoring' role

* Installs & configures Nginx, Prometheus, Node exporter, Blackbox exporter and Grafana.
* Nginx configuration details
** Sets up basic authentication for the Prometheus website (as it doesn't seem to provide its own access controls)
** Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
* Prometheus configuration details
** Sets a configurable data retention period
** Correctly configures multiple controller nodes in balanced configurations.
* Grafana configuration details
** Sets up the admin user & password with @grafana-cli@
** Installs custom dashboards

h4. 'balancer' role

* Installs Nginx with a round-robin balanced upstream configuration.
* Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.

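A round-robin upstream in nginx looks roughly like this (host names, ports and TLS details are placeholders):

```
# Minimal sketch of a round-robin balanced upstream; names are placeholders.
upstream controllers {
    server controller1.internal:443;
    server controller2.internal:443;
}

server {
    listen 443 ssl;
    location / {
        proxy_pass https://controllers;
    }
}
```
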
h4. 'workbench/workbench2' role

* From @arvados.workbench2@ formula state
** Installs the @arvados-workbench2@ package
* Installs & configures nginx
* Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
* Uninstalls workbench1 -- this might not be needed in future versions.

h4. 'webshell' role

* Installs an nginx virtualhost that uses the shell node's @shellinabox@ service as the upstream.
* Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.

h4. 'keepproxy' role

* From @arvados.keepproxy@ formula state
** Installs @arvados-keepproxy@ and runs the service
* Installs & configures nginx
** Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.

h4. 'keepweb' role

* From @arvados.keepweb@ formula state
** Installs @keep-web@ and runs the service
* Installs & configures nginx
** Sets up nginx's "download" and "collections" virtualhosts
** Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.

h4. 'websocket' role

* From @arvados.websocket@ formula state
** Installs @arvados-ws@ and runs the service
* Installs & configures nginx
** Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.

h4. 'dispatcher' role

* From @arvados.dispatcher@ formula state
** Installs @arvados-dispatch-cloud@ and runs the service

h4. 'keepbalance' role

* From @arvados.keepbalance@ formula state
** Installs the @keep-balance@ package and runs the service

h4. 'keepstore' role

* From @arvados.keepstore@ formula state
** Installs @keepstore@ and runs the service

h4. 'shell' role

* Installs @docker@
* Installs @sudo@ and configures it to allow password-less access for "sudo" group members.
* From @arvados.shell@ formula state
** Installs @jq@, @arvados-login-sync@, @arvados-client@, @arvados-src@, @libpam-arvados-go@, @python3-arvados-fuse@, @python3-arvados-python-client@, @python3-arvados-cwl-runner@, @python3-crunchstat-summary@ and @shellinabox@
** Installs gems: @arvados-cli@, @arvados-login-sync@
** Creates a Virtual Machine record for the shell node and sets a scoped 'login' token for it.
* Queries the API server for the created virtual machine record matching its hostname, and configures cron to run @arvados-login-sync@ with the necessary credentials.

h4. Default role mapping

By default the installer deploys a 4-node cluster, with only 2 of the nodes needing public IP addresses (in the case of a publicly accessible cluster):

* Controller node: database & controller roles
* Workbench node: monitoring, workbench, workbench2, webshell, keepproxy, keepweb, websocket, dispatcher and keepbalance roles
* Keep0 node: keepstore role
* Shell node: shell role
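
The default mapping above can be pictured as the kind of declaration the provision configuration holds. The variable name, host keys and bash associative-array form below are assumptions for illustration; the real configuration file may differ.

```shell
#!/bin/bash
# Illustrative node-to-roles mapping; variable and host names are assumptions.
declare -A NODES
NODES[controller]="database,controller"
NODES[workbench]="monitoring,workbench,workbench2,webshell,keepproxy,keepweb,websocket,dispatcher,keepbalance"
NODES[keep0]="keepstore"
NODES[shell]="shell"

# An orchestrator can then iterate the mapping, deploying each host
# with its comma-separated role list.
for host in "${!NODES[@]}"; do
  echo "$host -> ${NODES[$host]}"
done
```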