Salt Installer Features » History » Version 17

Lucas Di Pentima, 07/10/2024 01:18 PM
Fixes provision.sh script description

{{>toc}}

h2. Introduction

To plan a new Arvados deployment tool, we need to list all the features that our current "salt installer" supports. In broad terms, what we call the "salt installer" consists of the following parts:

h3. The "arvados-formula" salt formula

Hosted at https://github.com/arvados/arvados-formula, this code is a group of "salt":https://saltproject.io states & pillars that take care of installing Arvados packages and setting up the services needed to run a cluster.

h3. The "provision.sh" script

The "provision script" is meant to enable the use of the @arvados-formula@ without needing a full-fledged master+minions salt installation. It installs salt in "masterless mode", and it is mostly useful for the single-host use case, where someone needs a complete Arvados cluster running on a single system for testing purposes.

h3. The Terraform code

For multi-host deployments in the cloud (AWS only at the moment), we wrote a set of Terraform files that manage networking, access control, data storage and service node resources, to speed up the initial setup and to make the deployment quick to modify afterwards. This code outputs a set of data that needs to be fed as input to the installer script described below.

h3. The "installer.sh" script

To make the above easy to use in a multi-host (e.g. production) setting, the installer script takes care of setting up a local git repository that holds the installer files, distributing those files to the hosts that will take part in a deployment, and orchestrating the execution of the provision script on each host, each with its particular configuration. This script relies heavily on search & replace operations using @sed@ to modify templates that in turn get applied to salt, so adding features gets complicated because there are two levels of templating to manage.

h2. Detailed list of features

Below is the list of functionality that every part of the installer provides. We aim to list everything that will likely need to be implemented in the new version of the tool. Features are listed in the order in which an operator currently handles them.

h3. Terraform deployment

As suggested in the book "Terraform: Up & Running":https://www.oreilly.com/library/view/terraform-up-and/9781098116736/, the Terraform code is explicitly split into several sections to limit the "blast radius" of a potential mistake. The sections below are applied in the order described to build the complete cloud infrastructure needed to install Arvados.

h4. Networking layer

# Allows the operator to deploy new or use existing network resources, like VPC, security group & subnets.
# Creates an S3 endpoint and route so that keepstore nodes have direct access.
# Sets up Internet and NAT gateways to give nodes outbound network access.
# Sets up the security group that allows communication between nodes in the VPC, and also inbound SSH & HTTP(S) access.
# Manages Route53 domain names from a customizable list of hosts, with an optional split-horizon configuration.
# Creates credentials for Let's Encrypt to be able to work with Route53 from the service nodes.
# Optionally creates Elastic IP resources for user-facing hosts (controller, workbench).

h5. Input parameters

_These are optional if not explicitly stated as required._
* AWS region (required)
* Cluster prefix (required)
* Domain name (required)
* "Private only" flag
* VPC, security group, public and private subnet IDs
* "Use RDS" flag
* RDS additional subnet ID
* List of user-facing service node names
* List of internal service node names
* Node name to private IP address map
* DNS alias records to node name map

h4. Data layer

# Creates the S3 bucket needed for Keep blocks storage.
# Creates keepstore & compute node roles with policies that grant S3 access to the created bucket.

h5. Input parameters

* "Use external DB" flag -- Not really used by anything, but including it for completeness' sake.

h4. Service layer

# Optionally creates an RDS instance as the database service with a sensible set of default values that can be customized.
# Creates an AWS secret to hold the TLS certificate private key's decryption password (for cases where the TLS certificate is provided by the user).
# Creates policy and instance profiles so that every service node has access to the above secret.
# Creates a policy that gives permissions to compute nodes so that EBS-autoscale filesystems work.
# Creates policy, role & instance profile so that the dispatcher node can do its work (launching EC2 instances, listing them, etc.)
# Creates the service nodes from the list of host names defined in the networking layer, assigning the public IP addresses to the nodes that need them.

h5. Input parameters

_These are optional if not explicitly stated as required._
* SSH public key file path: so that the installer script can log into the nodes without a password.
* Node name to instance type map
* Node name to volume size map
* "Use RDS" flag
* RDS username & password, instance type, version, allocated and max storage size, backup retention period, backup before deletion and final backup name parameters.
* TLS certificate private key decryption password secret name prefix
* Username for deployment
* Instance AMI

h3. Installer script

The @installer.sh@ script provides a handful of useful features, some of which will be needed in some form in the new tool, because they do not merely work around salt shortcomings but are necessary in some or all styles of deployment.

# *Selective deployment:* Sometimes doing a quick update on a single node is enough.
# *Deployment ordering:* When doing a full deploy run, some nodes need to be updated before others. The current ordering scheme is:
## Database node
## Controller node(s): To be able to perform rolling updates on balanced controller deployments, it removes the controller node about to be updated from the balancer's pool on each iteration.
## Balancer node (if it exists)
## Everything else
# *Optional use of a jump host:* In some situations, using a reachable jump host is needed for the installer to be able to connect to internal cluster nodes like the database, shell or even keepstore. This will depend on whether the installer is run from the same network as the cluster or from the outside.
# *Secret vs non-secret configuration handling:* Secret config data includes the cluster's default admin account password, database credentials, the dispatcher's private SSH key, etc. These need to be kept separate from the rest of the configuration parameters so that they can be placed on secure storage if needed.
# *General sanity checks:* The installer script does some checks before a deploy run, like:
## Node connectivity and SSH access.
## TLS certificate existence when not using Let's Encrypt.
# *Cluster Diagnostics test launching:* To confirm everything is working correctly, it runs @arvados-client diagnostics@ from the local host or the shell node.
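
The deployment ordering above can be sketched as a stable sort over a node-to-roles mapping. This is only an illustration of the idea, not the logic used by @installer.sh@: the node names, the @ROLE_PRIORITY@ table and the @deploy_order@ function are all hypothetical.

```python
# Deployment order described above: database first, then controllers,
# then the balancer (if any), then everything else.
ROLE_PRIORITY = {"database": 0, "controller": 1, "balancer": 2}
LAST = len(ROLE_PRIORITY)


def deploy_order(node_roles: dict[str, list[str]]) -> list[str]:
    """Return node names sorted by the earliest-deployed role they hold."""
    def priority(node: str) -> int:
        return min((ROLE_PRIORITY.get(r, LAST) for r in node_roles[node]),
                   default=LAST)
    # sorted() is stable, so nodes with equal priority keep their input order.
    return sorted(node_roles, key=priority)


# Hypothetical node names, for illustration only.
nodes = {
    "keep0": ["keepstore"],
    "balancer": ["balancer"],
    "controller2": ["controller"],
    "controller1": ["controller"],
    "db": ["database"],
    "shell": ["shell"],
}
print(deploy_order(nodes))
```

A node holding several roles is scheduled according to its highest-priority role, so a host that combines the @database@ and @controller@ roles would be deployed first.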

h4. Input parameters

h5. Config parameters

_These have default values if not explicitly stated as required._
* Cluster prefix & domain (required -- should be taken from terraform's output)
* Username for deployment
* Arvados admin's username
* Arvados admin's email (required)
* Use SSH jumphost
* AWS region (required -- should be taken from terraform's output)
* SSL mode
* "Use Let's Encrypt with Route53" flag
* Let's Encrypt AWS region (doesn't seem to be used, we should double check)
* Compute AMI ID (required)
* Compute nodes security group (required -- should be taken from terraform's output)
* Compute nodes subnet ID (required -- should be taken from terraform's output)
* Compute node AWS region
* Compute node username (the one that the dispatcher will use to control the node)
* Keep S3 AWS region
* Keep S3 Bucket name
* Keepstore IAM role
* "Is TLS privkey encrypted?" flag
* TLS privkey decryption password secret name
* TLS privkey decryption password secret AWS region
* Prometheus & Grafana UI access user name & email
* Prometheus data retention time
* Node-to-roles mapping
* Arvados services external TLS ports
* Cluster internal CIDR
* Arvados services internal IP addresses
* Arvados database name
* Arvados database user name
* External database service host name or IP address
* Database version
* Controller's max workers
* Controller's request queue size
* Controller's max gateway tunnels
* Arvados release (production/development)
* Arvados version (latest or specific)

h5. Secret parameters (all required)

* Arvados admin's password
* Prometheus & Grafana UI access user's password
* Arvados Blob signing key
* Arvados management token
* Arvados system root token
* Arvados anonymous user token
* Database password
* Let's Encrypt access key ID & secret
* Arvados dispatcher's SSH private key
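
Keeping secret parameters apart from the regular configuration, and checking that every secret is present, could look roughly like the sketch below. The key names in @SECRET_KEYS@ are hypothetical labels loosely based on the list above; they are not the variable names used by the installer's configuration files.

```python
# Keys treated as secrets (hypothetical names, for illustration only).
SECRET_KEYS = {
    "admin_password",
    "database_password",
    "blob_signing_key",
    "management_token",
    "system_root_token",
}


def split_config(params: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Split a flat parameter map into (regular, secret) maps."""
    regular = {k: v for k, v in params.items() if k not in SECRET_KEYS}
    secret = {k: v for k, v in params.items() if k in SECRET_KEYS}
    return regular, secret


def missing_secrets(secret: dict[str, str]) -> list[str]:
    """All secret parameters are required: list the ones absent or empty."""
    return sorted(k for k in SECRET_KEYS if not secret.get(k))


params = {
    "cluster_prefix": "xarv1",
    "admin_email": "admin@example.com",
    "admin_password": "s3cret",
    "database_password": "",
}
regular, secret = split_config(params)
print(sorted(regular))
print(missing_secrets(secret))
```

With this split, the @secret@ map can be stored and distributed separately from the @regular@ one, and a deploy run can be refused while @missing_secrets@ returns anything.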

h3. Salt installer

Terraform's output data (VPC and subnet IDs, various credentials, Route53 domain name servers, etc.) is used by the installer and provision scripts to install & configure the necessary software on each host.

h4. TLS certificate encrypted private key handling

On nodes whose services accept requests through nginx as a TLS proxy, if the TLS certificate private key is encrypted with a password, the installer sets up a series of scripts that read the configured AWS secret and feed its contents into a named pipe inside @/run/arvados/@, so that nginx can read the password at startup time.
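
The named pipe mechanism can be illustrated with a small POSIX-only sketch. Everything here is an assumption made for illustration: @fetch_secret@ is a stand-in for reading the AWS secret, the pipe file name is invented, and a temporary directory replaces @/run/arvados/@ so the example runs without privileges. The installed scripts themselves are not reproduced here.

```python
import os
import tempfile
import threading


def fetch_secret() -> str:
    # Stand-in for reading the AWS secret; hypothetical helper.
    return "example-passphrase"


def feed_pipe(path: str, secret: str) -> None:
    """Write the secret to the FIFO; open() blocks until a reader connects."""
    with open(path, "w") as fifo:
        fifo.write(secret + "\n")


# A temporary directory stands in for /run/arvados/.
workdir = tempfile.mkdtemp()
pipe_path = os.path.join(workdir, "ssl_key_pass")  # hypothetical file name
os.mkfifo(pipe_path, mode=0o600)

writer = threading.Thread(target=feed_pipe, args=(pipe_path, fetch_secret()))
writer.start()

# The reader plays the role of nginx reading the password at startup.
with open(pipe_path) as fifo:
    password = fifo.readline().rstrip("\n")

writer.join()
os.remove(pipe_path)
os.rmdir(workdir)
print(password == "example-passphrase")
```

The point of the pipe is that the password is handed directly from the writer to the reader and is never stored in a regular file on disk.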

h4. Service node roles

A "node-to-roles" mapping is declared as part of the installer script's configuration. Each role is described below.

h5. 'database' role

This role can be overridden to use an external database service (like AWS RDS).

* Installs a PostgreSQL database server.
* Configures PG user & database for Arvados, enabling the @pg_trgm@ extension.
* Configures PG server ACLs to allow access from localhost, websocket, keepbalance and controller nodes.
* Installs Prometheus node and PG exporters.

h5. 'controller' role

* Installs @nginx@, @passenger@ and PG client libraries.
** If in "balanced mode", it only sets up nginx for plain HTTP, as the balancer will act as the TLS termination proxy.
* From the @arvados.controller@ & @arvados.api@ formula states
** Installs rvm if required -- this won't be necessary anymore as we'll be using the distro-provided ruby packages.
** Installs @arvados-api-server@, @arvados-controller@
** Runs the services and waits up to 2 minutes for the controller service to answer requests, so that Arvados resource creation works in later stages.
* If using an external database service, it makes sure the @pg_trgm@ extension is enabled.
* Sets up @logrotate@ to rotate the RailsAPI's logs daily, keeping the last year of logs. This is needed because these files are not inside @/var/log/@.
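
The "wait up to 2 minutes for the controller to answer" step is a poll-with-deadline pattern. The sketch below shows that pattern in general terms. The @wait_until_ready@ function, its 5-second default interval and the @probe@ callback are assumptions; the formula's actual check is not reproduced here.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout: float = 120.0,
                     interval: float = 5.0) -> bool:
    """Poll probe() until it returns True or the timeout is about to expire."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        # Stop once another sleep would overrun the deadline.
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)


# Fake probe that succeeds on the third attempt, standing in for an HTTP
# request to the controller service.
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), interval=0.01))
```

In a real deployment tool, @probe@ would issue a request to the controller and return whether it answered successfully, and a @False@ result would abort the later stages that create Arvados resources.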

h5. 'monitoring' role

* Installs & configures Nginx, Prometheus, Node exporter, Blackbox exporter and Grafana.
* Nginx configuration details
** Sets up basic authentication for the Prometheus website (as it doesn't seem to provide its own access controls)
** Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
* Prometheus configuration details
** Sets a configurable data retention period
** Correctly configures multiple controller nodes in balanced configurations.
* Grafana configuration details
** Sets up admin user & password with @grafana-cli@
** Installs custom dashboards

h5. 'balancer' role

* Installs Nginx with a round-robin balanced upstream configuration.
* Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
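
To make the balancer's behaviour concrete, the sketch below models round-robin distribution over a pool of controllers, together with the pool changes made during the rolling controller update described in the installer script section. It is a toy model with hypothetical function and node names. In the real deployment, nginx does the balancing and the installer edits the pool.

```python
from itertools import cycle


def round_robin(upstreams: list[str], requests: int) -> list[str]:
    """Assign each request to the next upstream in turn."""
    if not upstreams:
        raise ValueError("the upstream pool is empty")
    pool = cycle(upstreams)
    return [next(pool) for _ in range(requests)]


def rolling_update_pools(controllers: list[str]) -> list[tuple[str, list[str]]]:
    """For each controller being updated, list the ones left serving traffic."""
    return [(node, [c for c in controllers if c != node]) for node in controllers]


controllers = ["controller1", "controller2"]  # hypothetical node names
print(round_robin(controllers, 5))
for updating, serving in rolling_update_pools(controllers):
    print(f"updating {updating}, pool is {serving}")
```

With only one controller the remaining pool is empty during its update, which is why rolling updates only make sense on balanced deployments with several controllers.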

h5. 'workbench/workbench2' role

* From @arvados.workbench2@ formula state
** Installs @arvados-workbench2@ package
* Installs & configures nginx
* Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.
* Uninstalls workbench1 -- this might not be needed in future versions.

h5. 'webshell' role

* Installs an nginx virtualhost that uses the shell node's @shellinabox@ service as the upstream.
* Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.

h5. 'keepproxy' role

* From @arvados.keepproxy@ formula state
** Installs @arvados-keepproxy@ and runs the service
* Installs & configures nginx
** Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.

h5. 'keepweb' role

* From @arvados.keepweb@ formula state
** Installs @keep-web@ and runs the service
* Installs & configures nginx
** Sets up nginx's "download" and "collections" virtualhosts
** Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.

h5. 'websocket' role

* From @arvados.websocket@ formula state
** Installs @arvados-ws@ and runs the service
* Installs & configures nginx
** Sets up custom TLS certs or installs Let's Encrypt to manage them, depending on configuration.

h5. 'dispatcher' role

* From @arvados.dispatcher@ formula state
** Installs @arvados-dispatch-cloud@ and runs the service

h5. 'keepbalance' role

* From @arvados.keepbalance@ formula state
** Installs the @keep-balance@ package and runs the service

h5. 'keepstore' role

* From @arvados.keepstore@ formula state
** Installs @keepstore@ and runs the service

h5. 'shell' role

* Installs @docker@
* Installs @sudo@ and configures it to allow password-less access for "sudo" group members.
* From @arvados.shell@ formula state
** Installs @jq@, @arvados-login-sync@, @arvados-client@, @arvados-src@, @libpam-arvados-go@, @python3-arvados-fuse@, @python3-arvados-python-client@, @python3-arvados-cwl-runner@, @python3-crunchstat-summary@ and @shellinabox@
** Installs gems: @arvados-cli@, @arvados-login-sync@
** Creates a Virtual Machine record for the shell node and sets a scoped 'login' token for it.
* Queries the API server for the created virtual machine with the same name as its hostname, and configures cron to run @arvados-login-sync@ with the necessary credentials.

h5. Default role mapping

By default, the installer deploys a 4-node cluster, of which only 2 nodes need public IP addresses (in the case of a publicly accessible cluster):
* Controller node: @database@ & @controller@ roles
* Workbench node: @monitoring@, @workbench@, @workbench2@, @webshell@, @keepproxy@, @keepweb@, @websocket@, @dispatcher@ and @keepbalance@ roles
* Keep0 node: @keepstore@ role
* Shell node: @shell@ role
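
Written as data, the default mapping above could look like the following sketch. The dictionary keys are illustrative node names derived from the list, not the identifiers used in the installer's configuration, and the helper functions are hypothetical.

```python
from collections import Counter

# Default node-to-roles mapping as listed above (illustrative node names).
DEFAULT_ROLES = {
    "controller": ["database", "controller"],
    "workbench": ["monitoring", "workbench", "workbench2", "webshell",
                  "keepproxy", "keepweb", "websocket", "dispatcher",
                  "keepbalance"],
    "keep0": ["keepstore"],
    "shell": ["shell"],
}


def role_counts(mapping: dict[str, list[str]]) -> Counter:
    """Count how many nodes each role is assigned to."""
    return Counter(role for roles in mapping.values() for role in roles)


def nodes_with_role(mapping: dict[str, list[str]], role: str) -> list[str]:
    """Return the nodes that hold the given role."""
    return [node for node, roles in mapping.items() if role in roles]


counts = role_counts(DEFAULT_ROLES)
print(len(DEFAULT_ROLES), "nodes,", len(counts), "distinct roles")
print(nodes_with_role(DEFAULT_ROLES, "keepstore"))
```

A check like @role_counts@ makes it easy to spot a role that was left unassigned or assigned more often than intended when an operator customizes the mapping.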