Idea #16379: SaltStack install integrates with prometheus/grafana - Arvados

Adds "prometheus" hostname to terraform.
Configures nginx vhost for prometheus on the workbench node.
Installs prometheus using the corresponding salt formula.
Installs postgresql exporter and mtail to monitor the database.
Installs the node exporter to monitor every node's resources.
Sets basic authentication for the prometheus UI website, configurable from local.params with sensible defaults.

The grafana branch in on the way.

Actions

Copy link

#30

Updated by Lucas Di Pentima about 2 years ago

Updates at 4ea265c - branch 16379-installer-grafana

Moves prometheus to its own hostname.
Adds grafana, using the local prometheus as a data source. Uses the same credentials as prometheus for its admin user.
Adds default dashboards (arvados, node exporter, pg exporter) assigned to prometheus data source.

Actions

Copy link

#31

Updated by Peter Amstutz about 2 years ago

Target version changed from Development 2023-03-15 sprint to Development 2023-03-29 Sprint

Actions

Copy link

#32

Updated by Lucas Di Pentima about 2 years ago

Detected some issues when trying a cluster from scratch. I'm working on fixing them.

Actions

Copy link

#33

Updated by Lucas Di Pentima about 2 years ago

Updates at 58006c8a7 - branch 16379-installer-grafana

Fixes the prometheus htpassword file definition to depend on nginx
Fixes keepweb lack of nginx service being applied.
Sets grafana's admin password using the grafana-cli to make sure it's always synced.

Actions

Copy link

#34

Updated by Lucas Di Pentima about 2 years ago

Updates at 4e26d30a7 - branch 16379-installer-prometheus-grafana

Rebased a copy of the 16379-installer-grafana branch (which branched off 16379-installer-prometheus) to 20270-salt-installer-less-instances from #20270 so that it can be manually tested with less AWS resource requirements.
Updates prometheus config to not include keep1 & keepproxy nodes.

Actions

Copy link

#35

Updated by Lucas Di Pentima about 2 years ago

Updates at 247fd765b - branch 16379-installer-prometheus-grafana

Fixes missing monitoring role assignment on the workbench node, that got lost in the rebase.

Actions

Copy link

#36

Updated by Peter Amstutz about 2 years ago

Something's not working with the grafana formula:

----------
          ID: grafana-package-install-pkg-installed
    Function: pkg.installed
        Name: grafana
      Result: False
     Comment: Problem encountered installing package(s). Additional info follows:

              errors:
                  - Running scope as unit: run-rc6b15a6a20094c4981c81382cac47d4d.scope
                    E: Unable to locate package grafana
     Started: 21:48:48.780122
    Duration: 1686.505 ms
     Changes:   
----------
          ID: grafana-service-running-service-running
    Function: service.running
        Name: grafana-server
      Result: False
     Comment: Recursive requisite found
     Changes:

admin@workbench:/$ ls /etc/apt/sources.list.d/
arvados.list  phusionpassenger-official-bullseye.list  salt.list

hmm.

Actions

Copy link

#37

Updated by Peter Amstutz about 2 years ago

Target version changed from Development 2023-03-29 Sprint to Development 2023-04-12 sprint

Actions

Copy link

#38

Updated by Peter Amstutz about 2 years ago

Got past the install issue, had to add the repo to the grafana config.

Now it is failing in post install when it runs "grafana-cli admin reset-admin-password"

                 logger=settings t=2023-03-29T14:02:37.839614504Z level=info msg="Starting Grafana" version= commit= branch= compiled=1970-01-01T00:00:00Z
                  logger=settings t=2023-03-29T14:02:37.839864043Z level=warn msg="\"sentry\" frontend logging provider is deprecated and will be removed in the next major version. Use \"grafana\" provider instead." 
                  logger=settings t=2023-03-29T14:02:37.839911935Z level=info msg="Config loaded from" file=/usr/share/grafana/conf/defaults.ini
                  logger=settings t=2023-03-29T14:02:37.839928606Z level=info msg="Config loaded from" file=/etc/grafana/grafana.ini
                  logger=settings t=2023-03-29T14:02:37.839951907Z level=info msg="Config overridden from command line" arg="default.paths.data=/var/lib/grafana" 
                  logger=settings t=2023-03-29T14:02:37.839971827Z level=info msg="Config overridden from command line" arg="default.paths.logs=/var/log/grafana" 
                  logger=settings t=2023-03-29T14:02:37.839985998Z level=info msg="Config overridden from command line" arg="default.paths.plugins=/var/lib/grafana/plugins" 
                  logger=settings t=2023-03-29T14:02:37.840018929Z level=info msg="Config overridden from command line" arg="default.paths.provisioning=/etc/grafana/provisioning" 
                  logger=settings t=2023-03-29T14:02:37.84003455Z level=info msg="Path Home" path=/usr/share/grafana
                  logger=settings t=2023-03-29T14:02:37.840068241Z level=info msg="Path Data" path=/var/lib/grafana
                  logger=settings t=2023-03-29T14:02:37.840083492Z level=info msg="Path Logs" path=/var/log/grafana
                  logger=settings t=2023-03-29T14:02:37.840115943Z level=info msg="Path Plugins" path=/var/lib/grafana/plugins
                  logger=settings t=2023-03-29T14:02:37.840130503Z level=info msg="Path Provisioning" path=/etc/grafana/provisioning
                  logger=settings t=2023-03-29T14:02:37.840160634Z level=info msg="App mode production" 
                  logger=sqlstore t=2023-03-29T14:02:37.840252308Z level=info msg="Connecting to DB" dbtype=sqlite3
                  logger=migrator t=2023-03-29T14:02:40.15632504Z level=info msg="Starting DB migrations" 
                  logger=migrator t=2023-03-29T14:02:40.334212586Z level=info msg="Executing migration" id="clear migration entry \"remove unified alerting data\"" 
                  logger=migrator t=2023-03-29T14:02:40.965791527Z level=info msg="Executing migration" id="Add column org_id to builtin_role table" 
                  logger=migrator t=2023-03-29T14:02:40.966004875Z level=error msg="Executing migration failed" id="Add column org_id to builtin_role table" error="duplicate column name: org_id" 
                  logger=migrator t=2023-03-29T14:02:40.966178592Z level=error msg="Exec failed" error="duplicate column name: org_id" sql="alter table `builtin_role` ADD COLUMN `org_id` INTEGER NOT NULL DEFAULT 0 " 
                  Error: ✗ failed to initialize runner: migration failed (id = Add column org_id to builtin_role table): duplicate column name: org_id

I did some googling, there's a kind of similar issue here:

https://community.grafana.com/t/error-on-starting-grafana-with-an-empty-database/8614

I'm wondering if we need to do something to initialize the database first.

Actions

Copy link

#39

Updated by Peter Amstutz about 2 years ago

huh, the MONITORING_* variables were missing from my local.params, that might have had something to do with it.

Actions

Copy link

#40

Updated by Peter Amstutz about 2 years ago

I have it working. I still need to tear down the cluster and install from scratch to confirm that it works without any intervention.

However, the "keepstore bandwidth" and "keepstore bytes by type" graphs are still showing "no data" even after the cluster has been up for at least 12 hours.

Actions

Copy link

#41

Updated by Peter Amstutz about 2 years ago

Figured it out. I did a new install from scratch and it looks like the monitoring is now 100% working out of the box. Very exciting.

Actions

Copy link

#42

Updated by Peter Amstutz about 2 years ago

16379-installer-prometheus-grafana @ 41935b6bc7e8bec90e284082c169479e8f02e4cd

Actions

Copy link

#43

Updated by Peter Amstutz about 2 years ago

Related to Feature #20285: System status panel that embeds grafana added

Actions

Copy link

#44

Updated by Peter Amstutz about 2 years ago

So, I think this is working pretty well, the only thing that is mildly annoying is that you have to go hunting for the dashboards instead of them being available on the front page by default. I've added documentation explaining how to find them, but having the installer perform some API calls to "star" them automatically would be even better.

My other thought is it would be great to have a link in the UI, I added #20285

Actions

Copy link

#45

Updated by Lucas Di Pentima about 2 years ago

I have a couple of comments on the new updates:

Would you consider adding **/terraform.tfstate* to .gitignore? I personally don't feel comfortable committing terraform state to a repository that will get distributed to many hosts (shell nodes included).
The terraform-destroy option will fail if the cluster has any data on keep. To avoid confusing error messages, we can either show warning message, or maybe making terraform skip the S3 bucket, by removing that resource from the state before running destroy.

Actions

Copy link

#46

Updated by Peter Amstutz about 2 years ago

Lucas Di Pentima wrote in #note-45:

I have a couple of comments on the new updates:

Would you consider adding **/terraform.tfstate* to .gitignore? I personally don't feel comfortable committing terraform state to a repository that will get distributed to many hosts (shell nodes included).

So, I wanted to make sure tfstate gets added to git because that's exactly the sort of thing you want kept in a live configuration management repo.

Assuming the cloud credentials are not saved (it doesn't look like they are) I don't think there is anything in .tfstate that is any more sensitive than all the other secrets that get distributed (the only other credentials I could find are ones specifically used by Let's Encrypt).

Is the idea simply to avoid having too much information available in case someone cracks the admin account? All the Arvados credentials are already kept in the config file.

One thought I had is to checkout to /tmp and rm -rf the staging repo and checkout after each run so it doesn't hang around.

Longer term, these things could be kept in a secrets management system, but I think we would have to add some features to the config system to retrieve secrets -- but if the node is automatically authorized to fetch secrets it's not clear how different that is from having them on disk?

The terraform-destroy option will fail if the cluster has any data on keep. To avoid confusing error messages, we can either show warning message, or maybe making terraform skip the S3 bucket, by removing that resource from the state before running destroy.

I think you do get back an error that mentions that you can't delete buckets unless they are empty. I added terraform-destroy because it was useful for testing. I deleted the bucket contents manually from the AWS console. That wouldn't work if there were more than a 1000 blocks, though. I don't know if there is an API (or AWS CLI command) to mass delete S3 objects or if you have to go through them one by one.

Actions

Copy link

#47

Updated by Lucas Di Pentima about 2 years ago

Peter Amstutz wrote in #note-46:

So, I wanted to make sure tfstate gets added to git because that's exactly the sort of thing you want kept in a live configuration management repo.

Yes, but what we have right now is a mix of things. I didn't want to make an ugly situation a bit worse, but if we'll fix this properly in the future, I'm OK with adding more secret data to what we already distribute. In an ideal situation, we would not use "masterless salt". We would just use salt with a central server that distributes just the data every node needs instead of everything.

I think you do get back an error that mentions that you can't delete buckets unless they are empty. I added terraform-destroy because it was useful for testing. I deleted the bucket contents manually from the AWS console. That wouldn't work if there were more than a 1000 blocks, though. I don't know if there is an API (or AWS CLI command) to mass delete S3 objects or if you have to go through them one by one.

Ok then, I'll merge the branch as is.

Actions

Copy link

#48

Updated by Lucas Di Pentima about 2 years ago

Status changed from In Progress to Resolved

Applied in changeset arvados|98969e546c909ac2ee4256934b5339080598d252.

Related to Arvados - Feature #16213: Default metrics dashboard for grafana	Resolved			Actions
Related to Arvados Epics - Idea #16428: Metrics dashboard	Resolved	02/01/2023	04/30/2023	Actions
Related to Arvados - Feature #20285: System status panel that embeds grafana	New			Actions

Project

General

Profile

Arvados

Custom queries

Watchers (1)

Idea #16379

SaltStack install integrates with prometheus/grafana

Updated by Peter Amstutz almost 5 years ago

Updated by Peter Amstutz almost 5 years ago

Updated by Peter Amstutz almost 5 years ago

Updated by Lucas Di Pentima over 2 years ago

Updated by Lucas Di Pentima over 2 years ago

Updated by Lucas Di Pentima over 2 years ago

Updated by Peter Amstutz over 2 years ago

Updated by Peter Amstutz over 2 years ago

Updated by Lucas Di Pentima over 2 years ago

Updated by Lucas Di Pentima over 2 years ago

Updated by Peter Amstutz over 2 years ago

Updated by Peter Amstutz over 2 years ago

Updated by Peter Amstutz over 2 years ago

Updated by Peter Amstutz over 2 years ago

Updated by Peter Amstutz over 2 years ago

Updated by Lucas Di Pentima over 2 years ago

Updated by Peter Amstutz over 2 years ago

Updated by Peter Amstutz over 2 years ago

Updated by Peter Amstutz over 2 years ago

Updated by Peter Amstutz over 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Peter Amstutz about 2 years ago

Updated by Lucas Di Pentima about 2 years ago

Updated by Lucas Di Pentima about 2 years ago