Idea #16379 (closed): SaltStack install integrates with prometheus/grafana

Added by Peter Amstutz almost 4 years ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assigned To:
Category: Deployment
Story points: 4.0
Release relationship: Auto

Description

When using SaltStack for configuration management, the admin can choose to use the SaltStack formulas for prometheus and grafana. The Arvados formula integrates with them by providing a prometheus config file and adding the Arvados dashboard (#16213) to grafana.


Subtasks 1 (0 open, 1 closed)

Task #19509: Review 16379-installer-prometheus-grafana (Resolved, Peter Amstutz, 03/03/2023)

Related issues

Related to Arvados - Feature #16213: Default metrics dashboard for grafana (Resolved)
Related to Arvados Epics - Idea #16428: Metrics dashboard (Resolved, 02/01/2023 to 04/30/2023)
Related to Arvados - Feature #20285: System status panel that embeds grafana (New)
Actions #1

Updated by Peter Amstutz almost 4 years ago

  • Description updated (diff)
Actions #2

Updated by Peter Amstutz almost 4 years ago

  • Related to Feature #16213: Default metrics dashboard for grafana added
Actions #4

Updated by Peter Amstutz almost 4 years ago

Actions #6

Updated by Lucas Di Pentima over 1 year ago

  • Target version set to 2022-09-14 sprint
  • Assigned To set to Lucas Di Pentima
Actions #7

Updated by Lucas Di Pentima over 1 year ago

  • Status changed from New to In Progress
Actions #8

Updated by Lucas Di Pentima over 1 year ago

  • Status changed from In Progress to New
Actions #9

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Actions #10

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Actions #11

Updated by Lucas Di Pentima over 1 year ago

  • Target version changed from 2022-10-12 sprint to 2022-10-26 sprint
Actions #12

Updated by Lucas Di Pentima over 1 year ago

  • Target version changed from 2022-10-26 sprint to 2022-11-09 sprint
Actions #13

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-11-09 sprint to 2022-11-23 sprint
Actions #14

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-11-23 sprint to 2022-12-07 Sprint
Actions #15

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-12-07 Sprint to 2022-12-21 Sprint
Actions #16

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-12-21 Sprint to 2023-01-18 sprint
Actions #17

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2023-01-18 sprint to 2022-12-21 Sprint
Actions #18

Updated by Lucas Di Pentima over 1 year ago

  • Story points set to 4.0
Actions #19

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2022-12-21 Sprint to 2023-01-18 sprint
Actions #20

Updated by Peter Amstutz over 1 year ago

  • Target version changed from 2023-01-18 sprint to 2023-02-01 sprint
Actions #21

Updated by Peter Amstutz over 1 year ago

  • Release set to 59
  • Target version deleted (2023-02-01 sprint)
Actions #22

Updated by Peter Amstutz over 1 year ago

  • Target version set to To be scheduled
Actions #23

Updated by Peter Amstutz about 1 year ago

  • Target version changed from To be scheduled to 2023-02-01 sprint
Actions #24

Updated by Lucas Di Pentima about 1 year ago

  • Status changed from New to In Progress
Actions #25

Updated by Lucas Di Pentima about 1 year ago

  • Target version changed from 2023-02-01 sprint to 2023-02-15 sprint
Actions #26

Updated by Peter Amstutz about 1 year ago

  • Release set to 57
Actions #27

Updated by Lucas Di Pentima about 1 year ago

  • Target version changed from 2023-02-15 sprint to 2023-03-01 sprint
Actions #28

Updated by Lucas Di Pentima about 1 year ago

  • Target version changed from 2023-03-01 sprint to Development 2023-03-15 sprint
Actions #29

Updated by Lucas Di Pentima about 1 year ago

Updates at a8ceae6 - branch 16379-installer-prometheus

  • Adds "prometheus" hostname to terraform.
  • Configures nginx vhost for prometheus on the workbench node.
  • Installs prometheus using the corresponding salt formula.
  • Installs postgresql exporter and mtail to monitor the database.
  • Installs the node exporter to monitor every node's resources.
  • Sets basic authentication for the prometheus UI website, configurable from local.params with sensible defaults.
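
A quick way to exercise that basic-auth layer once the node is up (the hostname, htpasswd path and credentials below are placeholders, not the exact names the installer uses):

    # Placeholder credentials; local.params is the real source of truth.
    MON_USER=monitoring
    MON_PASS='choose-a-strong-password'
    # Create the htpasswd file the prometheus nginx vhost points at (path is illustrative).
    sudo htpasswd -bc /etc/nginx/htpasswd.prometheus "$MON_USER" "$MON_PASS"
    # Prometheus' health endpoint should only answer 200 with the right credentials.
    curl -fsS -u "$MON_USER:$MON_PASS" https://prometheus.example.com/-/healthy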

The grafana branch is on the way.

Actions #30

Updated by Lucas Di Pentima about 1 year ago

Updates at 4ea265c - branch 16379-installer-grafana

  • Moves prometheus to its own hostname.
  • Adds grafana, using the local prometheus as a data source. Uses the same credentials as prometheus for its admin user.
  • Adds default dashboards (arvados, node exporter, pg exporter) assigned to prometheus data source.
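
A quick way to sanity-check that wiring via Grafana's HTTP API (hostname and credentials are placeholders):

    GRAFANA_URL=https://grafana.example.com
    ADMIN_USER=admin
    ADMIN_PASS='placeholder-password'
    # The local prometheus should be listed as a data source...
    curl -fsS -u "$ADMIN_USER:$ADMIN_PASS" "$GRAFANA_URL/api/datasources"
    # ...and the default dashboards should be findable by title.
    curl -fsS -u "$ADMIN_USER:$ADMIN_PASS" "$GRAFANA_URL/api/search?query=arvados"
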
Actions #31

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-03-15 sprint to Development 2023-03-29 Sprint
Actions #32

Updated by Lucas Di Pentima about 1 year ago

Detected some issues when trying a cluster deployment from scratch; I'm working on fixing them.

Actions #33

Updated by Lucas Di Pentima about 1 year ago

Updates at 58006c8a7 - branch 16379-installer-grafana

  • Fixes the prometheus htpasswd file definition to depend on nginx.
  • Fixes the nginx service not being applied on the keepweb node.
  • Sets grafana's admin password using grafana-cli to make sure it's always in sync.
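
For reference, the invocation is roughly the following (the password value is a placeholder; the installer feeds in the configured one):

    # Run on the grafana node; resets the built-in admin user's password in place.
    sudo grafana-cli admin reset-admin-password 'placeholder-password'
    # If grafana lives in a non-default location, the paths can be passed explicitly:
    sudo grafana-cli --homepath /usr/share/grafana --config /etc/grafana/grafana.ini \
        admin reset-admin-password 'placeholder-password'
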
Actions #34

Updated by Lucas Di Pentima about 1 year ago

Updates at 4e26d30a7 - branch 16379-installer-prometheus-grafana

  • Rebased a copy of the 16379-installer-grafana branch (which branched off 16379-installer-prometheus) onto 20270-salt-installer-less-instances from #20270 so that it can be manually tested with fewer AWS resources.
  • Updates prometheus config to not include keep1 & keepproxy nodes.
Actions #35

Updated by Lucas Di Pentima about 1 year ago

Updates at 247fd765b - branch 16379-installer-prometheus-grafana

  • Fixes the missing monitoring role assignment on the workbench node, which got lost in the rebase.
Actions #36

Updated by Peter Amstutz about 1 year ago

Something's not working with the grafana formula:

----------
          ID: grafana-package-install-pkg-installed
    Function: pkg.installed
        Name: grafana
      Result: False
     Comment: Problem encountered installing package(s). Additional info follows:

              errors:
                  - Running scope as unit: run-rc6b15a6a20094c4981c81382cac47d4d.scope
                    E: Unable to locate package grafana
     Started: 21:48:48.780122
    Duration: 1686.505 ms
     Changes:   
----------
          ID: grafana-service-running-service-running
    Function: service.running
        Name: grafana-server
      Result: False
     Comment: Recursive requisite found
     Changes:   
admin@workbench:/$ ls /etc/apt/sources.list.d/
arvados.list  phusionpassenger-official-bullseye.list  salt.list

hmm.

Actions #37

Updated by Peter Amstutz about 1 year ago

  • Target version changed from Development 2023-03-29 Sprint to Development 2023-04-12 sprint
Actions #38

Updated by Peter Amstutz about 1 year ago

Got past the install issue, had to add the repo to the grafana config.
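
For the record, what was missing was essentially the Grafana apt repository on the node; done by hand it would be roughly the following (the key/repo URLs are the ones Grafana has documented and have moved around over time; the formula does this through its config rather than manually):

    # Register Grafana's signing key and OSS package repo, then install.
    curl -fsSL https://packages.grafana.com/gpg.key | \
        sudo gpg --dearmor -o /usr/share/keyrings/grafana.gpg
    echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | \
        sudo tee /etc/apt/sources.list.d/grafana.list
    sudo apt-get update && sudo apt-get install -y grafana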

Now it is failing in the post-install step when it runs "grafana-cli admin reset-admin-password":

                 logger=settings t=2023-03-29T14:02:37.839614504Z level=info msg="Starting Grafana" version= commit= branch= compiled=1970-01-01T00:00:00Z
                  logger=settings t=2023-03-29T14:02:37.839864043Z level=warn msg="\"sentry\" frontend logging provider is deprecated and will be removed in the next major version. Use \"grafana\" provider instead." 
                  logger=settings t=2023-03-29T14:02:37.839911935Z level=info msg="Config loaded from" file=/usr/share/grafana/conf/defaults.ini
                  logger=settings t=2023-03-29T14:02:37.839928606Z level=info msg="Config loaded from" file=/etc/grafana/grafana.ini
                  logger=settings t=2023-03-29T14:02:37.839951907Z level=info msg="Config overridden from command line" arg="default.paths.data=/var/lib/grafana" 
                  logger=settings t=2023-03-29T14:02:37.839971827Z level=info msg="Config overridden from command line" arg="default.paths.logs=/var/log/grafana" 
                  logger=settings t=2023-03-29T14:02:37.839985998Z level=info msg="Config overridden from command line" arg="default.paths.plugins=/var/lib/grafana/plugins" 
                  logger=settings t=2023-03-29T14:02:37.840018929Z level=info msg="Config overridden from command line" arg="default.paths.provisioning=/etc/grafana/provisioning" 
                  logger=settings t=2023-03-29T14:02:37.84003455Z level=info msg="Path Home" path=/usr/share/grafana
                  logger=settings t=2023-03-29T14:02:37.840068241Z level=info msg="Path Data" path=/var/lib/grafana
                  logger=settings t=2023-03-29T14:02:37.840083492Z level=info msg="Path Logs" path=/var/log/grafana
                  logger=settings t=2023-03-29T14:02:37.840115943Z level=info msg="Path Plugins" path=/var/lib/grafana/plugins
                  logger=settings t=2023-03-29T14:02:37.840130503Z level=info msg="Path Provisioning" path=/etc/grafana/provisioning
                  logger=settings t=2023-03-29T14:02:37.840160634Z level=info msg="App mode production" 
                  logger=sqlstore t=2023-03-29T14:02:37.840252308Z level=info msg="Connecting to DB" dbtype=sqlite3
                  logger=migrator t=2023-03-29T14:02:40.15632504Z level=info msg="Starting DB migrations" 
                  logger=migrator t=2023-03-29T14:02:40.334212586Z level=info msg="Executing migration" id="clear migration entry \"remove unified alerting data\"" 
                  logger=migrator t=2023-03-29T14:02:40.965791527Z level=info msg="Executing migration" id="Add column org_id to builtin_role table" 
                  logger=migrator t=2023-03-29T14:02:40.966004875Z level=error msg="Executing migration failed" id="Add column org_id to builtin_role table" error="duplicate column name: org_id" 
                  logger=migrator t=2023-03-29T14:02:40.966178592Z level=error msg="Exec failed" error="duplicate column name: org_id" sql="alter table `builtin_role` ADD COLUMN `org_id` INTEGER NOT NULL DEFAULT 0 " 
                  Error: ✗ failed to initialize runner: migration failed (id = Add column org_id to builtin_role table): duplicate column name: org_id

I did some googling, there's a kind of similar issue here:

https://community.grafana.com/t/error-on-starting-grafana-with-an-empty-database/8614

I'm wondering if we need to do something to initialize the database first.

Actions #39

Updated by Peter Amstutz about 1 year ago

Huh, the MONITORING_* variables were missing from my local.params; that might have had something to do with it.

Actions #40

Updated by Peter Amstutz about 1 year ago

I have it working. I still need to tear down the cluster and install from scratch to confirm that it works without any intervention.

However, the "keepstore bandwidth" and "keepstore bytes by type" graphs are still showing "no data" even after the cluster has been up for at least 12 hours.

Actions #41

Updated by Peter Amstutz about 1 year ago

Figured it out. I did a new install from scratch and it looks like the monitoring is now 100% working out of the box. Very exciting.

Actions #42

Updated by Peter Amstutz about 1 year ago

16379-installer-prometheus-grafana @ 41935b6bc7e8bec90e284082c169479e8f02e4cd

Actions #43

Updated by Peter Amstutz about 1 year ago

  • Related to Feature #20285: System status panel that embeds grafana added
Actions #44

Updated by Peter Amstutz about 1 year ago

So, I think this is working pretty well. The only thing that is mildly annoying is that you have to go hunting for the dashboards instead of having them available on the front page by default. I've added documentation explaining how to find them, but having the installer perform some API calls to "star" them automatically would be even better.
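
If we automate that later, Grafana's HTTP API has a "star dashboard" call; roughly (URL, credentials and search query are placeholders, and the star applies only to the authenticated user):

    GRAFANA_URL=https://grafana.example.com
    ADMIN_USER=admin
    ADMIN_PASS='placeholder-password'
    # Look up the dashboard's numeric id by title, then star it for that user.
    DASH_ID=$(curl -fsS -u "$ADMIN_USER:$ADMIN_PASS" \
        "$GRAFANA_URL/api/search?query=Arvados" | jq -r '.[0].id')
    curl -fsS -X POST -u "$ADMIN_USER:$ADMIN_PASS" \
        "$GRAFANA_URL/api/user/stars/dashboard/$DASH_ID"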

My other thought is that it would be great to have a link in the UI; I added #20285 for that.

Actions #45

Updated by Lucas Di Pentima about 1 year ago

I have a couple of comments on the new updates:

  • Would you consider adding **/terraform.tfstate* to .gitignore? I personally don't feel comfortable committing terraform state to a repository that will get distributed to many hosts (shell nodes included).
  • The terraform-destroy option will fail if the cluster has any data in Keep. To avoid confusing error messages, we could either show a warning message, or maybe make terraform skip the S3 bucket by removing that resource from the state before running destroy.
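
Roughly what I have in mind for the skip-the-bucket option (the resource address below is a guess; the real one is whatever terraform state list reports):

    # Find the bucket's address in the state, then drop it so destroy leaves it alone.
    terraform state list | grep aws_s3_bucket
    terraform state rm aws_s3_bucket.keep_volumes   # hypothetical address
    terraform destroy
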
Actions #46

Updated by Peter Amstutz about 1 year ago

Lucas Di Pentima wrote in #note-45:

I have a couple of comments on the new updates:

  • Would you consider adding **/terraform.tfstate* to .gitignore? I personally don't feel comfortable committing terraform state to a repository that will get distributed to many hosts (shell nodes included).

So, I wanted to make sure tfstate gets added to git because that's exactly the sort of thing you want kept in a live configuration management repo.

Assuming the cloud credentials are not saved (it doesn't look like they are) I don't think there is anything in .tfstate that is any more sensitive than all the other secrets that get distributed (the only other credentials I could find are ones specifically used by Let's Encrypt).

Is the idea simply to avoid having too much information available in case someone cracks the admin account? All the Arvados credentials are already kept in the config file.

One thought I had is to check out to /tmp and rm -rf the staging repo checkout after each run so it doesn't hang around.

Longer term, these things could be kept in a secrets management system, but I think we would have to add some features to the config system to retrieve secrets. Then again, if the node is automatically authorized to fetch secrets, it's not clear how different that is from having them on disk.

  • The terraform-destroy option will fail if the cluster has any data in Keep. To avoid confusing error messages, we could either show a warning message, or maybe make terraform skip the S3 bucket by removing that resource from the state before running destroy.

I think you do get back an error that mentions that you can't delete buckets unless they are empty. I added terraform-destroy because it was useful for testing. I deleted the bucket contents manually from the AWS console. That wouldn't work if there were more than 1,000 blocks, though. I don't know if there is an API (or AWS CLI command) to mass delete S3 objects or if you have to go through them one by one.
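
If there is a bulk option, it's presumably something along these lines (untested here, bucket name is a placeholder; the CLI is supposed to paginate past the 1,000-object page size on its own):

    # Empty the bucket...
    aws s3 rm s3://example-keep-bucket --recursive
    # ...or delete the bucket together with its contents in one step.
    aws s3 rb s3://example-keep-bucket --force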

Actions #47

Updated by Lucas Di Pentima about 1 year ago

Peter Amstutz wrote in #note-46:

So, I wanted to make sure tfstate gets added to git because that's exactly the sort of thing you want kept in a live configuration management repo.

Yes, but what we have right now is a mix of things. I didn't want to make an ugly situation a bit worse, but if we're going to fix this properly in the future, I'm OK with adding more secret data to what we already distribute. In an ideal situation we would not use "masterless salt"; we would just use salt with a central server that distributes only the data each node needs, instead of everything.

I think you do get back an error that mentions that you can't delete buckets unless they are empty. I added terraform-destroy because it was useful for testing. I deleted the bucket contents manually from the AWS console. That wouldn't work if there were more than 1,000 blocks, though. I don't know if there is an API (or AWS CLI command) to mass delete S3 objects or if you have to go through them one by one.

Ok then, I'll merge the branch as is.

Actions #48

Updated by Lucas Di Pentima about 1 year ago

  • Status changed from In Progress to Resolved