Idea #16379
SaltStack install integrates with prometheus/grafana (closed)
Description
When using SaltStack for configuration management, admins can choose to use the SaltStack formulas for prometheus and grafana. The Arvados formula integrates with them by providing a prometheus config file and adding the Arvados dashboard (#16213) to grafana.
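For illustration, a minimal sketch of the kind of prometheus scrape config such a formula could render; the hostnames and ports below are placeholders, not the formula's actual output:
<pre>
# Hypothetical sketch of a rendered prometheus config; hostnames/ports
# are made up for illustration.
cat > /etc/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: arvados_controller
    scheme: https
    static_configs:
      - targets: ['controller.cluster.example.com:443']
  - job_name: node_exporter
    static_configs:
      - targets: ['workbench.cluster.example.com:9100']
EOF
</pre>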
Updated by Peter Amstutz over 4 years ago
- Related to Feature #16213: Default metrics dashboard for grafana added
Updated by Peter Amstutz over 4 years ago
- Related to Idea #16428: Metrics dashboard added
Updated by Lucas Di Pentima over 2 years ago
- Target version set to 2022-09-14 sprint
- Assigned To set to Lucas Di Pentima
Updated by Lucas Di Pentima over 2 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima over 2 years ago
- Status changed from In Progress to New
Updated by Peter Amstutz over 2 years ago
- Target version changed from 2022-09-14 sprint to 2022-09-28 sprint
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-09-28 sprint to 2022-10-12 sprint
Updated by Lucas Di Pentima about 2 years ago
- Target version changed from 2022-10-12 sprint to 2022-10-26 sprint
Updated by Lucas Di Pentima about 2 years ago
- Target version changed from 2022-10-26 sprint to 2022-11-09 sprint
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-11-09 sprint to 2022-11-23 sprint
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-11-23 sprint to 2022-12-07 Sprint
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-12-07 Sprint to 2022-12-21 Sprint
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-12-21 Sprint to 2023-01-18 sprint
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2023-01-18 sprint to 2022-12-21 Sprint
Updated by Peter Amstutz about 2 years ago
- Target version changed from 2022-12-21 Sprint to 2023-01-18 sprint
Updated by Peter Amstutz almost 2 years ago
- Target version changed from 2023-01-18 sprint to 2023-02-01 sprint
Updated by Peter Amstutz almost 2 years ago
- Release set to 59
- Target version deleted (2023-02-01 sprint)
Updated by Peter Amstutz almost 2 years ago
- Target version set to To be scheduled
Updated by Peter Amstutz almost 2 years ago
- Target version changed from To be scheduled to 2023-02-01 sprint
Updated by Lucas Di Pentima almost 2 years ago
- Status changed from New to In Progress
Updated by Lucas Di Pentima almost 2 years ago
- Target version changed from 2023-02-01 sprint to 2023-02-15 sprint
Updated by Lucas Di Pentima almost 2 years ago
- Target version changed from 2023-02-15 sprint to 2023-03-01 sprint
Updated by Lucas Di Pentima almost 2 years ago
- Target version changed from 2023-03-01 sprint to Development 2023-03-15 sprint
Updated by Lucas Di Pentima almost 2 years ago
Updates at a8ceae6 - branch 16379-installer-prometheus
- Adds "prometheus" hostname to terraform.
- Configures an nginx vhost for prometheus on the workbench node.
- Installs prometheus using the corresponding salt formula.
- Installs the postgresql exporter and mtail to monitor the database.
- Installs the node exporter to monitor every node's resources.
- Sets basic authentication for the prometheus UI website, configurable from local.params with sensible defaults (see the sketch below).
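A rough sketch of what the vhost plus basic auth amounts to; the hostname, upstream port, and the MONITORING_PASSWORD variable are illustrative, not the branch's actual values:
<pre>
# Hypothetical sketch of the prometheus vhost with basic auth; hostname,
# port, and $MONITORING_PASSWORD are placeholders.
htpasswd -bc /etc/nginx/htpasswd-prometheus monitoring "$MONITORING_PASSWORD"
cat > /etc/nginx/sites-available/prometheus.conf <<'EOF'
server {
  listen 443 ssl;
  server_name prometheus.cluster.example.com;
  auth_basic "Monitoring";
  auth_basic_user_file /etc/nginx/htpasswd-prometheus;
  location / {
    proxy_pass http://127.0.0.1:9090;  # prometheus default listen port
  }
}
EOF
</pre>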
The grafana branch is on the way.
Updated by Lucas Di Pentima almost 2 years ago
Updates at 4ea265c - branch 16379-installer-grafana
- Moves prometheus to its own hostname.
- Adds grafana, using the local prometheus as a data source. Uses the same credentials as prometheus for its admin user.
- Adds default dashboards (arvados, node exporter, pg exporter) assigned to prometheus data source.
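For reference, grafana picks up data sources from its provisioning directory; a minimal sketch of such a file (the URL and credentials are placeholders):
<pre>
# Hypothetical data source provisioning file; URL and credentials are
# illustrative only.
cat > /etc/grafana/provisioning/datasources/prometheus.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: https://prometheus.cluster.example.com
    basicAuth: true
    basicAuthUser: monitoring
    secureJsonData:
      basicAuthPassword: placeholder
    isDefault: true
EOF
</pre>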
Updated by Peter Amstutz almost 2 years ago
- Target version changed from Development 2023-03-15 sprint to Development 2023-03-29 Sprint
Updated by Lucas Di Pentima almost 2 years ago
Detected some issues when trying out a cluster install from scratch. I'm working on fixing them.
Updated by Lucas Di Pentima almost 2 years ago
Updates at 58006c8a7 - branch 16379-installer-grafana
- Fixes the prometheus htpasswd file definition to depend on nginx.
- Fixes the nginx service not being applied on the keepweb node.
- Sets grafana's admin password using grafana-cli to make sure it's always synced (see the sketch below).
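The sync trick is just that grafana-cli's password reset is idempotent, so it can safely run on every salt apply; a sketch (the homepath and password variable are illustrative):
<pre>
# grafana-cli writes straight to grafana's database, so re-running this
# on every apply keeps the admin password in sync with the configured value.
grafana-cli --homepath /usr/share/grafana admin reset-admin-password "$MONITORING_PASSWORD"
</pre>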
Updated by Lucas Di Pentima over 1 year ago
Updates at 4e26d30a7 - branch 16379-installer-prometheus-grafana
- Rebased a copy of the 16379-installer-grafana branch (which branched off 16379-installer-prometheus) onto 20270-salt-installer-less-instances from #20270 so that it can be manually tested with fewer AWS resource requirements.
- Updates the prometheus config to not include the keep1 & keepproxy nodes.
Updated by Lucas Di Pentima over 1 year ago
Updates at 247fd765b - branch 16379-installer-prometheus-grafana
- Fixes the missing monitoring role assignment on the workbench node, which got lost in the rebase.
Updated by Peter Amstutz over 1 year ago
Something's not working with the grafana formula:
<pre>
----------
          ID: grafana-package-install-pkg-installed
    Function: pkg.installed
        Name: grafana
      Result: False
     Comment: Problem encountered installing package(s). Additional info follows:
              errors:
                  - Running scope as unit: run-rc6b15a6a20094c4981c81382cac47d4d.scope
                    E: Unable to locate package grafana
     Started: 21:48:48.780122
    Duration: 1686.505 ms
     Changes:
----------
          ID: grafana-service-running-service-running
    Function: service.running
        Name: grafana-server
      Result: False
     Comment: Recursive requisite found
     Changes:
</pre>
<pre>
admin@workbench:/$ ls /etc/apt/sources.list.d/
arvados.list  phusionpassenger-official-bullseye.list  salt.list
</pre>
hmm.
Updated by Peter Amstutz over 1 year ago
- Target version changed from Development 2023-03-29 Sprint to Development 2023-04-12 sprint
Updated by Peter Amstutz over 1 year ago
Got past the install issue; I had to add the grafana apt repository to the grafana config.
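Roughly the manual equivalent of what the repo config needs to do (a sketch following the upstream Grafana apt repo conventions, not the formula's actual states):
<pre>
# Hypothetical manual equivalent: register the upstream Grafana apt repo
# so that "apt-get install grafana" can resolve the package.
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana
</pre>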
Now it is failing post-install when it runs "grafana-cli admin reset-admin-password":
<pre>
logger=settings t=2023-03-29T14:02:37.839614504Z level=info msg="Starting Grafana" version= commit= branch= compiled=1970-01-01T00:00:00Z
logger=settings t=2023-03-29T14:02:37.839864043Z level=warn msg="\"sentry\" frontend logging provider is deprecated and will be removed in the next major version. Use \"grafana\" provider instead."
logger=settings t=2023-03-29T14:02:37.839911935Z level=info msg="Config loaded from" file=/usr/share/grafana/conf/defaults.ini
logger=settings t=2023-03-29T14:02:37.839928606Z level=info msg="Config loaded from" file=/etc/grafana/grafana.ini
logger=settings t=2023-03-29T14:02:37.839951907Z level=info msg="Config overridden from command line" arg="default.paths.data=/var/lib/grafana"
logger=settings t=2023-03-29T14:02:37.839971827Z level=info msg="Config overridden from command line" arg="default.paths.logs=/var/log/grafana"
logger=settings t=2023-03-29T14:02:37.839985998Z level=info msg="Config overridden from command line" arg="default.paths.plugins=/var/lib/grafana/plugins"
logger=settings t=2023-03-29T14:02:37.840018929Z level=info msg="Config overridden from command line" arg="default.paths.provisioning=/etc/grafana/provisioning"
logger=settings t=2023-03-29T14:02:37.84003455Z level=info msg="Path Home" path=/usr/share/grafana
logger=settings t=2023-03-29T14:02:37.840068241Z level=info msg="Path Data" path=/var/lib/grafana
logger=settings t=2023-03-29T14:02:37.840083492Z level=info msg="Path Logs" path=/var/log/grafana
logger=settings t=2023-03-29T14:02:37.840115943Z level=info msg="Path Plugins" path=/var/lib/grafana/plugins
logger=settings t=2023-03-29T14:02:37.840130503Z level=info msg="Path Provisioning" path=/etc/grafana/provisioning
logger=settings t=2023-03-29T14:02:37.840160634Z level=info msg="App mode production"
logger=sqlstore t=2023-03-29T14:02:37.840252308Z level=info msg="Connecting to DB" dbtype=sqlite3
logger=migrator t=2023-03-29T14:02:40.15632504Z level=info msg="Starting DB migrations"
logger=migrator t=2023-03-29T14:02:40.334212586Z level=info msg="Executing migration" id="clear migration entry \"remove unified alerting data\""
logger=migrator t=2023-03-29T14:02:40.965791527Z level=info msg="Executing migration" id="Add column org_id to builtin_role table"
logger=migrator t=2023-03-29T14:02:40.966004875Z level=error msg="Executing migration failed" id="Add column org_id to builtin_role table" error="duplicate column name: org_id"
logger=migrator t=2023-03-29T14:02:40.966178592Z level=error msg="Exec failed" error="duplicate column name: org_id" sql="alter table `builtin_role` ADD COLUMN `org_id` INTEGER NOT NULL DEFAULT 0 "
Error: ✗ failed to initialize runner: migration failed (id = Add column org_id to builtin_role table): duplicate column name: org_id
</pre>
I did some googling, there's a kind of similar issue here:
https://community.grafana.com/t/error-on-starting-grafana-with-an-empty-database/8614
I'm wondering if we need to do something to initialize the database first.
Updated by Peter Amstutz over 1 year ago
Huh, the MONITORING_* variables were missing from my local.params; that might have had something to do with it.
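For context, a hypothetical example of the kind of settings involved; the variable names and values below are illustrative guesses, not the branch's actual local.params definitions:
<pre>
# Hypothetical MONITORING_* settings in local.params; names and values
# are illustrative only.
MONITORING_USERNAME="monitoring"
MONITORING_PASSWORD="generated-at-install-time"
MONITORING_EMAIL="admin@cluster.example.com"
</pre>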
Updated by Peter Amstutz over 1 year ago
I have it working. I still need to tear down the cluster and install from scratch to confirm that it works without any intervention.
However, the "keepstore bandwidth" and "keepstore bytes by type" graphs are still showing "no data" even after the cluster has been up for at least 12 hours.
Updated by Peter Amstutz over 1 year ago
Figured it out. I did a new install from scratch and it looks like the monitoring is now 100% working out of the box. Very exciting.
Updated by Peter Amstutz over 1 year ago
16379-installer-prometheus-grafana @ 41935b6bc7e8bec90e284082c169479e8f02e4cd
Updated by Peter Amstutz over 1 year ago
- Related to Feature #20285: System status panel that embeds grafana added
Updated by Peter Amstutz over 1 year ago
So, I think this is working pretty well. The only thing that is mildly annoying is that you have to go hunting for the dashboards instead of them being available on the front page by default. I've added documentation explaining how to find them, but having the installer perform some API calls to "star" them automatically would be even better (see the sketch below).
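Something like this could do the starring; the URL, credentials, and dashboard UID are placeholders, not values from the installer:
<pre>
# Hypothetical sketch: star a provisioned dashboard via Grafana's HTTP API
# so it shows up prominently for the user.
GRAFANA_URL="https://grafana.cluster.example.com"
AUTH="monitoring:placeholder"
DASH_UID="arvados-cluster-overview"
# Resolve the dashboard's numeric id from its UID...
DASH_ID=$(curl -s -u "$AUTH" "$GRAFANA_URL/api/dashboards/uid/$DASH_UID" | jq -r '.dashboard.id')
# ...then star it for the authenticated user.
curl -s -X POST -u "$AUTH" "$GRAFANA_URL/api/user/stars/dashboard/$DASH_ID"
</pre>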
My other thought is that it would be great to have a link in the UI; I added #20285.
Updated by Lucas Di Pentima over 1 year ago
I have a couple of comments on the new updates:
- Would you consider adding **/terraform.tfstate* to .gitignore? I personally don't feel comfortable committing terraform state to a repository that will get distributed to many hosts (shell nodes included).
- The terraform-destroy option will fail if the cluster has any data on keep. To avoid confusing error messages, we can either show a warning message, or maybe make terraform skip the S3 bucket by removing that resource from the state before running destroy (see the sketch after this list).
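A sketch of that state-removal idea; the resource address is made up for illustration, the real one would come from terraform state list:
<pre>
# Hypothetical sketch: make terraform forget the keep bucket (without
# deleting it) before tearing everything else down.
terraform state list | grep aws_s3_bucket
terraform state rm aws_s3_bucket.keep_volume   # forget the bucket, don't destroy it
terraform destroy                              # proceeds even if keep has data
</pre>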
Updated by Peter Amstutz over 1 year ago
Lucas Di Pentima wrote in #note-45:
I have a couple of comments on the new updates:
- Would you consider adding **/terraform.tfstate* to .gitignore? I personally don't feel comfortable committing terraform state to a repository that will get distributed to many hosts (shell nodes included).
So, I wanted to make sure tfstate gets added to git because that's exactly the sort of thing you want kept in a live configuration management repo.
Assuming the cloud credentials are not saved (it doesn't look like they are) I don't think there is anything in .tfstate that is any more sensitive than all the other secrets that get distributed (the only other credentials I could find are ones specifically used by Let's Encrypt).
Is the idea simply to avoid having too much information available in case someone cracks the admin account? All the Arvados credentials are already kept in the config file.
One thought I had is to check out to /tmp and rm -rf the staging repo checkout after each run so it doesn't hang around.
Longer term, these things could be kept in a secrets management system, though I think we would have to add some features to the config system to retrieve secrets. But if the node is automatically authorized to fetch secrets, it's not clear how different that is from having them on disk.
- The terraform-destroy option will fail if the cluster has any data on keep. To avoid confusing error messages, we can either show a warning message, or maybe make terraform skip the S3 bucket by removing that resource from the state before running destroy.
I think you do get back an error that mentions that you can't delete buckets unless they are empty. I added terraform-destroy because it was useful for testing. I deleted the bucket contents manually from the AWS console; that wouldn't work if there were more than 1,000 blocks, though. I don't know if there is an API (or AWS CLI command) to mass delete S3 objects or if you have to go through them one by one.
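For what it's worth, the AWS CLI can mass-delete; a sketch with a placeholder bucket name:
<pre>
# "aws s3 rm --recursive" walks the bucket and issues batched DeleteObjects
# calls (up to 1,000 keys per request), so it scales past the console limit.
aws s3 rm s3://example-keep-bucket --recursive
</pre>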
Updated by Lucas Di Pentima over 1 year ago
Peter Amstutz wrote in #note-46:
So, I wanted to make sure tfstate gets added to git because that's exactly the sort of thing you want kept in a live configuration management repo.
Yes, but what we have right now is a mix of things. I didn't want to make an ugly situation a bit worse, but if we'll fix this properly in the future, I'm OK with adding more secret data to what we already distribute. In an ideal situation, we would not use "masterless salt". We would just use salt with a central server that distributes just the data every node needs instead of everything.
I think you do get back an error that mentions that you can't delete buckets unless they are empty. I added terraform-destroy because it was useful for testing. I deleted the bucket contents manually from the AWS console. That wouldn't work if there were more than a 1000 blocks, though. I don't know if there is an API (or AWS CLI command) to mass delete S3 objects or if you have to go through them one by one.
Ok then, I'll merge the branch as is.
Updated by Lucas Di Pentima over 1 year ago
- Status changed from In Progress to Resolved
Applied in changeset arvados|98969e546c909ac2ee4256934b5339080598d252.