Database SRE

Database Monitoring with Prometheus, Grafana, and PMM

Monitor MySQL and PostgreSQL with Prometheus, Grafana, and PMM. Covers mysqld_exporter, postgres_exporter, alert rules, and production dashboard configuration.

JusDB Team
October 26, 2022
8 min read
147 views

It was 2:17 AM when the on-call engineer's phone lit up — not from a monitoring alert, but from a frantic Slack message from a customer reporting the app was down. The culprit turned out to be an OOM kill on the primary MySQL node, triggered by a runaway query that had been holding locks and inflating buffer pool usage for over an hour. The team spent 40 minutes digging through raw MySQL slow query logs, piecing together what happened after the fact. A $15/month Grafana Cloud setup with a proper exporter and three alert rules would have caught the buffer pool saturation 20 minutes before the crash — enough time to kill the query and avoid the outage entirely.

TL;DR
  • Use mysqld_exporter (port 9104) and postgres_exporter (port 9187) to expose database metrics to Prometheus.
  • The five alerts every database must have: replication lag, connection saturation, disk space, slow query spike, and dead tuple bloat.
  • Percona Monitoring and Management (PMM) adds Query Analytics and automatic dashboards on top of raw Prometheus — ideal for teams that want a managed experience.
  • Set Prometheus scrape_interval to 15s for operational metrics and 60s for slower-changing metrics like replication slot lag.
  • Route critical alerts (replication lag, connection exhaustion) to PagerDuty and warnings to Slack via Alertmanager.
  • Use Grafana variables like $instance to build multi-host dashboards that scale across your fleet without duplicating panels.

The Monitoring Stack: Prometheus + Exporters + Grafana

The Prometheus ecosystem is the de facto standard for database observability in modern infrastructure. The architecture is deliberately simple: an exporter process runs alongside your database, translates internal database statistics into Prometheus-format metrics, and exposes them on an HTTP endpoint. Prometheus scrapes those endpoints on a defined interval and stores the time-series data. Grafana queries Prometheus to render dashboards. Alertmanager receives firing alerts from Prometheus and routes them to the right channel.

The data flow looks like this:

Database monitoring pipeline — Prometheus + Grafana + Alertmanager MySQL or PostgreSQL is scraped by a sidecar exporter; Prometheus pulls metrics every 15 seconds; the same metric store feeds Grafana dashboards and Alertmanager, which routes pages to PagerDuty or Slack. MySQL / PostgreSQL source of metrics Exporter (sidecar) mysqld_exporter · postgres_exporter exposes /metrics on :9104 or :9187 scrape every 15s Prometheus TSDB · pull-based 15-day default retention Grafana dashboards · ad-hoc query SLO panels · drill-down Alertmanager routing · grouping · silencing PagerDuty · Slack · Email Same metric stream feeds both dashboards (Grafana) and on-call alerting (Alertmanager) — no duplicate scrapes.
Pull-based monitoring pipeline. Prometheus is the metric backbone; Grafana visualizes, Alertmanager pages.

This architecture is stateless, horizontally scalable, and works equally well for a single database server or a fleet of 200 nodes. The key decision is which exporters to use and what metrics to prioritize — more data is not always better if you have no alert rules to act on it.

Tip: Start with the dashboards from Grafana's community dashboard library. Dashboard IDs 7362 (MySQL Overview) and 9628 (PostgreSQL Database) are well-maintained starting points. Customize from there rather than building from scratch.

MySQL Monitoring with mysqld_exporter

Install and Configure mysqld_exporter

The official mysqld_exporter from Prometheus connects to MySQL using a dedicated monitoring user. Create the user first, then point the exporter at it via the DATA_SOURCE_NAME environment variable.

sql
-- Create a dedicated monitoring user in MySQL
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'strongpassword' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;

Install and run the exporter as a systemd service:

yaml
# /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=Prometheus MySQL Exporter
After=network.target

[Service]
User=prometheus
Environment="DATA_SOURCE_NAME=exporter:strongpassword@(localhost:3306)/"
ExecStart=/usr/local/bin/mysqld_exporter \
  --collect.info_schema.innodb_metrics \
  --collect.info_schema.processlist \
  --collect.slave_status \
  --collect.perf_schema.eventswaits
Restart=always

[Install]
WantedBy=multi-user.target

The exporter listens on port 9104 by default. Verify it is working with curl http://localhost:9104/metrics | grep mysql_up — you should see mysql_up 1.

Key MySQL Metrics to Alert On

Not every metric MySQL exposes is worth alerting on. Focus on the signals that indicate actual user impact or imminent failure:

  • mysql_global_status_threads_running — threads actively executing queries. A spike here means queries are piling up. Alert when it exceeds 20-30 for more than 2 minutes.
  • mysql_global_status_innodb_buffer_pool_reads — physical disk reads for InnoDB pages. A rising rate means your working set has outgrown the buffer pool. This is the canary for OOM risk.
  • mysql_slave_status_seconds_behind_master — replication lag in seconds. Alert at 30 seconds for warning, 120 seconds for critical.
  • mysql_global_status_max_used_connections / mysql_global_variables_max_connections — connection pool saturation ratio. Alert when this exceeds 80%.

Prometheus Scrape Config for MySQL

yaml
# prometheus.yml — scrape config for MySQL instances
scrape_configs:
  - job_name: 'mysql'
    scrape_interval: 15s
    scrape_timeout: 10s
    static_configs:
      - targets:
          - 'db-primary:9104'
          - 'db-replica-01:9104'
          - 'db-replica-02:9104'
        labels:
          environment: 'production'
          service: 'mysql'

Critical Grafana Panels to Build

The most actionable MySQL dashboard panels are: Threads Running (graph with a threshold line), Buffer Pool Hit Rate (calculated as 1 - (innodb_buffer_pool_reads / innodb_buffer_pool_read_requests) — alert below 99%), Slow Queries per Second, and Replication Lag. Use the $instance Grafana variable to make every panel filter by host so a single dashboard covers your entire MySQL fleet.

PostgreSQL Monitoring with postgres_exporter

Install and Configure postgres_exporter

The postgres_exporter from Prometheus Community connects to PostgreSQL using a standard connection string. The monitoring user needs specific privileges to read system catalog views:

sql
-- Create monitoring role in PostgreSQL
CREATE USER exporter WITH PASSWORD 'strongpassword';
GRANT pg_monitor TO exporter;
-- For PostgreSQL < 10, grant individually:
-- GRANT SELECT ON pg_stat_database TO exporter;
yaml
# /etc/systemd/system/postgres_exporter.service
[Unit]
Description=Prometheus PostgreSQL Exporter
After=network.target

[Service]
User=postgres
Environment="DATA_SOURCE_NAME=postgresql://exporter:strongpassword@localhost:5432/postgres?sslmode=disable"
ExecStart=/usr/local/bin/postgres_exporter \
  --extend.query-path=/etc/postgres_exporter/queries.yaml
Restart=always

[Install]
WantedBy=multi-user.target

The exporter listens on port 9187. Verify with curl http://localhost:9187/metrics | grep pg_up.

Key PostgreSQL Metrics

  • pg_stat_activity_count filtered by state="active" — active query count. Filtered by state="idle in transaction" it reveals connection leaks, which are a common source of deadlocks and lock waits.
  • pg_stat_bgwriter_buffers_clean — buffers written by the background writer. A rising count means checkpoints are being triggered too frequently; tune checkpoint_completion_target and max_wal_size.
  • pg_replication_slots_pg_wal_lsn_diff — WAL bytes retained for a replication slot. An unconsumed slot can fill up your disk. Alert at 10 GB retained.
  • pg_stat_user_tables_n_dead_tup — dead tuples per table. When n_dead_tup / (n_live_tup + n_dead_tup) > 0.1, autovacuum is falling behind and table bloat is accumulating.

Custom queries.yaml for Application-Specific Metrics

postgres_exporter supports a queries.yaml file for custom SQL metrics — use this to expose application-level data that the built-in metrics do not cover:

yaml
# /etc/postgres_exporter/queries.yaml
pg_table_bloat:
  query: |
    SELECT
      schemaname,
      tablename,
      n_live_tup,
      n_dead_tup,
      CASE WHEN (n_live_tup + n_dead_tup) > 0
           THEN round(n_dead_tup::numeric / (n_live_tup + n_dead_tup), 4)
           ELSE 0
      END AS dead_ratio
    FROM pg_stat_user_tables
    WHERE (n_live_tup + n_dead_tup) > 1000
    ORDER BY dead_ratio DESC
    LIMIT 20
  metrics:
    - schemaname:
        usage: LABEL
        description: Schema name
    - tablename:
        usage: LABEL
        description: Table name
    - n_live_tup:
        usage: GAUGE
        description: Live row count
    - n_dead_tup:
        usage: GAUGE
        description: Dead row count
    - dead_ratio:
        usage: GAUGE
        description: Ratio of dead to total tuples

Alerting Rules for PostgreSQL

yaml
# prometheus-alerts-postgresql.yml
groups:
  - name: postgresql
    rules:
      - alert: PostgreSQLDeadTupleBloat
        expr: |
          pg_table_bloat_dead_ratio{job="postgresql"} > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Table {{ $labels.tablename }} has > 10% dead tuples"
          description: "Dead tuple ratio is {{ $value | humanizePercentage }}. Consider running VACUUM ANALYZE."

      - alert: PostgreSQLReplicationSlotLag
        expr: |
          pg_replication_slots_pg_wal_lsn_diff{job="postgresql"} > 10737418240
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Replication slot {{ $labels.slot_name }} is retaining > 10 GB of WAL"

Percona Monitoring and Management (PMM)

What PMM Adds Over Raw Prometheus

If setting up and maintaining Prometheus exporters, Grafana dashboards, and Alertmanager configs feels like a significant operational burden, Percona Monitoring and Management (PMM) provides all of that pre-configured in a single Docker container. PMM ships with Query Analytics (QAN), which breaks down query performance by fingerprint — showing you which specific query pattern is responsible for the most execution time, wait events, or lock time. For teams without a dedicated DBA, QAN is often the fastest path to finding a slow query problem without writing custom Prometheus queries.

Important: PMM stores its time-series data in its own VictoriaMetrics instance, which is bundled inside the PMM Server container. If you are already running a Prometheus cluster, you can configure PMM to remote-write to it, but the default setup uses VictoriaMetrics internally. Plan your storage accordingly — PMM stores 30 days of metrics by default.

Installing PMM Server (Docker)

yaml
# Create a persistent volume for PMM data
docker volume create pmm-data

# Run PMM Server
docker run -d \
  -p 80:80 \
  -p 443:443 \
  --name pmm-server \
  --volume pmm-data:/srv \
  --restart always \
  percona/pmm-server:latest

Access the PMM UI at http://your-server-ip — default credentials are admin / admin. Change them immediately. PMM will prompt you to set up SSL; use the built-in Let's Encrypt integration if the server is publicly reachable, or provide your own certificates.

Adding MySQL and PostgreSQL Nodes

Install pmm-admin on each database host, then register it with the PMM Server:

# Register the host with PMM Server
pmm-admin config --server-url=https://admin:yourpassword@pmm-server-ip

# Add MySQL monitoring
pmm-admin add mysql \
  --username=pmm \
  --password=pass \
  --host=localhost \
  --port=3306 \
  --service-name=prod-mysql-primary \
  --query-source=perfschema

# Add PostgreSQL monitoring
pmm-admin add postgresql \
  --username=exporter \
  --password=strongpassword \
  --host=localhost \
  --port=5432 \
  --service-name=prod-postgresql-primary

After adding a node, Query Analytics data appears in the PMM UI within a few minutes. The slow query flamegraph view aggregates queries by fingerprint and shows execution time distribution — far faster to interpret than parsing raw slow query log files.

The 5 Alerts Every Database Must Have

Most teams have too many alerts, not too few. These five cover the failure modes that cause production outages or data loss:

1. Replication lag greater than 30 seconds. A replica that is more than 30 seconds behind is unreliable for read traffic and risky for failover. Alert at 30s (warning) and 120s (critical, page on-call).

2. Connection pool saturation above 80% of max_connections. At 80% of max_connections, new application threads will start seeing "too many connections" errors within minutes if the load does not subside. This needs a PagerDuty alert, not just a Slack notification.

3. Disk space below 20% free. Database servers fill up faster than expected — binary logs, WAL files, and temp files accumulate silently. Alert at 20% remaining; by the time you hit 5%, you may already be in a degraded state.

4. Slow query rate spike of more than 2x baseline. Use a Prometheus recording rule to calculate the 7-day median slow query rate per instance, then alert when the current rate exceeds double that baseline. This avoids false positives from natural traffic growth while catching real regressions.

5. Dead tuple bloat above 10% in PostgreSQL. Alert when any table has n_dead_tup / (n_live_tup + n_dead_tup) > 0.1. If autovacuum cannot keep up, the table will grow unbounded, queries will slow down, and eventually you will need an emergency VACUUM FULL with a table lock.

yaml
# alertmanager.yml — route critical alerts to PagerDuty, warnings to Slack
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-warnings'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
        description: '{{ .CommonAnnotations.summary }}'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#db-alerts'
        text: '{{ .CommonAnnotations.description }}'

Grafana Dashboard Setup

A well-structured Grafana dashboard separates health status from performance deep-dives. The top row should contain stat panels showing current values for connection saturation, replication lag, and disk usage — traffic-light style with green/yellow/red thresholds. Engineers should be able to read this row from across the room and know if the database is healthy.

Use the $instance template variable to make your dashboards multi-host:

yaml
# Grafana dashboard variable definition (via API or dashboard JSON)
templating:
  list:
    - name: instance
      type: query
      datasource: Prometheus
      query: label_values(mysql_up, instance)
      refresh: 2  # Refresh on time range change
      multi: false
      includeAll: false

With $instance defined, every panel query references it as a label filter: mysql_global_status_threads_running{instance="$instance"}. You now have a single dashboard that covers your entire MySQL fleet via a dropdown selector rather than one dashboard per host.

Warning: Avoid setting Prometheus scrape_interval below 10 seconds for database exporters. mysqld_exporter and postgres_exporter queries hit live system tables — aggressive scraping at 5s intervals can itself add measurable query load to a busy database server. Use 15s for operational metrics and 60s for metrics like replication slot lag that change slowly.

For time range defaults, set the dashboard default to "Last 3 hours" rather than "Last 6 hours" — shorter time windows make it easier to spot sharp anomalies visually. Use annotations to mark deployments and schema changes so engineers can correlate database metric changes with application releases.

Key Takeaways
  • mysqld_exporter (port 9104) and postgres_exporter (port 9187) are the correct exporters for MySQL and PostgreSQL respectively. Both use DATA_SOURCE_NAME for credentials and require a dedicated monitoring user with minimal privileges.
  • The five mandatory alerts are: replication lag, connection saturation, disk space, slow query rate spike, and dead tuple bloat. Everything else is secondary until these are covered and tested.
  • Route critical alerts to PagerDuty and warnings to Slack in Alertmanager. Do not send everything to PagerDuty — alert fatigue defeats the purpose of monitoring.
  • PMM provides Query Analytics and pre-built dashboards as a managed layer on top of Prometheus. It is the fastest path to actionable query-level insights without building custom dashboards from scratch.
  • Use queries.yaml in postgres_exporter to expose custom application metrics — table bloat ratios, queue depths, or business-level counters — without needing a separate exporter.
  • The $instance Grafana variable is the single most important dashboard design decision — it is what makes one dashboard work across your entire fleet instead of requiring one dashboard per host.

Working with JusDB on Database Monitoring

JusDB deploys and configures production-grade database monitoring stacks — Prometheus exporters, Grafana dashboards, and Alertmanager routing — for MySQL and PostgreSQL deployments. We set alert thresholds based on your workload baseline, not generic defaults.

Explore JusDB MySQL Services →  |  Talk to a DBA

Related reading:

Share this article