It was 2:17 AM when the on-call engineer's phone lit up — not from a monitoring alert, but from a frantic Slack message from a customer reporting the app was down. The culprit turned out to be an OOM kill on the primary MySQL node, triggered by a runaway query that had been holding locks and inflating buffer pool usage for over an hour. The team spent 40 minutes digging through raw MySQL slow query logs, piecing together what happened after the fact. A $15/month Grafana Cloud setup with a proper exporter and three alert rules would have caught the buffer pool saturation 20 minutes before the crash — enough time to kill the query and avoid the outage entirely.
- Use mysqld_exporter (port 9104) and postgres_exporter (port 9187) to expose database metrics to Prometheus.
- The five alerts every database must have: replication lag, connection saturation, disk space, slow query spike, and dead tuple bloat.
- Percona Monitoring and Management (PMM) adds Query Analytics and automatic dashboards on top of raw Prometheus — ideal for teams that want a managed experience.
- Set Prometheus
scrape_intervalto 15s for operational metrics and 60s for slower-changing metrics like replication slot lag. - Route critical alerts (replication lag, connection exhaustion) to PagerDuty and warnings to Slack via Alertmanager.
- Use Grafana variables like
$instanceto build multi-host dashboards that scale across your fleet without duplicating panels.
The Monitoring Stack: Prometheus + Exporters + Grafana
The Prometheus ecosystem is the de facto standard for database observability in modern infrastructure. The architecture is deliberately simple: an exporter process runs alongside your database, translates internal database statistics into Prometheus-format metrics, and exposes them on an HTTP endpoint. Prometheus scrapes those endpoints on a defined interval and stores the time-series data. Grafana queries Prometheus to render dashboards. Alertmanager receives firing alerts from Prometheus and routes them to the right channel.
The data flow looks like this:
This architecture is stateless, horizontally scalable, and works equally well for a single database server or a fleet of 200 nodes. The key decision is which exporters to use and what metrics to prioritize — more data is not always better if you have no alert rules to act on it.
MySQL Monitoring with mysqld_exporter
Install and Configure mysqld_exporter
The official mysqld_exporter from Prometheus connects to MySQL using a dedicated monitoring user. Create the user first, then point the exporter at it via the DATA_SOURCE_NAME environment variable.
-- Create a dedicated monitoring user in MySQL
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'strongpassword' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;Install and run the exporter as a systemd service:
# /etc/systemd/system/mysqld_exporter.service
[Unit]
Description=Prometheus MySQL Exporter
After=network.target
[Service]
User=prometheus
Environment="DATA_SOURCE_NAME=exporter:strongpassword@(localhost:3306)/"
ExecStart=/usr/local/bin/mysqld_exporter \
--collect.info_schema.innodb_metrics \
--collect.info_schema.processlist \
--collect.slave_status \
--collect.perf_schema.eventswaits
Restart=always
[Install]
WantedBy=multi-user.targetThe exporter listens on port 9104 by default. Verify it is working with curl http://localhost:9104/metrics | grep mysql_up — you should see mysql_up 1.
Key MySQL Metrics to Alert On
Not every metric MySQL exposes is worth alerting on. Focus on the signals that indicate actual user impact or imminent failure:
mysql_global_status_threads_running— threads actively executing queries. A spike here means queries are piling up. Alert when it exceeds 20-30 for more than 2 minutes.mysql_global_status_innodb_buffer_pool_reads— physical disk reads for InnoDB pages. A rising rate means your working set has outgrown the buffer pool. This is the canary for OOM risk.mysql_slave_status_seconds_behind_master— replication lag in seconds. Alert at 30 seconds for warning, 120 seconds for critical.mysql_global_status_max_used_connections / mysql_global_variables_max_connections— connection pool saturation ratio. Alert when this exceeds 80%.
Prometheus Scrape Config for MySQL
# prometheus.yml — scrape config for MySQL instances
scrape_configs:
- job_name: 'mysql'
scrape_interval: 15s
scrape_timeout: 10s
static_configs:
- targets:
- 'db-primary:9104'
- 'db-replica-01:9104'
- 'db-replica-02:9104'
labels:
environment: 'production'
service: 'mysql'Critical Grafana Panels to Build
The most actionable MySQL dashboard panels are: Threads Running (graph with a threshold line), Buffer Pool Hit Rate (calculated as 1 - (innodb_buffer_pool_reads / innodb_buffer_pool_read_requests) — alert below 99%), Slow Queries per Second, and Replication Lag. Use the $instance Grafana variable to make every panel filter by host so a single dashboard covers your entire MySQL fleet.
PostgreSQL Monitoring with postgres_exporter
Install and Configure postgres_exporter
The postgres_exporter from Prometheus Community connects to PostgreSQL using a standard connection string. The monitoring user needs specific privileges to read system catalog views:
-- Create monitoring role in PostgreSQL
CREATE USER exporter WITH PASSWORD 'strongpassword';
GRANT pg_monitor TO exporter;
-- For PostgreSQL < 10, grant individually:
-- GRANT SELECT ON pg_stat_database TO exporter;# /etc/systemd/system/postgres_exporter.service
[Unit]
Description=Prometheus PostgreSQL Exporter
After=network.target
[Service]
User=postgres
Environment="DATA_SOURCE_NAME=postgresql://exporter:strongpassword@localhost:5432/postgres?sslmode=disable"
ExecStart=/usr/local/bin/postgres_exporter \
--extend.query-path=/etc/postgres_exporter/queries.yaml
Restart=always
[Install]
WantedBy=multi-user.targetThe exporter listens on port 9187. Verify with curl http://localhost:9187/metrics | grep pg_up.
Key PostgreSQL Metrics
pg_stat_activity_countfiltered bystate="active"— active query count. Filtered bystate="idle in transaction"it reveals connection leaks, which are a common source of deadlocks and lock waits.pg_stat_bgwriter_buffers_clean— buffers written by the background writer. A rising count means checkpoints are being triggered too frequently; tunecheckpoint_completion_targetandmax_wal_size.pg_replication_slots_pg_wal_lsn_diff— WAL bytes retained for a replication slot. An unconsumed slot can fill up your disk. Alert at 10 GB retained.pg_stat_user_tables_n_dead_tup— dead tuples per table. Whenn_dead_tup / (n_live_tup + n_dead_tup) > 0.1, autovacuum is falling behind and table bloat is accumulating.
Custom queries.yaml for Application-Specific Metrics
postgres_exporter supports a queries.yaml file for custom SQL metrics — use this to expose application-level data that the built-in metrics do not cover:
# /etc/postgres_exporter/queries.yaml
pg_table_bloat:
query: |
SELECT
schemaname,
tablename,
n_live_tup,
n_dead_tup,
CASE WHEN (n_live_tup + n_dead_tup) > 0
THEN round(n_dead_tup::numeric / (n_live_tup + n_dead_tup), 4)
ELSE 0
END AS dead_ratio
FROM pg_stat_user_tables
WHERE (n_live_tup + n_dead_tup) > 1000
ORDER BY dead_ratio DESC
LIMIT 20
metrics:
- schemaname:
usage: LABEL
description: Schema name
- tablename:
usage: LABEL
description: Table name
- n_live_tup:
usage: GAUGE
description: Live row count
- n_dead_tup:
usage: GAUGE
description: Dead row count
- dead_ratio:
usage: GAUGE
description: Ratio of dead to total tuplesAlerting Rules for PostgreSQL
# prometheus-alerts-postgresql.yml
groups:
- name: postgresql
rules:
- alert: PostgreSQLDeadTupleBloat
expr: |
pg_table_bloat_dead_ratio{job="postgresql"} > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "Table {{ $labels.tablename }} has > 10% dead tuples"
description: "Dead tuple ratio is {{ $value | humanizePercentage }}. Consider running VACUUM ANALYZE."
- alert: PostgreSQLReplicationSlotLag
expr: |
pg_replication_slots_pg_wal_lsn_diff{job="postgresql"} > 10737418240
for: 5m
labels:
severity: critical
annotations:
summary: "Replication slot {{ $labels.slot_name }} is retaining > 10 GB of WAL"Percona Monitoring and Management (PMM)
What PMM Adds Over Raw Prometheus
If setting up and maintaining Prometheus exporters, Grafana dashboards, and Alertmanager configs feels like a significant operational burden, Percona Monitoring and Management (PMM) provides all of that pre-configured in a single Docker container. PMM ships with Query Analytics (QAN), which breaks down query performance by fingerprint — showing you which specific query pattern is responsible for the most execution time, wait events, or lock time. For teams without a dedicated DBA, QAN is often the fastest path to finding a slow query problem without writing custom Prometheus queries.
Installing PMM Server (Docker)
# Create a persistent volume for PMM data
docker volume create pmm-data
# Run PMM Server
docker run -d \
-p 80:80 \
-p 443:443 \
--name pmm-server \
--volume pmm-data:/srv \
--restart always \
percona/pmm-server:latestAccess the PMM UI at http://your-server-ip — default credentials are admin / admin. Change them immediately. PMM will prompt you to set up SSL; use the built-in Let's Encrypt integration if the server is publicly reachable, or provide your own certificates.
Adding MySQL and PostgreSQL Nodes
Install pmm-admin on each database host, then register it with the PMM Server:
# Register the host with PMM Server pmm-admin config --server-url=https://admin:yourpassword@pmm-server-ip # Add MySQL monitoring pmm-admin add mysql \ --username=pmm \ --password=pass \ --host=localhost \ --port=3306 \ --service-name=prod-mysql-primary \ --query-source=perfschema # Add PostgreSQL monitoring pmm-admin add postgresql \ --username=exporter \ --password=strongpassword \ --host=localhost \ --port=5432 \ --service-name=prod-postgresql-primary
After adding a node, Query Analytics data appears in the PMM UI within a few minutes. The slow query flamegraph view aggregates queries by fingerprint and shows execution time distribution — far faster to interpret than parsing raw slow query log files.
The 5 Alerts Every Database Must Have
Most teams have too many alerts, not too few. These five cover the failure modes that cause production outages or data loss:
1. Replication lag greater than 30 seconds. A replica that is more than 30 seconds behind is unreliable for read traffic and risky for failover. Alert at 30s (warning) and 120s (critical, page on-call).
2. Connection pool saturation above 80% of max_connections. At 80% of max_connections, new application threads will start seeing "too many connections" errors within minutes if the load does not subside. This needs a PagerDuty alert, not just a Slack notification.
3. Disk space below 20% free. Database servers fill up faster than expected — binary logs, WAL files, and temp files accumulate silently. Alert at 20% remaining; by the time you hit 5%, you may already be in a degraded state.
4. Slow query rate spike of more than 2x baseline. Use a Prometheus recording rule to calculate the 7-day median slow query rate per instance, then alert when the current rate exceeds double that baseline. This avoids false positives from natural traffic growth while catching real regressions.
5. Dead tuple bloat above 10% in PostgreSQL. Alert when any table has n_dead_tup / (n_live_tup + n_dead_tup) > 0.1. If autovacuum cannot keep up, the table will grow unbounded, queries will slow down, and eventually you will need an emergency VACUUM FULL with a table lock.
# alertmanager.yml — route critical alerts to PagerDuty, warnings to Slack
route:
group_by: ['alertname', 'instance']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'slack-warnings'
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
receivers:
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_INTEGRATION_KEY'
description: '{{ .CommonAnnotations.summary }}'
- name: 'slack-warnings'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#db-alerts'
text: '{{ .CommonAnnotations.description }}'Grafana Dashboard Setup
A well-structured Grafana dashboard separates health status from performance deep-dives. The top row should contain stat panels showing current values for connection saturation, replication lag, and disk usage — traffic-light style with green/yellow/red thresholds. Engineers should be able to read this row from across the room and know if the database is healthy.
Use the $instance template variable to make your dashboards multi-host:
# Grafana dashboard variable definition (via API or dashboard JSON)
templating:
list:
- name: instance
type: query
datasource: Prometheus
query: label_values(mysql_up, instance)
refresh: 2 # Refresh on time range change
multi: false
includeAll: falseWith $instance defined, every panel query references it as a label filter: mysql_global_status_threads_running{instance="$instance"}. You now have a single dashboard that covers your entire MySQL fleet via a dropdown selector rather than one dashboard per host.
scrape_interval below 10 seconds for database exporters. mysqld_exporter and postgres_exporter queries hit live system tables — aggressive scraping at 5s intervals can itself add measurable query load to a busy database server. Use 15s for operational metrics and 60s for metrics like replication slot lag that change slowly.
For time range defaults, set the dashboard default to "Last 3 hours" rather than "Last 6 hours" — shorter time windows make it easier to spot sharp anomalies visually. Use annotations to mark deployments and schema changes so engineers can correlate database metric changes with application releases.
- mysqld_exporter (port 9104) and postgres_exporter (port 9187) are the correct exporters for MySQL and PostgreSQL respectively. Both use
DATA_SOURCE_NAMEfor credentials and require a dedicated monitoring user with minimal privileges. - The five mandatory alerts are: replication lag, connection saturation, disk space, slow query rate spike, and dead tuple bloat. Everything else is secondary until these are covered and tested.
- Route critical alerts to PagerDuty and warnings to Slack in Alertmanager. Do not send everything to PagerDuty — alert fatigue defeats the purpose of monitoring.
- PMM provides Query Analytics and pre-built dashboards as a managed layer on top of Prometheus. It is the fastest path to actionable query-level insights without building custom dashboards from scratch.
- Use
queries.yamlin postgres_exporter to expose custom application metrics — table bloat ratios, queue depths, or business-level counters — without needing a separate exporter. - The
$instanceGrafana variable is the single most important dashboard design decision — it is what makes one dashboard work across your entire fleet instead of requiring one dashboard per host.
Working with JusDB on Database Monitoring
JusDB deploys and configures production-grade database monitoring stacks — Prometheus exporters, Grafana dashboards, and Alertmanager routing — for MySQL and PostgreSQL deployments. We set alert thresholds based on your workload baseline, not generic defaults.
Explore JusDB MySQL Services → | Talk to a DBA
Related reading: