Production PostgreSQL deployments fail silently — a replication lag spike, a lock pile-up, or runaway table bloat can degrade your application long before users start filing tickets. Prometheus and postgres_exporter give you the instrumentation to catch these problems early, with time-series metrics that integrate naturally into the Grafana dashboards and alerting pipelines your team already uses. This guide walks through the full stack: installing and securing the exporter, writing alerting rules that matter, and building Grafana panels that surface the right signals at a glance. Whether you are running a single-node dev database or a multi-replica production cluster, the setup described here scales with you.
- postgres_exporter translates PostgreSQL internal statistics into Prometheus metrics on port 9187 — no plugins or extensions required.
- Always use the
pg_monitorrole for the monitoring user; superuser is never necessary and is a security liability. - Enable
PG_EXPORTER_AUTO_DISCOVER_DATABASES=trueto automatically pick up new databases without restarting the exporter. - The five most actionable alerts are: instance down, replication lag, connection saturation, deadlocks, and table bloat.
- Custom
queries.yamlfiles extend the exporter beyond built-in metrics — use them for bloat ratios, index usage, and vacuum health.
What is postgres_exporter?
postgres_exporter is an open-source Prometheus exporter maintained by the Prometheus Community. It connects to one or more PostgreSQL instances, queries the system catalog and statistics views (pg_stat_activity, pg_stat_bgwriter, pg_replication_slots, and dozens more), and exposes the results as Prometheus-compatible metrics at http://<host>:9187/metrics.
Unlike application-level monitoring, postgres_exporter captures database-engine internals: buffer cache hit rates, checkpoint pressure, vacuum progress, lock contention, and replication lag. These are the signals that tell you whether PostgreSQL itself is healthy, independent of whether your application queries are slow or fast. The exporter is stateless — it performs no persistent writes and holds no local state — making it safe to restart, redeploy, or run in a container alongside the database.
The project ships with a large set of built-in metric queries that cover the most common monitoring needs out of the box. For anything more specialized — table bloat ratios, custom index usage percentages, vacuum health per schema — you can supply a queries.yaml file with arbitrary SQL that the exporter executes and exposes alongside the built-ins.
How the PostgreSQL Prometheus Stack Works
postgres_exporter Architecture
The monitoring stack has three components: postgres_exporter (collector), Prometheus (storage and alerting), and Grafana (visualization). postgres_exporter runs as a long-lived process on each database host. Prometheus scrapes the exporter's HTTP endpoint every 15 seconds (configurable), stores the resulting time-series data in its local TSDB, and evaluates alerting rules on each scrape cycle. Grafana queries Prometheus via PromQL to render dashboards.
postgres_exporter connects to PostgreSQL using a standard libpq connection string stored in the DATA_SOURCE_NAME environment variable. The connection is kept open and reused across scrapes, minimizing connection overhead. When PG_EXPORTER_AUTO_DISCOVER_DATABASES=true is set, the exporter queries pg_database on each scrape cycle and dynamically connects to any new databases that appear, without requiring a restart.
Never connect postgres_exporter with a superuser account in production — use the pg_monitor role which grants read-only access to all monitoring views without superuser privileges. A compromised exporter credential should have zero ability to modify data or schema.
Key Metrics Exposed
postgres_exporter surfaces metrics from all major PostgreSQL statistics subsystems. The list below covers the most operationally significant ones:
- pg_up — binary exporter connectivity check; 0 means the exporter cannot reach PostgreSQL.
- pg_stat_database_tup_fetched — cumulative rows fetched, useful as a query throughput proxy.
- pg_replication_lag — seconds of replication delay on each standby; critical for HA setups.
- pg_locks_count — number of active locks by mode; spikes indicate contention.
- pg_stat_bgwriter_buffers_checkpoint — buffers written during checkpoints; sustained high values mean checkpoint pressure.
- pg_database_size_bytes — per-database size in bytes; useful for capacity trending.
- pg_stat_activity_count — active connections grouped by state; compare against
max_connections. - pg_stat_database_blks_hit / blks_read — cache hit ratio components; target >99% hit rate.
- pg_stat_database_deadlocks — deadlock counter; any nonzero rate warrants investigation.
- pg_stat_user_tables_n_dead_tup — dead tuples per table; high values indicate vacuum is falling behind.
Installing and Configuring postgres_exporter
Step 1: Download and Install the Binary
postgres_exporter ships as a single statically linked binary with no runtime dependencies. Version 0.15.0 and later fully support PostgreSQL 17.
# Download latest release (v0.15+ supports PG17)
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.15.0/postgres_exporter-0.15.0.linux-amd64.tar.gz
tar xvf postgres_exporter-0.15.0.linux-amd64.tar.gz
sudo mv postgres_exporter /usr/local/bin/Step 2: Create a Dedicated Monitoring User
The monitoring user needs access to PostgreSQL statistics views but should never have write access to any user data. The built-in pg_monitor role grants exactly the right permissions: read access to all pg_stat_* views, pg_settings, replication slots, and WAL sender state.
# Create dedicated monitoring user
psql -U postgres -c "CREATE USER monitoring WITH PASSWORD 'mon_password' CONNECTION LIMIT 3;"
psql -U postgres -c "GRANT pg_monitor TO monitoring;"The CONNECTION LIMIT 3 clause is a safeguard: even if the exporter is misconfigured to open multiple connections, it cannot exhaust your connection pool.
Step 3: Configure the Environment File
Store the connection string and exporter flags in /etc/default/postgres_exporter rather than baking them into the systemd unit. This makes credential rotation a one-line edit followed by a service restart.
# /etc/default/postgres_exporter
DATA_SOURCE_NAME="postgresql://monitoring:mon_password@localhost:5432/postgres?sslmode=disable"
PG_EXPORTER_AUTO_DISCOVER_DATABASES=true
PG_EXPORTER_DISABLE_SETTINGS_METRICS=falseUse PG_EXPORTER_AUTO_DISCOVER_DATABASES=true to automatically scrape metrics from all databases without restarting the exporter when new databases are created. This is especially useful in multi-tenant deployments where new customer databases are provisioned regularly.
Step 4: Create the systemd Service
Run the exporter as a dedicated low-privilege user. Create the system account first, then install the service unit:
sudo useradd --system --no-create-home --shell /bin/false postgres_exporter# /etc/systemd/system/postgres_exporter.service
[Unit]
Description=PostgreSQL Prometheus Exporter
After=network.target
[Service]
EnvironmentFile=/etc/default/postgres_exporter
ExecStart=/usr/local/bin/postgres_exporter
Restart=always
User=postgres_exporter
[Install]
WantedBy=multi-user.targetsudo systemctl daemon-reload
sudo systemctl enable --now postgres_exporter
# Verify metrics are being served
curl -s http://localhost:9187/metrics | grep pg_upStep 5: Configure Prometheus to Scrape the Exporter
Add a scrape job to your prometheus.yml. Label each target with env and cluster so your dashboards and alerts can filter by environment and replica set.
# prometheus.yml
scrape_configs:
- job_name: 'postgresql'
static_configs:
- targets: ['localhost:9187']
labels:
env: 'production'
cluster: 'pg-primary'
scrape_interval: 15s
scrape_timeout: 10sStep 6: Extend with Custom Queries
Create a queries.yaml file and pass it to the exporter via --extend.query-path=/etc/postgres_exporter/queries.yaml. The following query exposes per-table bloat ratios as a gauge metric, enabling Prometheus to alert when dead tuple accumulation exceeds safe thresholds:
pg_table_bloat:
query: |
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
n_dead_tup, n_live_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
WHERE n_live_tup > 1000
ORDER BY n_dead_tup DESC LIMIT 20;
metrics:
- schemaname:
usage: "LABEL"
- tablename:
usage: "LABEL"
- dead_pct:
usage: "GAUGE"
description: "Percentage of dead tuples"Building PostgreSQL Grafana Dashboards
Connection Saturation Gauge
The single most important operational panel is connection saturation: how close is the current active connection count to max_connections? Use a gauge panel with this PromQL expression:
sum(pg_stat_activity_count{state="active"}) by (instance)
/
pg_settings_max_connectionsSet thresholds at 70% (yellow) and 85% (red). When this gauge is red, new connections are being queued or refused, and application errors will follow immediately.
Replication Lag Time Series
For any high-availability cluster, plot pg_replication_lag as a time series with one line per replica. Add a horizontal threshold line at 300 seconds — your alerting rule boundary. Replication lag spikes that resolve quickly are usually network blips; sustained lag means the replica is genuinely falling behind and your failover RTO is degrading.
Cache Hit Ratio Panel
PostgreSQL's shared buffer cache hit ratio is the most sensitive indicator of memory pressure. Calculate it with:
pg_stat_database_blks_hit
/
(pg_stat_database_blks_hit + pg_stat_database_blks_read)A healthy OLTP database should sustain above 99% hit rate. If this drops below 95%, the working set is larger than shared_buffers and you are paying disk I/O for every cache miss. Either increase shared_buffers, add RAM, or investigate whether a query is performing sequential scans on large tables.
Transactions Per Second
Use the rate of pg_stat_database_xact_commit plus pg_stat_database_xact_rollback over a 1-minute window to show TPS. This is your application load proxy — overlay it with latency and error rate from your APM to correlate database load with application degradation.
Top Tables by Dead Tuple Ratio
A table panel driven by the custom pg_table_bloat query shows the top 20 tables ranked by dead tuple percentage. Tables above 20% bloat are candidates for manual VACUUM ANALYZE or autovacuum configuration tuning. Sort by dead_pct descending so the worst offenders are always visible at the top.
Lock Wait Events Heatmap
Aggregate pg_locks_count by mode and render as a heatmap or stacked time series. Sudden spikes in ExclusiveLock or AccessExclusiveLock modes during off-peak hours often indicate DDL migrations or bulk writes that are blocking application reads. A heatmap makes these patterns visible as distinct color bands rather than obscuring them in aggregate counts.
Alerting Rules for Production
The following Prometheus alerting rules cover the five failure modes most likely to cause a production incident. Save them to a file (e.g., /etc/prometheus/rules/postgresql.yml) and reference it from prometheus.yml under rule_files.
groups:
- name: postgresql
rules:
- alert: PostgreSQLDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL is down on {{ $labels.instance }}"
- alert: PostgreSQLReplicationLag
expr: pg_replication_lag > 300
for: 5m
labels:
severity: warning
annotations:
summary: "Replication lag {{ $value }}s on {{ $labels.instance }}"
- alert: PostgreSQLTooManyConnections
expr: sum(pg_stat_activity_count) by (instance) > (pg_settings_max_connections * 0.85)
for: 2m
labels:
severity: warning
annotations:
summary: "{{ $value }} connections on {{ $labels.instance }} (>85% of max)"
- alert: PostgreSQLDeadlocks
expr: rate(pg_stat_database_deadlocks[5m]) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "Deadlocks detected on {{ $labels.datname }}"
- alert: PostgreSQLBloat
expr: pg_stat_user_tables_n_dead_tup / (pg_stat_user_tables_n_live_tup + 1) > 0.2
for: 10m
labels:
severity: warning
annotations:
summary: "Table bloat >20% on {{ $labels.relname }}"The for duration on each rule is deliberate. PostgreSQLDown fires after 1 minute — fast enough to page on-call before users notice. Replication lag has a 5-minute window to filter out transient network hiccups. Bloat uses 10 minutes because dead tuples accumulate gradually and a momentary spike above 20% is not actionable. Always tune these thresholds against your own baseline before enabling PagerDuty routing.
Do not enable all five alerts on day one without establishing baselines. Start with PostgreSQLDown and PostgreSQLTooManyConnections only. Add replication lag alerts when you have a replica. Add bloat and deadlock alerts after you have two weeks of historical data to calibrate thresholds against normal variance in your workload.
Key Takeaways
- postgres_exporter is the standard, production-proven bridge between PostgreSQL's statistics system and the Prometheus/Grafana monitoring stack — install it once and it covers all PostgreSQL versions from 9.4 to 17.
- Use the
pg_monitorrole, not superuser, for all monitoring connections; enforce this withCONNECTION LIMIT 3as an additional safeguard. - The five metrics that matter most in production are: exporter connectivity (
pg_up), connection saturation, replication lag, deadlock rate, and table bloat ratio. - Extend the exporter with
queries.yamlto surface PostgreSQL internals that built-in metrics do not cover — vacuum progress, index bloat, and long-running query counts are the most common additions. - Build Grafana dashboards around actionable signals, not raw counters. Cache hit ratio, TPS, and connection saturation answer "is the database healthy right now?" — which is what your on-call engineer needs at 3am.
- Stage your alerting rollout: connectivity and connection saturation first, replication and bloat alerts after you have baseline data.
Working with JusDB on PostgreSQL Monitoring
Setting up postgres_exporter and writing Prometheus rules is straightforward when you have a single database on a known infrastructure. In production, the monitoring stack has to account for replica topology changes, cross-cluster label consistency, alert routing to the right teams, and dashboard governance across dozens of services — none of which come for free with a default installation.
At JusDB, PostgreSQL monitoring is a core part of every database engagement. We configure postgres_exporter deployments with environment-specific label taxonomies so your Grafana dashboards and PagerDuty routing trees stay coherent as your cluster grows. We write custom queries.yaml extensions tuned to your schema — tracking vacuum lag on your highest-write tables, index bloat on your largest indexes, and connection pool saturation per application service. We also integrate replication health alerts into your existing incident management workflow, whether that is PagerDuty, OpsGenie, or Slack.
If you are running a PostgreSQL cluster today without structured metrics and alerting, every undetected replication lag event or connection storm is a production incident waiting to happen. Learn more about our PostgreSQL services or get in touch to discuss your monitoring setup.