Database SRE

PostgreSQL Monitoring with Prometheus and postgres_exporter: A Production Guide

Set up PostgreSQL monitoring with Prometheus and postgres_exporter. Includes install steps, critical alert rules, Grafana dashboard panels, and custom query metrics.

JusDB Team
March 5, 2026
10 min read
234 views

Production PostgreSQL deployments fail silently — a replication lag spike, a lock pile-up, or runaway table bloat can degrade your application long before users start filing tickets. Prometheus and postgres_exporter give you the instrumentation to catch these problems early, with time-series metrics that integrate naturally into the Grafana dashboards and alerting pipelines your team already uses. This guide walks through the full stack: installing and securing the exporter, writing alerting rules that matter, and building Grafana panels that surface the right signals at a glance. Whether you are running a single-node dev database or a multi-replica production cluster, the setup described here scales with you.

TL;DR
  • postgres_exporter translates PostgreSQL internal statistics into Prometheus metrics on port 9187 — no plugins or extensions required.
  • Always use the pg_monitor role for the monitoring user; superuser is never necessary and is a security liability.
  • Enable PG_EXPORTER_AUTO_DISCOVER_DATABASES=true to automatically pick up new databases without restarting the exporter.
  • The five most actionable alerts are: instance down, replication lag, connection saturation, deadlocks, and table bloat.
  • Custom queries.yaml files extend the exporter beyond built-in metrics — use them for bloat ratios, index usage, and vacuum health.

What is postgres_exporter?

postgres_exporter is an open-source Prometheus exporter maintained by the Prometheus Community. It connects to one or more PostgreSQL instances, queries the system catalog and statistics views (pg_stat_activity, pg_stat_bgwriter, pg_replication_slots, and dozens more), and exposes the results as Prometheus-compatible metrics at http://<host>:9187/metrics.

Unlike application-level monitoring, postgres_exporter captures database-engine internals: buffer cache hit rates, checkpoint pressure, vacuum progress, lock contention, and replication lag. These are the signals that tell you whether PostgreSQL itself is healthy, independent of whether your application queries are slow or fast. The exporter is stateless — it performs no persistent writes and holds no local state — making it safe to restart, redeploy, or run in a container alongside the database.

The project ships with a large set of built-in metric queries that cover the most common monitoring needs out of the box. For anything more specialized — table bloat ratios, custom index usage percentages, vacuum health per schema — you can supply a queries.yaml file with arbitrary SQL that the exporter executes and exposes alongside the built-ins.

How the PostgreSQL Prometheus Stack Works

postgres_exporter Architecture

The monitoring stack has three components: postgres_exporter (collector), Prometheus (storage and alerting), and Grafana (visualization). postgres_exporter runs as a long-lived process on each database host. Prometheus scrapes the exporter's HTTP endpoint every 15 seconds (configurable), stores the resulting time-series data in its local TSDB, and evaluates alerting rules on each scrape cycle. Grafana queries Prometheus via PromQL to render dashboards.

postgres_exporter connects to PostgreSQL using a standard libpq connection string stored in the DATA_SOURCE_NAME environment variable. The connection is kept open and reused across scrapes, minimizing connection overhead. When PG_EXPORTER_AUTO_DISCOVER_DATABASES=true is set, the exporter queries pg_database on each scrape cycle and dynamically connects to any new databases that appear, without requiring a restart.

Security Note

Never connect postgres_exporter with a superuser account in production — use the pg_monitor role which grants read-only access to all monitoring views without superuser privileges. A compromised exporter credential should have zero ability to modify data or schema.

Key Metrics Exposed

postgres_exporter surfaces metrics from all major PostgreSQL statistics subsystems. The list below covers the most operationally significant ones:

  • pg_up — binary exporter connectivity check; 0 means the exporter cannot reach PostgreSQL.
  • pg_stat_database_tup_fetched — cumulative rows fetched, useful as a query throughput proxy.
  • pg_replication_lag — seconds of replication delay on each standby; critical for HA setups.
  • pg_locks_count — number of active locks by mode; spikes indicate contention.
  • pg_stat_bgwriter_buffers_checkpoint — buffers written during checkpoints; sustained high values mean checkpoint pressure.
  • pg_database_size_bytes — per-database size in bytes; useful for capacity trending.
  • pg_stat_activity_count — active connections grouped by state; compare against max_connections.
  • pg_stat_database_blks_hit / blks_read — cache hit ratio components; target >99% hit rate.
  • pg_stat_database_deadlocks — deadlock counter; any nonzero rate warrants investigation.
  • pg_stat_user_tables_n_dead_tup — dead tuples per table; high values indicate vacuum is falling behind.

Installing and Configuring postgres_exporter

Step 1: Download and Install the Binary

postgres_exporter ships as a single statically linked binary with no runtime dependencies. Version 0.15.0 and later fully support PostgreSQL 17.

bash
# Download latest release (v0.15+ supports PG17)
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.15.0/postgres_exporter-0.15.0.linux-amd64.tar.gz
tar xvf postgres_exporter-0.15.0.linux-amd64.tar.gz
sudo mv postgres_exporter /usr/local/bin/

Step 2: Create a Dedicated Monitoring User

The monitoring user needs access to PostgreSQL statistics views but should never have write access to any user data. The built-in pg_monitor role grants exactly the right permissions: read access to all pg_stat_* views, pg_settings, replication slots, and WAL sender state.

bash
# Create dedicated monitoring user
psql -U postgres -c "CREATE USER monitoring WITH PASSWORD 'mon_password' CONNECTION LIMIT 3;"
psql -U postgres -c "GRANT pg_monitor TO monitoring;"

The CONNECTION LIMIT 3 clause is a safeguard: even if the exporter is misconfigured to open multiple connections, it cannot exhaust your connection pool.

Step 3: Configure the Environment File

Store the connection string and exporter flags in /etc/default/postgres_exporter rather than baking them into the systemd unit. This makes credential rotation a one-line edit followed by a service restart.

bash
# /etc/default/postgres_exporter
DATA_SOURCE_NAME="postgresql://monitoring:mon_password@localhost:5432/postgres?sslmode=disable"
PG_EXPORTER_AUTO_DISCOVER_DATABASES=true
PG_EXPORTER_DISABLE_SETTINGS_METRICS=false
Tip

Use PG_EXPORTER_AUTO_DISCOVER_DATABASES=true to automatically scrape metrics from all databases without restarting the exporter when new databases are created. This is especially useful in multi-tenant deployments where new customer databases are provisioned regularly.

Step 4: Create the systemd Service

Run the exporter as a dedicated low-privilege user. Create the system account first, then install the service unit:

bash
sudo useradd --system --no-create-home --shell /bin/false postgres_exporter
ini
# /etc/systemd/system/postgres_exporter.service
[Unit]
Description=PostgreSQL Prometheus Exporter
After=network.target

[Service]
EnvironmentFile=/etc/default/postgres_exporter
ExecStart=/usr/local/bin/postgres_exporter
Restart=always
User=postgres_exporter

[Install]
WantedBy=multi-user.target
bash
sudo systemctl daemon-reload
sudo systemctl enable --now postgres_exporter
# Verify metrics are being served
curl -s http://localhost:9187/metrics | grep pg_up

Step 5: Configure Prometheus to Scrape the Exporter

Add a scrape job to your prometheus.yml. Label each target with env and cluster so your dashboards and alerts can filter by environment and replica set.

yaml
# prometheus.yml
scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['localhost:9187']
        labels:
          env: 'production'
          cluster: 'pg-primary'
    scrape_interval: 15s
    scrape_timeout: 10s

Step 6: Extend with Custom Queries

Create a queries.yaml file and pass it to the exporter via --extend.query-path=/etc/postgres_exporter/queries.yaml. The following query exposes per-table bloat ratios as a gauge metric, enabling Prometheus to alert when dead tuple accumulation exceeds safe thresholds:

yaml
pg_table_bloat:
  query: |
    SELECT schemaname, tablename,
      pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
      n_dead_tup, n_live_tup,
      round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
    FROM pg_stat_user_tables
    WHERE n_live_tup > 1000
    ORDER BY n_dead_tup DESC LIMIT 20;
  metrics:
    - schemaname:
        usage: "LABEL"
    - tablename:
        usage: "LABEL"
    - dead_pct:
        usage: "GAUGE"
        description: "Percentage of dead tuples"

Building PostgreSQL Grafana Dashboards

Connection Saturation Gauge

The single most important operational panel is connection saturation: how close is the current active connection count to max_connections? Use a gauge panel with this PromQL expression:

bash
sum(pg_stat_activity_count{state="active"}) by (instance)
  /
pg_settings_max_connections

Set thresholds at 70% (yellow) and 85% (red). When this gauge is red, new connections are being queued or refused, and application errors will follow immediately.

Replication Lag Time Series

For any high-availability cluster, plot pg_replication_lag as a time series with one line per replica. Add a horizontal threshold line at 300 seconds — your alerting rule boundary. Replication lag spikes that resolve quickly are usually network blips; sustained lag means the replica is genuinely falling behind and your failover RTO is degrading.

Cache Hit Ratio Panel

PostgreSQL's shared buffer cache hit ratio is the most sensitive indicator of memory pressure. Calculate it with:

bash
pg_stat_database_blks_hit
  /
(pg_stat_database_blks_hit + pg_stat_database_blks_read)

A healthy OLTP database should sustain above 99% hit rate. If this drops below 95%, the working set is larger than shared_buffers and you are paying disk I/O for every cache miss. Either increase shared_buffers, add RAM, or investigate whether a query is performing sequential scans on large tables.

Transactions Per Second

Use the rate of pg_stat_database_xact_commit plus pg_stat_database_xact_rollback over a 1-minute window to show TPS. This is your application load proxy — overlay it with latency and error rate from your APM to correlate database load with application degradation.

Top Tables by Dead Tuple Ratio

A table panel driven by the custom pg_table_bloat query shows the top 20 tables ranked by dead tuple percentage. Tables above 20% bloat are candidates for manual VACUUM ANALYZE or autovacuum configuration tuning. Sort by dead_pct descending so the worst offenders are always visible at the top.

Lock Wait Events Heatmap

Aggregate pg_locks_count by mode and render as a heatmap or stacked time series. Sudden spikes in ExclusiveLock or AccessExclusiveLock modes during off-peak hours often indicate DDL migrations or bulk writes that are blocking application reads. A heatmap makes these patterns visible as distinct color bands rather than obscuring them in aggregate counts.

Alerting Rules for Production

The following Prometheus alerting rules cover the five failure modes most likely to cause a production incident. Save them to a file (e.g., /etc/prometheus/rules/postgresql.yml) and reference it from prometheus.yml under rule_files.

yaml
groups:
  - name: postgresql
    rules:
      - alert: PostgreSQLDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down on {{ $labels.instance }}"

      - alert: PostgreSQLReplicationLag
        expr: pg_replication_lag > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag {{ $value }}s on {{ $labels.instance }}"

      - alert: PostgreSQLTooManyConnections
        expr: sum(pg_stat_activity_count) by (instance) > (pg_settings_max_connections * 0.85)
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} connections on {{ $labels.instance }} (>85% of max)"

      - alert: PostgreSQLDeadlocks
        expr: rate(pg_stat_database_deadlocks[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Deadlocks detected on {{ $labels.datname }}"

      - alert: PostgreSQLBloat
        expr: pg_stat_user_tables_n_dead_tup / (pg_stat_user_tables_n_live_tup + 1) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Table bloat >20% on {{ $labels.relname }}"
Alert Tuning

The for duration on each rule is deliberate. PostgreSQLDown fires after 1 minute — fast enough to page on-call before users notice. Replication lag has a 5-minute window to filter out transient network hiccups. Bloat uses 10 minutes because dead tuples accumulate gradually and a momentary spike above 20% is not actionable. Always tune these thresholds against your own baseline before enabling PagerDuty routing.

Alert Fatigue Warning

Do not enable all five alerts on day one without establishing baselines. Start with PostgreSQLDown and PostgreSQLTooManyConnections only. Add replication lag alerts when you have a replica. Add bloat and deadlock alerts after you have two weeks of historical data to calibrate thresholds against normal variance in your workload.

Key Takeaways

Key Takeaways
  • postgres_exporter is the standard, production-proven bridge between PostgreSQL's statistics system and the Prometheus/Grafana monitoring stack — install it once and it covers all PostgreSQL versions from 9.4 to 17.
  • Use the pg_monitor role, not superuser, for all monitoring connections; enforce this with CONNECTION LIMIT 3 as an additional safeguard.
  • The five metrics that matter most in production are: exporter connectivity (pg_up), connection saturation, replication lag, deadlock rate, and table bloat ratio.
  • Extend the exporter with queries.yaml to surface PostgreSQL internals that built-in metrics do not cover — vacuum progress, index bloat, and long-running query counts are the most common additions.
  • Build Grafana dashboards around actionable signals, not raw counters. Cache hit ratio, TPS, and connection saturation answer "is the database healthy right now?" — which is what your on-call engineer needs at 3am.
  • Stage your alerting rollout: connectivity and connection saturation first, replication and bloat alerts after you have baseline data.

Working with JusDB on PostgreSQL Monitoring

Setting up postgres_exporter and writing Prometheus rules is straightforward when you have a single database on a known infrastructure. In production, the monitoring stack has to account for replica topology changes, cross-cluster label consistency, alert routing to the right teams, and dashboard governance across dozens of services — none of which come for free with a default installation.

At JusDB, PostgreSQL monitoring is a core part of every database engagement. We configure postgres_exporter deployments with environment-specific label taxonomies so your Grafana dashboards and PagerDuty routing trees stay coherent as your cluster grows. We write custom queries.yaml extensions tuned to your schema — tracking vacuum lag on your highest-write tables, index bloat on your largest indexes, and connection pool saturation per application service. We also integrate replication health alerts into your existing incident management workflow, whether that is PagerDuty, OpsGenie, or Slack.

If you are running a PostgreSQL cluster today without structured metrics and alerting, every undetected replication lag event or connection storm is a production incident waiting to happen. Learn more about our PostgreSQL services or get in touch to discuss your monitoring setup.

Share this article

JusDB Team

Official JusDB content team