A standby server that is 30 seconds behind your primary is a ticking clock in any high-availability setup. When a failover is forced, that lag translates directly into data loss or a lengthy recovery window — neither of which is acceptable in production. PostgreSQL exposes a rich set of system views and functions that let you measure replication lag precisely, audit slot health, and confirm that your standbys are genuinely ready to take over. This guide walks through every layer of that monitoring stack, from raw LSN arithmetic to Patroni health checks, so you can catch problems before they become incidents.
- Use pg_stat_replication and pg_wal_lsn_diff() on the primary to measure per-standby byte lag across all four LSN milestones.
- Query pg_replication_slots to detect inactive slots and calculate retained WAL before disk fills up.
- Check pg_stat_wal_receiver on each standby to confirm the receiver process is alive and connected.
- Alert on replay lag > 60 s or retained WAL > 5 GB; page on slot bloat and inactive slots immediately.
- Validate synchronous_commit settings and Patroni health endpoints as part of every failover-readiness runbook.
Replication Overview: How PostgreSQL Streaming Replication Works
PostgreSQL streaming replication works by continuously shipping Write-Ahead Log (WAL) records from a primary to one or more standbys. The primary writes every change to its WAL first, then the WAL sender process streams those records to each standby's WAL receiver process over a persistent TCP connection. On the standby, a separate startup process replays the WAL into the data files, keeping the standby's data directory in sync.
Four LSN (Log Sequence Number) milestones describe where each standby is in that pipeline:
- sent_lsn — the last WAL position the primary has sent over the network to the standby.
- write_lsn — the last position the standby has written to its local WAL files (received but not necessarily flushed).
- flush_lsn — the last position the standby has flushed to durable storage (safe against a standby crash).
- replay_lsn — the last position the standby has applied to its data files (what a failover would promote from).
The gap between the primary's current WAL position and each of those milestones is your lag at that stage of the pipeline. WAL that the primary has generated but not yet sent shows up between its current LSN and sent_lsn. Network lag shows up between sent_lsn and write_lsn. Disk lag appears between write_lsn and flush_lsn. Apply lag, the one that matters most for failover, lives between flush_lsn and replay_lsn.
On a healthy standby with a fast SSD and low write volume, write_lsn, flush_lsn, and replay_lsn are typically within milliseconds of each other. A growing gap between flush_lsn and replay_lsn specifically points to CPU pressure or a long-running conflict on the standby, not a network or disk problem.
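The stage-by-stage breakdown can be sketched in a few lines of Python. This is a minimal illustration, not PostgreSQL code: the function names `parse_lsn` and `stage_lag_bytes` and the sample LSN values are made up for the example. It relies only on the fact that a textual LSN is two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit byte position.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def stage_lag_bytes(current: str, sent: str, write: str, flush: str, replay: str) -> dict:
    """Split the total byte lag of one standby into pipeline stages."""
    cur, s, w, f, r = (parse_lsn(x) for x in (current, sent, write, flush, replay))
    return {
        "unsent": cur - s,   # generated on the primary, not yet sent
        "network": s - w,    # sent, not yet written on the standby
        "disk": w - f,       # written, not yet flushed
        "apply": f - r,      # flushed, not yet replayed
        "total": cur - r,    # same value as pg_wal_lsn_diff(current, replay_lsn)
    }


# Hypothetical values: almost all of the 1 MiB lag sits in the apply stage.
lag = stage_lag_bytes("0/5000000", "0/4FFF000", "0/4FFE000", "0/4FFE000", "0/4F00000")
print(lag)
```

Because the four stages always sum to the total, a dashboard built this way shows at a glance which stage a lag spike belongs to.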
Monitoring Replication Lag on the Primary
The pg_stat_replication view on the primary is the authoritative source for per-standby lag. Each row represents one active WAL sender connection, normally one directly attached standby. Cascading replicas do not appear here; they show up in pg_stat_replication on the standby they stream from.
-- Run on the primary
SELECT
    client_addr,
    application_name,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
ORDER BY replay_lag_bytes DESC NULLS LAST;
pg_wal_lsn_diff(a, b) returns a signed numeric value in bytes representing how far b lags behind a. A positive result means the standby is behind; zero means it is fully caught up. PostgreSQL 10+ also populates the write_lag, flush_lag, and replay_lag interval columns with human-readable time deltas, which are useful for alerting but are wall-clock estimates, not exact measurements.
pg_stat_replication only shows connected standbys. If a standby's WAL receiver process crashes or the network drops, the row disappears entirely. Absence of a row is itself a critical alert condition. Always compare the number of rows in pg_stat_replication against your expected standby count.
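A sketch of that row-count comparison follows. It assumes your monitoring job has already fetched the view into a list of dictionaries keyed by column name; the function name `missing_standbys` and the standby names are hypothetical.

```python
def missing_standbys(expected: set[str], rows: list[dict]) -> set[str]:
    """Return expected standbys that have no row in pg_stat_replication."""
    connected = {row["application_name"] for row in rows}
    return expected - connected


# Hypothetical snapshot: standby2 has dropped out of the view.
rows = [{"application_name": "standby1", "state": "streaming"}]
print(sorted(missing_standbys({"standby1", "standby2"}, rows)))  # ['standby2']
```

Matching on application_name rather than counting rows tells you which standby is gone, and it is not fooled by an unexpected extra connection that keeps the count unchanged.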
For a quick operational check, this query returns standbys whose replay lag exceeds 60 seconds or 100 MB, sensible default alert thresholds for most OLTP workloads:
SELECT
    application_name,
    client_addr,
    replay_lag,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
WHERE replay_lag > interval '60 seconds'
   OR pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) > 100 * 1024 * 1024; -- 100 MB
Replication Slots: What They Are and Why They Bite
A replication slot is a server-side bookmark that tells PostgreSQL: "do not recycle WAL segments until this consumer has confirmed it has processed them." Slots are used by logical replication subscribers, by logical decoding clients that read through output plugins such as pgoutput and wal2json, and optionally by physical standbys. They solve the problem of WAL being recycled before a slow or temporarily disconnected consumer has caught up.
The danger is the flip side of that guarantee: if a slot's consumer disappears and no one notices, PostgreSQL will retain every WAL segment generated since the consumer disconnected. On a busy primary, that can fill the pg_wal directory in hours and crash the entire cluster when the disk hits 100%.
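To see why "hours" is realistic, here is a back-of-the-envelope calculation. The free-space and WAL-rate figures are invented for illustration, and `hours_until_full` is a hypothetical helper; plug in your own numbers.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2


def hours_until_full(free_bytes: int, wal_bytes_per_second: float) -> float:
    """Estimate how long an abandoned slot takes to fill the WAL volume."""
    if wal_bytes_per_second <= 0:
        return float("inf")  # no WAL generated, the disk never fills
    return free_bytes / wal_bytes_per_second / 3600


# Hypothetical primary: 200 GiB free, writing 20 MiB of WAL per second.
print(round(hours_until_full(200 * GiB, 20 * MiB), 2))  # 2.84
```

At that rate an orphaned slot leaves under three hours between the consumer disconnecting and the disk filling up.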
Monitoring Replication Slots
The pg_replication_slots view exposes every slot on the primary, whether it has an active connection or not:
-- Run on the primary
SELECT
    slot_name,
    plugin,
    slot_type,
    active,
    active_pid,
    confirmed_flush_lsn,
    restart_lsn,
    pg_wal_lsn_diff(
        pg_current_wal_lsn(),
        COALESCE(confirmed_flush_lsn, restart_lsn)
    ) AS retained_wal_bytes,
    pg_size_pretty(
        pg_wal_lsn_diff(
            pg_current_wal_lsn(),
            COALESCE(confirmed_flush_lsn, restart_lsn)
        )
    ) AS retained_wal_pretty
FROM pg_replication_slots
ORDER BY retained_wal_bytes DESC;
Key columns to understand:
- active: true if a process is currently connected and consuming the slot. false means no consumer is connected, so the slot keeps pinning WAL until the consumer returns or the slot is dropped.
- confirmed_flush_lsn: for logical slots, the LSN up to which the subscriber has confirmed receipt. This is the authoritative lag measurement for logical replication.
- restart_lsn: the oldest WAL position PostgreSQL must retain for this slot. For physical slots, this is the relevant lag boundary.
Slot bloat is one of the most common causes of unplanned PostgreSQL downtime. The queries here compute retained WAL as pg_current_wal_lsn() - confirmed_flush_lsn for logical slots or pg_current_wal_lsn() - restart_lsn for physical slots. For a logical slot, the WAL actually held on disk is bounded by restart_lsn, which can be older than confirmed_flush_lsn, so read the logical figure as a lower bound. Alert at 2 GB of retained WAL, page at 5 GB. If a slot is inactive and has accumulated more than a few hundred megabytes, drop it immediately unless you have a specific reason to preserve it.
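Those slot thresholds could be encoded as a small classifier like the one below. It is a sketch under stated assumptions: `classify_slot` and its return labels are hypothetical, GB is taken as 1024^3 bytes, and the inputs are the active and retained_wal_bytes values from the slot query.

```python
GB = 1024 ** 3


def classify_slot(active: bool, retained_wal_bytes: int,
                  warn_bytes: int = 2 * GB, page_bytes: int = 5 * GB) -> str:
    """Map one pg_replication_slots row to 'ok', 'warn', or 'page'."""
    if not active:
        return "page"  # an inactive slot pages regardless of size
    if retained_wal_bytes > page_bytes:
        return "page"
    if retained_wal_bytes > warn_bytes:
        return "warn"
    return "ok"


print(classify_slot(True, 3 * GB))    # warn
print(classify_slot(False, 10 * 1024 ** 2))  # page
```

Keeping the thresholds as parameters lets a small cluster with a tight WAL volume use lower limits without changing the logic.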
To isolate inactive slots specifically:
SELECT
    slot_name,
    slot_type,
    active,
    pg_size_pretty(
        pg_wal_lsn_diff(
            pg_current_wal_lsn(),
            COALESCE(confirmed_flush_lsn, restart_lsn)
        )
    ) AS bloat_size
FROM pg_replication_slots
WHERE active = false
ORDER BY pg_wal_lsn_diff(
    pg_current_wal_lsn(),
    COALESCE(confirmed_flush_lsn, restart_lsn)
) DESC;
You can also set max_slot_wal_keep_size (PostgreSQL 13+) to cap how much WAL a slot is allowed to retain before PostgreSQL invalidates it automatically. This is a safety valve, not a substitute for active monitoring:
# postgresql.conf
max_slot_wal_keep_size = '10GB'
Monitoring the WAL Receiver on Standbys
While pg_stat_replication gives you the primary's perspective, pg_stat_wal_receiver gives you the standby's perspective. Run this on each standby to confirm the receiver is alive and what it reports about its connection:
-- Run on each standby
SELECT
    status,
    receive_start_lsn,
    flushed_lsn,  -- named received_lsn before PostgreSQL 13
    last_msg_send_time,
    last_msg_receipt_time,
    latest_end_lsn,
    latest_end_time,
    sender_host,
    sender_port,
    conninfo
FROM pg_stat_wal_receiver;
If this view returns zero rows, the WAL receiver process is not running, so the standby is not receiving any data from the primary. This is a critical alert: the standby is accumulating lag silently. The status column reads streaming during normal operation. A standby that is still catching up after a lag spike is reported as catchup in pg_stat_replication.state on the primary, not in this view.
Compare last_msg_receipt_time against now(). If the last message from the primary was received more than wal_receiver_timeout ago (default: 60 seconds), the connection has silently stalled. PostgreSQL will eventually reconnect, but the gap in coverage is a problem you want to catch proactively.
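The staleness comparison might look like this in a monitoring script. It is a minimal sketch: `receiver_stalled` is a hypothetical helper, the timestamps are assumed to be timezone-aware values read from the view, and the 60-second default mirrors the wal_receiver_timeout default mentioned above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def receiver_stalled(last_msg_receipt_time: Optional[datetime], now: datetime,
                     timeout: timedelta = timedelta(seconds=60)) -> bool:
    """True when the standby has not heard from the primary within the timeout."""
    if last_msg_receipt_time is None:
        return True  # no row or no message yet: treat as stalled
    return (now - last_msg_receipt_time) > timeout


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(receiver_stalled(now - timedelta(seconds=5), now))   # False
print(receiver_stalled(now - timedelta(seconds=90), now))  # True
```

Passing `now` in explicitly keeps the check testable and avoids comparing a database timestamp against a monitoring host whose clock may differ.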
Synchronous Commit Modes and Their Monitoring Implications
The synchronous_commit parameter controls how far through the replication pipeline a transaction must propagate before PostgreSQL returns a commit acknowledgment to the client. The available modes and their durability guarantees are:
- off: commit returns immediately; recently committed transactions can be lost on primary crash (not just standby failure).
- local: commit waits for the primary's local WAL flush only; standby lag is irrelevant to commit latency.
- remote_write: commit waits until the standby has written (not flushed) the WAL; survives primary crash but not simultaneous primary+standby OS crash.
- on (default): commit waits until the synchronous standby has flushed the WAL to disk; with no synchronous standby configured it behaves like local.
- remote_apply: commit waits until the standby has replayed the WAL; reads on the standby are guaranteed to see the transaction immediately after commit.
From a monitoring standpoint, synchronous replication means pg_stat_replication.sync_state will show sync for the designated synchronous standby, or quorum for candidates listed under ANY syntax. If that standby's flush lag grows, it starts delaying commits on the primary. The symptom applications see is increased commit latency, which the replication views do not report, so monitor both simultaneously:
SELECT
    application_name,
    sync_state,
    flush_lag,
    pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes
FROM pg_stat_replication
WHERE sync_state IN ('sync', 'quorum');
If synchronous_standby_names names a standby that goes down and synchronous_commit is remote_write or stricter, every write transaction on the primary blocks at commit until a synchronous standby is available again or synchronous_standby_names is changed and the configuration reloaded. PostgreSQL does not fall back to asynchronous commit on its own when wal_sender_timeout elapses. Set synchronous_commit = local as the per-session fallback for non-critical workloads, or use ANY 1 (standby1, standby2) quorum syntax to avoid a single point of failure in your sync standby list.
Failover Readiness Checks
Confirming that a standby is "ready" for promotion requires more than checking that lag is low. Use the following checklist before a planned failover and include it in your runbooks for unplanned failovers:
1. Verify replay lag is within tolerance
-- On primary: confirm replay lag is under 5 seconds and 50 MB
SELECT
    application_name,
    replay_lag,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
WHERE application_name = 'standby1';
2. Confirm no inactive slots remain on the primary
-- On primary
SELECT count(*) FROM pg_replication_slots WHERE active = false;
3. Check the standby's WAL receiver is streaming
-- On standby
SELECT status FROM pg_stat_wal_receiver;
-- Expected: 'streaming'
4. Confirm the standby is in hot standby mode and accepting reads
-- On standby
SELECT pg_is_in_recovery();
-- Returns true on a standby; false means the node has already been promoted (misconfiguration)
5. If using Patroni, check the REST health endpoint
Patroni exposes HTTP endpoints that your load balancer and monitoring stack should query continuously:
# Primary health check: returns 200 only when the node is the leader
curl -s http://patroni-node1:8008/primary | jq .
# Replica health check: returns 200 only when the node is a healthy replica
curl -s http://patroni-node2:8008/replica | jq .
# Full cluster status: shows all nodes, their roles, lag, and timeline
curl -s http://patroni-node1:8008/cluster | jq '.members[].lag'
Each member entry in the /cluster response carries lag in bytes (for replicas), timeline (must match the primary's timeline for clean failover), and state (running vs stopped). Patroni's /health endpoint returns 200 only while PostgreSQL on that node is up and running. A standby is ready to promote cleanly only when its timeline matches the primary's and its lag is within your RPO budget.
Configure your load balancer (HAProxy or AWS NLB target group health checks, for example) to use Patroni's /primary and /replica endpoints rather than raw TCP checks. This keeps writes away from replicas and reads away from unhealthy standbys, even during a failover in progress.
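The timeline-and-lag rule can be applied to a /cluster response with a short script. This is a sketch, not Patroni's own logic: the sample payload is invented, and the field names (members, role, state, timeline, lag) and the accepted role and state strings are assumptions that should be checked against the output of your Patroni version.

```python
import json

# Hypothetical /cluster payload; real responses carry more fields.
SAMPLE = """{"members": [
  {"name": "node1", "role": "leader", "state": "running", "timeline": 7},
  {"name": "node2", "role": "replica", "state": "running", "timeline": 7, "lag": 1024},
  {"name": "node3", "role": "replica", "state": "running", "timeline": 6, "lag": 900000000}
]}"""


def promotable_replicas(cluster: dict, max_lag_bytes: int) -> list[str]:
    """Names of replicas on the leader's timeline with lag inside the budget."""
    members = cluster["members"]
    # Assumption: the leader is labelled "leader" (or "master" on older versions).
    leader = next((m for m in members if m.get("role") in ("leader", "master")), None)
    if leader is None:
        return []
    ready = []
    for m in members:
        if m is leader:
            continue
        lag = m.get("lag")
        healthy = m.get("state") in ("running", "streaming")  # assumed state names
        same_timeline = m.get("timeline") == leader.get("timeline")
        if healthy and same_timeline and isinstance(lag, int) and lag <= max_lag_bytes:
            ready.append(m["name"])
    return ready


print(promotable_replicas(json.loads(SAMPLE), 50 * 1024 * 1024))  # ['node2']
```

A replica whose lag is missing or not a number is treated as not ready, which errs on the side of refusing a promotion rather than allowing a lossy one.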
Recommended alerting thresholds
| Metric | Warn | Page |
|---|---|---|
| Replay lag (time) | > 30 s | > 120 s |
| Replay lag (bytes) | > 50 MB | > 500 MB |
| Retained WAL per slot | > 2 GB | > 5 GB |
| Inactive slots | Any | Any |
| Missing standby from pg_stat_replication | — | Immediately |
| pg_stat_wal_receiver returns 0 rows | — | Immediately |
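
The replay-lag rows of the table translate into a few lines of Python. This is an illustrative sketch: `classify_replay_lag` is a hypothetical name, MB is taken as 1024^2 bytes, and a NULL replay_lag (which PostgreSQL can report when there is no recent activity to measure) is treated as zero time lag so that the byte threshold still applies.

```python
from datetime import timedelta
from typing import Optional

MB = 1024 ** 2


def classify_replay_lag(lag_time: Optional[timedelta], lag_bytes: int) -> str:
    """Return 'page', 'warn', or 'ok' from replay lag in time and bytes."""
    seconds = lag_time.total_seconds() if lag_time is not None else 0.0
    if seconds > 120 or lag_bytes > 500 * MB:
        return "page"
    if seconds > 30 or lag_bytes > 50 * MB:
        return "warn"
    return "ok"


print(classify_replay_lag(timedelta(seconds=45), 10 * MB))  # warn
print(classify_replay_lag(None, 600 * MB))                  # page
```

Evaluating time and bytes together means either signal can raise the severity, which matters because the time-based columns are estimates while the byte counts are exact.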
- pg_stat_replication on the primary tracks all four LSN milestones per standby; use pg_wal_lsn_diff() to convert them to byte counts for alerting.
- A missing row in pg_stat_replication is as critical as high lag: it means the standby is disconnected and accumulating lag silently.
- Replication slot bloat is a leading cause of unplanned outages; drop inactive slots promptly and set max_slot_wal_keep_size as a safety net.
- pg_stat_wal_receiver on the standby confirms the receiver process is alive and when it last heard from the primary.
- Synchronous standbys protect against data loss but introduce commit latency; use quorum-based synchronous_standby_names to avoid a single standby blocking all writes.
- Patroni's REST API provides timeline, lag, and role information that should drive both load balancer health checks and your monitoring dashboards.
- Failover readiness is a checklist, not a single metric: replay lag, slot health, WAL receiver status, timeline, and Patroni state must all be green simultaneously.
How JusDB Helps You Stay on Top of Replication Health
Setting up these queries is the starting point. Keeping them running continuously, correlating spikes with application events, and maintaining runbooks across team members as your cluster evolves is where most teams hit friction. JusDB gives you a managed PostgreSQL environment with built-in replication monitoring, slot bloat alerts, and Patroni integration out of the box — so your team spends time building product instead of writing cron jobs to check pg_replication_slots.
If you are running PostgreSQL in production and want visibility into your replication health without standing up a full observability stack yourself, take a look at what JusDB offers. Whether you need a fully managed cluster or just want expert guidance on hardening your existing setup, the JusDB team has worked through these exact problems at scale.