Three months into a zero-downtime migration project, a fintech client discovered that their replica had silently fallen 14 hours behind the primary. The culprit: a logical replication slot created during a failed migration test had been left inactive, holding WAL segments hostage. Their 200 GB data directory had quietly ballooned to over 600 GB overnight, and disk space was gone before any alert fired. The primary did not crash, but without room to write new WAL, it would have within the hour. This is the kind of production failure that replication documentation glosses over—until it happens to you.
PostgreSQL ships with two fundamentally different replication mechanisms, each designed for a different class of problem. Streaming replication (physical) has been the workhorse of PostgreSQL HA since version 9.0. Logical replication, introduced in version 10, unlocked selective table replication, cross-version upgrades, and integration with external systems. Both depend on the Write-Ahead Log (WAL), and both use replication slots to track consumer progress.
Understanding the distinction between physical and logical replication—and the operational hazards of replication slots—is not optional knowledge for anyone running PostgreSQL in production. This guide covers all three: how they work, how to configure them, and where they bite.
- Streaming replication copies the entire WAL stream byte-for-byte to create a physical standby; the replica is an exact binary copy of the primary and can serve read queries or be promoted to primary.
- Logical replication decodes WAL into row-level changes and replicates only specified tables, enabling cross-version upgrades, selective sync, and pub/sub pipelines.
- Replication slots guarantee a consumer never misses WAL data, but an inactive slot will accumulate WAL indefinitely—this is one of the most dangerous silent failure modes in PostgreSQL.
- Key streaming replication configs:
wal_level = replica,max_wal_senders,primary_conninfo, andrecovery_target_timeline. - Logical replication uses
CREATE PUBLICATIONon the primary andCREATE SUBSCRIPTIONon the subscriber; requireswal_level = logical. - Monitor
pg_replication_slotsandpg_stat_replicationdaily in production; setmax_slot_wal_keep_sizeas a disk-protection safety valve.
Background: What PostgreSQL Replication Actually Does
Every write in PostgreSQL is first recorded in the Write-Ahead Log before it is applied to the data files. This is how PostgreSQL guarantees crash safety: even if the server dies mid-write, the WAL contains enough information to replay and complete or roll back the operation. Replication is the mechanism by which that same WAL stream is consumed by another process or server.
The WAL is a sequence of binary records stored in 16 MB segment files in the pg_wal directory. Each record describes a change at the storage level: a page modification, a heap tuple insertion, a B-tree index update. Streaming replication ships these raw binary records to a standby server, which applies them to its own data files identically. The result is a byte-for-byte replica of the primary.
Logical replication takes a different path. It uses a logical decoding plugin (the built-in pgoutput plugin, or third-party plugins like wal2json) to interpret the low-level WAL records and reconstruct them as higher-level, row-oriented change events: INSERT, UPDATE, DELETE. These decoded changes are then replicated to subscribers, which apply them using normal SQL DML. The subscriber does not need to be a byte-for-byte copy of the primary—it can run a different PostgreSQL version, have different indexes, or contain only a subset of tables.
How Replication Works: Physical vs Logical
Streaming Replication (Physical)
Physical streaming replication establishes a persistent TCP connection between a WAL sender process on the primary and a WAL receiver process on the standby. The WAL sender continuously reads new WAL segments and streams them to the receiver, which writes them to the standby's pg_wal directory and signals the startup process to apply them.
In synchronous mode, the primary waits for the standby to confirm it has written (and optionally flushed or applied) each WAL record before acknowledging the transaction to the client. This provides zero data loss at the cost of added latency on every write. In asynchronous mode (the default), the primary commits immediately and the standby catches up in the background—replication lag is possible, and a crash before a WAL segment is delivered means that data is lost.
A physical standby runs in continuous recovery mode. It can serve read-only queries (hot standby) but cannot accept writes. It can be promoted to primary using pg_promote() or a trigger file, at which point it stops recovering and begins accepting writes.
Logical Replication
Logical replication introduces a publisher/subscriber model. The primary defines a publication: a named set of tables (or ALL TABLES) whose changes should be replicated. A subscriber connects to the primary, creates a subscription referencing that publication, and begins receiving decoded change events.
The subscription process first performs an initial table sync: it copies the current contents of each published table to the subscriber via a COPY stream. Once initial sync is complete, it switches to streaming mode and applies ongoing changes in order.
Logical replication is significantly more flexible than physical: the subscriber can have different column names, extra columns, different data types (with compatible casts), and different indexes. It can also replicate from a primary to a subscriber that is itself a primary for other standbys. This makes logical replication the foundation for live major-version upgrades and heterogeneous data pipelines.
Replication Slots
A replication slot is a persistent server-side object that tracks how far a replication consumer has consumed the WAL. The slot records the oldest WAL location the consumer still needs. PostgreSQL will never remove WAL segments that are still required by an active slot. This guarantees that a slow or temporarily disconnected replica will not miss changes when it reconnects.
Both physical and logical replication can use slots. Physical standbys use slots to ensure the primary retains WAL even if the standby falls behind. Logical subscribers automatically create a slot on the primary when a subscription is created. The slot persists until explicitly dropped.
If a replication slot's consumer stops consuming—because the subscriber was dropped without cleaning up, a migration script failed halfway, or the consumer application crashed—the slot continues to hold WAL indefinitely. There is no automatic expiration. A slot that was active three months ago and is now dormant is silently accumulating every WAL segment produced since it was last consumed. On a busy primary, this can fill the disk in hours. Always monitor pg_replication_slots WHERE active = false and drop inactive slots promptly.
Configuration and Setup
Step 1: Configure the Primary for Streaming Replication
Edit postgresql.conf on the primary. These are the minimum required settings:
# Minimum WAL level for streaming replication wal_level = replica # How many simultaneous WAL sender processes to allow # Set to (number of standbys) + (slots for tools like pg_basebackup) + 2 buffer max_wal_senders = 5 # How many replication slots to allow max_replication_slots = 5 # Retain WAL for standbys even if they disconnect (0 = disabled, use slots instead) wal_keep_size = 0 # Safety valve: if a slot causes WAL to grow beyond this, invalidate the slot # rather than filling the disk. Set to something reasonable, e.g. 10GB. # Available in PostgreSQL 13+ max_slot_wal_keep_size = 10GB
Then allow replication connections in pg_hba.conf:
# Allow the replication user from the standby host # TYPE DATABASE USER ADDRESS METHOD host replication replicator 192.168.1.0/24 scram-sha-256
Step 2: Create a Replication User
-- Create a dedicated replication user (never use a superuser for replication)
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'strong_password_here';Step 3: Take a Base Backup and Start the Standby
On the standby server, take a base backup from the primary:
# Run on the standby server # -R writes postgresql.auto.conf and standby.signal automatically pg_basebackup \ --host=primary-host \ --username=replicator \ --pgdata=/var/lib/postgresql/16/main \ --wal-method=stream \ --write-recovery-conf \ --checkpoint=fast \ --progress \ --verbose
--write-recovery-conf automatically creates the standby.signal file and writes the primary_conninfo setting into postgresql.auto.conf. Start the standby PostgreSQL instance normally; it will enter recovery mode and begin streaming.
Step 4: Verify Streaming Replication Status
-- On the primary: check connected standbys
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
write_lag,
flush_lag,
replay_lag,
sync_state
FROM pg_stat_replication;-- On the standby: check recovery status and lag
SELECT
now() - pg_last_xact_replay_timestamp() AS replication_lag,
pg_is_in_recovery() AS is_standby,
pg_last_wal_receive_lsn() AS received_lsn,
pg_last_wal_replay_lsn() AS replayed_lsn;Add synchronous_standby_names = 'standby1' to postgresql.conf on the primary and set synchronous_commit = on (the default). Transactions will not commit until the named standby confirms it has written the WAL record. For maximum protection, use synchronous_commit = remote_apply to wait until the standby has also replayed the change. Expect 1–5ms of additional commit latency depending on network round-trip time.
Step 5: Set Up Logical Replication
Logical replication requires wal_level = logical on the primary (a superset of replica):
# postgresql.conf on the primary wal_level = logical max_replication_slots = 10 max_wal_senders = 10
Restart PostgreSQL after changing wal_level. Then create a publication on the primary:
-- Replicate specific tables
CREATE PUBLICATION orders_pub
FOR TABLE orders, order_items, customers;
-- Or replicate all tables in the database
CREATE PUBLICATION all_tables_pub FOR ALL TABLES;
-- Replicate only INSERT and UPDATE (exclude DELETE, e.g. for an audit replica)
CREATE PUBLICATION inserts_only_pub
FOR TABLE transactions
WITH (publish = 'insert, update');
-- Verify
SELECT pubname, puballtables, pubinsert, pubupdate, pubdelete
FROM pg_publication;Step 6: Create the Subscription on the Subscriber
The subscriber must have the target tables already created with compatible schemas (logical replication does not replicate DDL):
-- Run on the subscriber database
-- The subscription connects to the primary and creates a replication slot automatically
CREATE SUBSCRIPTION orders_sub
CONNECTION 'host=primary-host port=5432 dbname=mydb user=replicator password=strong_password_here'
PUBLICATION orders_pub;
-- Verify subscription status
SELECT subname, subenabled, subslotname, subpublications
FROM pg_subscription;
-- Check sync status for each table
SELECT
sr.subname,
sr.relid::regclass AS table_name,
sr.srsubstate,
CASE sr.srsubstate
WHEN 'i' THEN 'initializing'
WHEN 'd' THEN 'data copy in progress'
WHEN 'f' THEN 'finished table copy'
WHEN 's' THEN 'synchronized'
WHEN 'r' THEN 'ready (streaming)'
END AS state_desc
FROM pg_subscription_rel sr
JOIN pg_subscription s ON s.oid = sr.srsubid;Step 7: Monitor Replication Slots
-- View all replication slots and their WAL retention
SELECT
slot_name,
slot_type,
database,
active,
active_pid,
restart_lsn,
confirmed_flush_lsn,
-- How much WAL is being retained by this slot
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal,
-- When was this slot last active (PostgreSQL 14+)
inactive_since
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;-- Alert query: slots retaining more than 5GB of WAL
SELECT slot_name, active, pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 5 * 1024^3;-- Drop an inactive slot that is no longer needed
-- WARNING: this permanently removes the slot; the consumer will need to resync
SELECT pg_drop_replication_slot('slot_name_here');Performance and Best Practices
WAL apply parallelism on standbys. By default, a physical standby applies WAL records in a single process. For write-heavy primaries, the standby can fall behind under high load. PostgreSQL 14+ supports parallel WAL apply on standbys: set recovery_min_apply_delay and tune max_parallel_replay_workers. For logical replication, PostgreSQL 16 supports parallel apply via max_logical_replication_workers and the apply_delay subscription parameter.
Replica for read scaling. Physical standbys in hot standby mode can absorb read traffic. Route long-running analytics queries and reporting to the standby to reduce load on the primary. Be aware that the standby may return slightly stale data depending on replication lag. For queries that require absolute freshness, direct them to the primary.
Logical replication throughput tuning. Large initial syncs (millions of rows) can be accelerated by increasing wal_sender_timeout and setting logical_decoding_work_mem higher (default 64MB) on the primary. The logical worker on the subscriber can be tuned with max_sync_workers_per_subscription (PostgreSQL 16+) to parallelize the initial table copy.
Schema changes (ALTER TABLE, CREATE INDEX, DROP COLUMN) are not replicated by logical replication. If you add a column to a published table on the primary, you must manually apply the same DDL on the subscriber before the subscription will continue working. A common pattern is to use a migration tool (Flyway, Liquibase) that runs DDL on all databases in the correct order. Forgetting this step will pause the subscription and cause the replication slot to accumulate WAL until the schema mismatch is resolved.
Cascading standbys. Physical standbys can themselves act as sources for further standbys (recovery_target_timeline = 'latest', and primary_conninfo pointing to the upstream standby). This offloads WAL sender processes from the primary when running many replicas. The intermediate standby must have wal_level = replica or higher and max_wal_senders configured.
Table identity for logical replication. Logical replication requires that updated or deleted rows can be uniquely identified on the subscriber. Tables should have a primary key. If they do not, set REPLICA IDENTITY FULL on the table, which causes the full old row to be included in WAL for every UPDATE and DELETE. This is significantly more WAL-intensive; use it only when no primary key exists and cannot be added.
-- Set replica identity for a table without a primary key
ALTER TABLE legacy_audit_log REPLICA IDENTITY FULL;
-- Or use a unique index as the identity
CREATE UNIQUE INDEX ON events (event_uuid);
ALTER TABLE events REPLICA IDENTITY USING INDEX events_event_uuid_idx;Streaming vs Logical Replication: When to Use Each
| Criteria | Streaming (Physical) | Logical |
|---|---|---|
| Use case | HA failover, read replicas, PITR standbys | Cross-version upgrades, selective sync, data pipelines, multi-master |
| What is replicated | Entire database cluster (all databases, all objects) | Specific tables in specific databases |
| PostgreSQL version requirement | Same major version on primary and standby | Subscriber can be a different major version (newer only) |
| DDL replication | Yes (it is part of the WAL byte stream) | No. DDL must be applied manually on subscribers. |
| Standby accepts writes | No (read-only until promoted) | Yes (subscriber is a normal writable database) |
| Initial setup cost | Base backup required (can be large) | Initial table copy via COPY stream (per-table, more resumable) |
| WAL overhead | Low (wal_level = replica) |
Higher (wal_level = logical writes more WAL per change) |
| Replication slot risk | Lower (slots optional for physical) | Higher (subscriptions always use slots; inactive slot = WAL bloat) |
Key Takeaways
- Streaming replication is the right choice for HA standbys and read replicas within the same major version. It is binary, efficient, and automatically replicates DDL.
- Logical replication is the right choice for cross-version migrations, selective table sync, and feeding downstream systems. It requires more careful operational discipline.
- Replication slots are powerful but dangerous if left unmonitored. An inactive slot holding WAL can crash a production primary by filling the disk. Always set
max_slot_wal_keep_sizeas a guard and alert on inactive slots. - For major-version upgrades, logical replication enables a live cutover: run old and new versions in parallel, replicate data to the new version, then flip the application connection string with minimal downtime.
- Always use a dedicated replication role with
REPLICATION LOGINrather than a superuser. Restrict it inpg_hba.confto specific IP ranges. - Monitor
pg_stat_replication,pg_replication_slots, andpg_subscription_relas part of your standard database observability stack. Replication lag and slot bloat should be first-class alerts alongside CPU and disk.
Working with JusDB on PostgreSQL Replication
Getting replication right in production means more than following the documentation. It means deciding between synchronous and asynchronous replication for your specific latency budget, designing a slot management policy before you accumulate technical debt, tuning WAL apply parallelism for your write volume, and building the runbook for failover before the primary goes down at 2 AM.
JusDB's PostgreSQL consultants have designed and operated replication topologies across high-traffic SaaS platforms, financial systems, and regulated data environments. We cover streaming replication architecture, logical replication pipelines for major-version upgrades, replication slot monitoring, and failover automation with tools like Patroni.
PostgreSQL Consulting Services Talk to a PostgreSQL Expert
Whether you are standing up your first streaming replica, executing a zero-downtime major-version upgrade with logical replication, or troubleshooting a replication lag incident, we can help at any stage of the process.
Related Reading
- PostgreSQL Point-in-Time Recovery (PITR) — how WAL archiving and base backups combine with replication for a complete data protection strategy
- Patroni: PostgreSQL High Availability — automating leader election and failover on top of streaming replication
- PgBouncer: PostgreSQL Connection Pooling — managing connections across primary and read replicas efficiently