It was 11:04 PM on a Friday when the primary PostgreSQL node for a mid-sized e-commerce company silently died: a kernel panic on bare metal with no automatic recovery path. Orders were queuing in memory, payment callbacks were timing out, and the on-call DBA was digging through pg_controldata output while Slack notifications piled up. Forty-seven minutes later, after manually promoting the standby, updating the connection string in the application config, and reloading HAProxy, the site came back. The post-mortem conclusion was blunt: the replica had been ready and fully caught up within eight seconds of the primary going down, but nothing was in place to act on that information. Patroni would have completed the entire failover in about thirty seconds, automatically, while the DBA slept.
- Patroni is a Python daemon that wraps PostgreSQL and uses a distributed configuration store (DCS) — etcd, Consul, or ZooKeeper — to elect a single leader and automate failover across a cluster of nodes.
- It is not a proxy or load balancer; it controls pg_ctl and pg_rewind on each node and exposes a REST API on port 8008 that downstream tools like HAProxy query to determine which node is currently the primary.
- A minimal production cluster requires three Patroni nodes (or two nodes plus an external DCS quorum node) so the DCS can always reach majority agreement without a split-brain scenario.
- patronictl list, switchover, failover, reinit, and pause/resume are the core operator commands for day-to-day management.
- HAProxy uses Patroni's /master and /replica health endpoints to route read/write traffic to the correct node without any manual reconfiguration after a failover.
- Key tuning knobs are ttl (leader lock expiry, default 30 s), loop_wait (health-check interval, default 10 s), and maximum_lag_on_failover (replica lag gate); enable pg_rewind so a failed primary can rejoin without a full pg_basebackup.
What is Patroni
Patroni is an open-source high-availability solution for PostgreSQL maintained by Zalando. At its core it is a Python process — one instance per database node — that continuously monitors PostgreSQL's health, manages leader election through a DCS, and performs or triggers failover when the primary becomes unavailable. It is not a proxy, not a middleware layer, and not a fork of PostgreSQL. It is an orchestration daemon that talks directly to the PostgreSQL process on its own host and to the DCS cluster for coordination with peer nodes.
The critical distinction is where authority lives. In a traditional streaming replication setup, PostgreSQL itself has no knowledge of the other nodes in your cluster; a human or an external script must decide when to promote a replica and how to notify the rest of the system. Patroni externalizes that authority to the DCS, which is a distributed, strongly-consistent key-value store. The node that currently holds a specific key in the DCS — the leader lock — is the primary. Every other node is a replica. If the leader cannot renew its lock within the configured ttl window because it is down or partitioned, the DCS expires the key and a replica wins a new election. The whole mechanism is deterministic and does not require any human intervention.
Patroni works alongside your existing PostgreSQL installation — it does not replace pg_ctl, it calls it. You can adopt Patroni on an existing replication setup without changing your PostgreSQL major version or data directory layout.
Patroni Architecture
The DCS (etcd / Consul / ZooKeeper) as the Source of Truth
Every Patroni node must be able to reach the DCS. etcd is by far the most common choice because it provides strong consistency via the Raft protocol, has a simple HTTP API, and deploys cleanly as a three- or five-node cluster. Consul is a viable alternative, especially if your organization already uses it for service discovery. ZooKeeper works but adds operational complexity. Whichever DCS you choose, it must itself be highly available — a DCS cluster that cannot reach quorum will prevent all leader elections and block new failovers.
Patroni stores its cluster state under a namespace key in the DCS. The most important sub-key is /leader, which holds the name of the current primary node. The current leader writes its own node name into this key and refreshes it every loop_wait seconds. If the refresh does not happen within ttl seconds, the DCS expires the key. Replicas watch this key and begin an election the moment it disappears.
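The lease behaviour described above can be modelled in a few lines. This is a minimal in-memory sketch, not Patroni's or etcd's actual API: the LeaderKey class and its method names are hypothetical, and the clock is passed in explicitly so expiry can be shown without sleeping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaderKey:
    """Toy model of the /leader key: a value that vanishes ttl seconds after its last refresh."""
    ttl: float
    holder: Optional[str] = None
    refreshed_at: float = 0.0

    def current(self, now: float) -> Optional[str]:
        # The DCS drops the key once ttl seconds pass without a refresh.
        if self.holder is not None and now - self.refreshed_at >= self.ttl:
            self.holder = None
        return self.holder

    def refresh(self, node: str, now: float) -> bool:
        # Only the current holder may extend the lease.
        if self.current(now) != node:
            return False
        self.refreshed_at = now
        return True

    def acquire(self, node: str, now: float) -> bool:
        # Create-if-absent: succeeds only when no live leader key exists.
        if self.current(now) is not None:
            return False
        self.holder, self.refreshed_at = node, now
        return True


key = LeaderKey(ttl=30)
key.acquire("pg-node-1", now=0)
key.refresh("pg-node-1", now=10)         # normal loop_wait heartbeat
print(key.current(now=39))               # pg-node-1: 29 s since the last refresh
print(key.current(now=40))               # None: the lease has expired
print(key.acquire("pg-node-2", now=40))  # a replica can now take the key
```

In a real deployment the expiry is enforced by the DCS itself, so it still happens when the leader's host is powered off.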
The Leader Election Loop
When the leader lock expires, every replica that is eligible to be promoted — meaning its replication lag is within the maximum_lag_on_failover threshold — attempts to acquire the lock by writing its node name to the DCS key using a compare-and-swap operation. Only one write can succeed. The winner calls pg_ctl promote on its local PostgreSQL instance, which ends recovery mode and opens the instance for read-write connections. All other replicas detect the new leader name in the DCS and reconfigure their primary_conninfo to stream from the new primary.
The election itself, from lock expiry to promotion, typically completes within a few seconds on a well-connected network. Counting the wait for the lock to expire, the whole failover takes roughly 30 seconds with default settings, a fraction of the time required for even the fastest manual failover.
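To make the race concrete, here is a small simulation of the election: a lag gate followed by a compare-and-swap on a single key. The Dcs class is a hypothetical stand-in for etcd, the threads stand in for separate Patroni nodes, and pg-node-4 is an invented lagging replica added only to show the gate at work.

```python
import threading
from typing import Dict, List, Optional

MAX_LAG_BYTES = 1_048_576  # mirrors maximum_lag_on_failover in this article


class Dcs:
    """Hypothetical stand-in for the DCS: one key with an atomic compare-and-swap."""

    def __init__(self) -> None:
        self._leader: Optional[str] = None
        self._mutex = threading.Lock()

    def compare_and_swap(self, expected: Optional[str], new: str) -> bool:
        with self._mutex:
            if self._leader != expected:
                return False
            self._leader = new
            return True

    @property
    def leader(self) -> Optional[str]:
        return self._leader


def run_election(dcs: Dcs, lag_by_node: Dict[str, int]) -> List[str]:
    """Let every eligible replica race for the lock; return the nodes that won."""
    winners: List[str] = []

    def campaign(node: str) -> None:
        # Succeeds only if the key is still empty when this write lands.
        if dcs.compare_and_swap(None, node):
            winners.append(node)  # a real node would promote PostgreSQL here

    eligible = [n for n, lag in lag_by_node.items() if lag <= MAX_LAG_BYTES]
    threads = [threading.Thread(target=campaign, args=(n,)) for n in eligible]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


dcs = Dcs()
won = run_election(dcs, {"pg-node-2": 0, "pg-node-3": 0, "pg-node-4": 5_000_000})
print(won, dcs.leader)  # exactly one winner, and never the lagging pg-node-4
```

Which of the two caught-up replicas wins depends on thread scheduling, just as the real outcome depends on whose write reaches the DCS first.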
Patroni Agent on Each Node
Each node runs one Patroni process alongside one PostgreSQL process. Patroni starts, stops, and configures PostgreSQL according to the DCS state. It also exposes a REST API on port 8008 (by default) with endpoints including /health, /master, /replica, and /patroni. The /master endpoint returns HTTP 200 only on the current primary, and HTTP 503 on all replicas. The /replica endpoint returns HTTP 200 only on healthy replicas. These endpoints are the integration point for load balancers like HAProxy.
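A load balancer's view of these endpoints can be sketched with the standard library alone. The helper names below (http_status, classify, node_role) are illustrative, and the sketch relies only on the status-code behaviour described above: 200 from /master on the primary, 200 from /replica on a healthy replica.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # urllib raises on 4xx/5xx; the code is what we want
    except OSError:
        return 0          # connection refused, timeout, DNS failure


def classify(master_status: int, replica_status: int) -> str:
    """Map the two health-check results onto a role, as a load balancer would."""
    if master_status == 200:
        return "primary"
    if replica_status == 200:
        return "replica"
    return "unavailable"


def node_role(host: str, port: int = 8008) -> str:
    """Query both endpoints on one Patroni node and classify the result."""
    base = f"http://{host}:{port}"
    return classify(http_status(f"{base}/master"), http_status(f"{base}/replica"))


# Example call against the first node used in this article:
#   node_role("10.0.0.1")
print(classify(200, 503), classify(503, 200), classify(0, 0))
```

Treating an unreachable REST API the same as a 503 matches the safe default: a node whose role cannot be confirmed receives no traffic.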
Setting Up a 3-Node Patroni Cluster
Step 1: Install Patroni and etcd
The following commands install Patroni with the etcd driver and a compatible version of the etcd client library on each PostgreSQL node. Run a separate etcd cluster — either on dedicated hosts or co-located on the same three nodes — before starting Patroni.
# On each Patroni node (Ubuntu/Debian example)
sudo apt-get install -y python3-pip python3-venv postgresql-16
# Create a virtual environment to isolate Patroni dependencies
python3 -m venv /opt/patroni
source /opt/patroni/bin/activate
# Install Patroni with the etcd3 DCS driver, plus a PostgreSQL driver
# (Patroni does not pull in psycopg automatically)
pip install 'patroni[etcd3]' psycopg2-binary
# Verify installation
patroni --version
# On your etcd nodes (can be the same 3 machines)
sudo apt-get install -y etcd-server etcd-client
# Start etcd on node 1 (adjust IPs for nodes 2 and 3)
etcd \
--name etcd1 \
--initial-advertise-peer-urls http://10.0.0.1:2380 \
--listen-peer-urls http://10.0.0.1:2380 \
--advertise-client-urls http://10.0.0.1:2379 \
--listen-client-urls http://10.0.0.1:2379,http://127.0.0.1:2379 \
--initial-cluster etcd1=http://10.0.0.1:2380,etcd2=http://10.0.0.2:2380,etcd3=http://10.0.0.3:2380 \
--initial-cluster-state new \
--initial-cluster-token pg-etcd-cluster
Step 2: patroni.yml Configuration
The following is a production-ready annotated patroni.yml for the first node. Adjust name, connect_address, and postgresql.data_dir for each subsequent node. Save to /etc/patroni/patroni.yml on each host.
# /etc/patroni/patroni.yml (node 1: pg-node-1)
scope: pg-production # Cluster name, shared by all nodes
namespace: /patroni/ # DCS key prefix
name: pg-node-1 # Unique name for this node
restapi:
  listen: 0.0.0.0:8008 # Patroni REST API, all interfaces
  connect_address: 10.0.0.1:8008
etcd3:
  hosts:
    - 10.0.0.1:2379
    - 10.0.0.2:2379
    - 10.0.0.3:2379
bootstrap:
  # DCS-stored cluster configuration, applies to all nodes
  dcs:
    ttl: 30 # Leader lock expiry in seconds (default: 30)
    loop_wait: 10 # Health-check loop interval (default: 10)
    retry_timeout: 10 # DCS operation retry window
    maximum_lag_on_failover: 1048576 # 1 MB; replicas lagging more are excluded
    postgresql:
      use_pg_rewind: true # Allow demoted primary to rejoin without pg_basebackup
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on" # Lets pg_rewind work (data checksums would also do)
        archive_mode: "on"
        archive_command: "test ! -f /mnt/wal-archive/%f && cp %p /mnt/wal-archive/%f"
  # initdb options, applied only on the very first bootstrap of the cluster
  initdb:
    - encoding: UTF8
    - data-checksums # Enable page checksums
  pg_hba:
    - host replication replicator 10.0.0.0/24 scram-sha-256
    - host all all 0.0.0.0/0 scram-sha-256
  # Deprecated in recent Patroni releases; verify support in your version
  users:
    admin:
      password: "changeme-admin"
      options:
        - createrole
        - createdb
    replicator:
      password: "changeme-replicator"
      options:
        - replication
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /var/lib/postgresql/16/main
  bin_dir: /usr/lib/postgresql/16/bin
  pgpass: /var/lib/postgresql/.pgpass
  authentication:
    replication:
      username: replicator
      password: "changeme-replicator"
    superuser:
      username: postgres
      password: "changeme-postgres"
    rewind:
      username: rewind_user
      password: "changeme-rewind"
  # Node-local postgresql.conf overrides
  parameters:
    max_connections: 200
    shared_buffers: 2GB
    effective_cache_size: 6GB
    maintenance_work_mem: 512MB
    checkpoint_completion_target: 0.9
    wal_buffers: 64MB
    default_statistics_target: 100
watchdog:
  mode: required # Fail safe: if watchdog unavailable, node won't become leader
  device: /dev/watchdog # Linux software watchdog, extra split-brain protection
  safety_margin: 5 # Seconds of margin before watchdog fires
pg_rewind needs either wal_log_hints: on or data checksums (data-checksums at initdb time); this configuration enables both. Without pg_rewind, a failed primary that was ahead of the new leader cannot rejoin the cluster as it is: it must rebuild its data directory from scratch via pg_basebackup, which can take hours on large databases.
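For the per-node adjustments mentioned before the configuration, the values that change from host to host can be derived from a node index and an IP address. The function below is a hypothetical helper, not part of Patroni, and its dotted keys are only labels for the corresponding patroni.yml settings.

```python
from typing import Dict


def node_overrides(index: int, ip: str) -> Dict[str, str]:
    """Return the patroni.yml values that differ on node number index."""
    return {
        "name": f"pg-node-{index}",
        "restapi.connect_address": f"{ip}:8008",
        "postgresql.connect_address": f"{ip}:5432",
    }


for i, ip in enumerate(["10.0.0.1", "10.0.0.2", "10.0.0.3"], start=1):
    print(node_overrides(i, ip))
```

Everything else in the file, including scope and the etcd3 host list, stays identical across nodes.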
Step 3: Bootstrap the Primary
# On node 1 only — this initializes the PostgreSQL data directory
# and registers this node as the initial leader in the DCS
sudo -u postgres /opt/patroni/bin/patroni /etc/patroni/patroni.yml
# Verify in another terminal that the node came up as leader
patronictl -c /etc/patroni/patroni.yml list
# Expected output:
# + Cluster: pg-production (7234567890123456789) ----+----+-----------+
# | Member | Host | Role | State | TL | Lag in MB |
# +------------+---------------+---------+---------+----+-----------+
# | pg-node-1 | 10.0.0.1:5432 | Leader | running | 1 | |
# +------------+---------------+---------+---------+----+-----------+
Step 4: Join Replicas
# On node 2 and node 3 — Patroni detects no data directory,
# runs pg_basebackup from the primary, and starts streaming replication
sudo -u postgres /opt/patroni/bin/patroni /etc/patroni/patroni.yml
# After both replicas start, list the cluster again
patronictl -c /etc/patroni/patroni.yml list
# Expected output:
# + Cluster: pg-production (7234567890123456789) ----+----+-----------+
# | Member | Host | Role | State | TL | Lag in MB |
# +------------+---------------+---------+---------+----+-----------+
# | pg-node-1 | 10.0.0.1:5432 | Leader | running | 1 | |
# | pg-node-2 | 10.0.0.2:5432 | Replica | running | 1 | 0 |
# | pg-node-3 | 10.0.0.3:5432 | Replica | running | 1 | 0 |
# +------------+---------------+---------+---------+----+-----------+
# Create a systemd service so Patroni starts on boot
sudo tee /etc/systemd/system/patroni.service > /dev/null <<'EOF'
[Unit]
Description=Patroni PostgreSQL HA
After=syslog.target network.target
[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=/opt/patroni/bin/patroni /etc/patroni/patroni.yml
ExecReload=/bin/kill -s HUP $MAINPID
KillMode=process
TimeoutSec=30
Restart=no
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni
patronictl: Managing Your Cluster
All day-to-day operations go through patronictl, which talks to the Patroni REST API and the DCS. Always specify the config file with -c so it knows which cluster to target.
# Show current cluster state (roles, lag, timeline)
patronictl -c /etc/patroni/patroni.yml list
# Planned switchover: graceful, waits for replica to catch up
# Use this during maintenance windows or controlled primary rotation
# (Patroni 3.x and later spell the --master flag as --leader)
patronictl -c /etc/patroni/patroni.yml switchover pg-production \
--master pg-node-1 \
--candidate pg-node-2 \
--scheduled now \
--force
# Forced failover — use only when primary is unresponsive
# Does NOT wait for replica to be fully caught up
patronictl -c /etc/patroni/patroni.yml failover pg-production \
--master pg-node-1 \
--candidate pg-node-2 \
--force
# Reinitialize a replica from scratch (e.g., after data corruption)
patronictl -c /etc/patroni/patroni.yml reinit pg-production pg-node-3
# Pause automatic failover for maintenance (e.g., OS patching)
patronictl -c /etc/patroni/patroni.yml pause pg-production
# Resume automatic failover after maintenance is complete
patronictl -c /etc/patroni/patroni.yml resume pg-production
# Edit DCS-stored cluster configuration live (ttl, loop_wait, etc.)
patronictl -c /etc/patroni/patroni.yml edit-config pg-production
# Show current DCS-stored configuration
patronictl -c /etc/patroni/patroni.yml show-config pg-production
Always use switchover for planned primary changes, not failover. A failover command does not wait for the replica's replication lag to reach zero, which means you may promote a replica that is missing the most recent transactions. Reserve failover for genuine emergencies when the primary is confirmed dead.
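A pre-flight check before a planned switchover can be scripted against the JSON form of the cluster listing (patronictl list with --format json). The sketch below makes assumptions: the SAMPLE payload is hand-written to mirror the table columns shown earlier, so treat its field names as an assumption and compare them with your own output, and switchover_blockers is a hypothetical helper.

```python
import json
from typing import List

# Hand-written sample; real output may carry additional fields.
SAMPLE = """[
 {"Member": "pg-node-1", "Host": "10.0.0.1:5432", "Role": "Leader", "State": "running", "TL": 2},
 {"Member": "pg-node-2", "Host": "10.0.0.2:5432", "Role": "Replica", "State": "running", "TL": 2, "Lag in MB": 0},
 {"Member": "pg-node-3", "Host": "10.0.0.3:5432", "Role": "Replica", "State": "running", "TL": 2, "Lag in MB": 3}
]"""


def switchover_blockers(members_json: str, candidate: str, max_lag_mb: int = 0) -> List[str]:
    """Return reasons a planned switchover to candidate looks unsafe (empty list = go)."""
    members = json.loads(members_json)
    problems: List[str] = []
    leaders = [m for m in members if m.get("Role") == "Leader"]
    if len(leaders) != 1:
        problems.append(f"expected exactly one leader, found {len(leaders)}")
    target = next((m for m in members if m.get("Member") == candidate), None)
    if target is None:
        problems.append(f"{candidate} is not a cluster member")
    elif target.get("Role") != "Replica":
        problems.append(f"{candidate} is not a replica")
    else:
        lag = target.get("Lag in MB")
        if not isinstance(lag, (int, float)):
            problems.append(f"{candidate} reports no numeric lag")
        elif lag > max_lag_mb:
            problems.append(f"{candidate} lags by {lag} MB")
    return problems


print(switchover_blockers(SAMPLE, "pg-node-2"))  # caught-up replica
print(switchover_blockers(SAMPLE, "pg-node-3"))  # lagging replica
```

Running the check first keeps the decision to proceed explicit, which matters most when a script passes --force to patronictl.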
Automatic Failover in Action
Understanding the sequence of events during an unplanned primary failure helps you tune Patroni correctly and set realistic RTO expectations.
T+0 s — Primary dies. The PostgreSQL process on pg-node-1 terminates (or the host becomes unreachable). The Patroni daemon on pg-node-1 either exits or enters an error state and stops renewing the leader lock in the DCS.
T+10 s — Replicas notice the silence. At their next loop_wait iteration (default 10 s), pg-node-2 and pg-node-3 attempt to refresh their view of the DCS leader key. They observe it has not been renewed.
T+30 s — Leader lock expires. After ttl seconds (default 30 s), the DCS automatically expires the leader key. This is the moment an election can begin — the DCS design ensures no replica can steal the lock before ttl elapses, preventing premature promotions if the primary is merely slow.
T+30 s — Election runs. Every replica with a replication lag below maximum_lag_on_failover races to write its name into the DCS leader key using a compare-and-swap operation. On a three-node cluster with two replicas both at zero lag, one wins in milliseconds.
T+31 s — Promotion. The winning replica calls pg_ctl promote. PostgreSQL exits recovery mode and begins accepting read-write connections. Its timeline ID increments by one.
T+32 s — Other replica reconnects. The losing replica reads the new leader name from the DCS, updates its primary_conninfo, and starts streaming from the newly promoted node. No human action required.
T+33 s — HAProxy detects the change. HAProxy's health checks (every 2 s) hit /master on each node. The old primary returns 503 or times out; the new primary returns 200. HAProxy immediately routes all new write connections to pg-node-2.
Total downtime: approximately 30–35 seconds with default settings on a LAN. This window can be reduced by lowering ttl and loop_wait, though doing so requires a stable, low-latency DCS connection to avoid false positives.
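The timeline can be turned into a back-of-envelope estimate. This is a rough model under stated assumptions rather than a measurement: the leader last refreshed its lock somewhere between 0 and loop_wait seconds before the crash, promotion takes about 2 seconds, and the proxy needs rise consecutive checks spaced inter seconds apart (the HAProxy values used in this article). The function name and the 20/5 example are illustrative.

```python
from typing import Tuple


def failover_window(ttl: float, loop_wait: float, promote_s: float = 2.0,
                    check_inter: float = 2.0, check_rise: int = 2) -> Tuple[float, float]:
    """Return (best, worst) seconds from primary crash to writes being routed again."""
    if loop_wait >= ttl:
        raise ValueError("loop_wait must be smaller than ttl")
    # The lock was last refreshed between 0 and loop_wait seconds before the crash.
    expiry_best = ttl - loop_wait
    expiry_worst = ttl
    # Upper bound for the proxy: check_rise consecutive passing checks.
    proxy_delay = check_inter * check_rise
    return (expiry_best + promote_s + proxy_delay,
            expiry_worst + promote_s + proxy_delay)


print(failover_window(ttl=30, loop_wait=10))  # defaults: (26.0, 36.0)
print(failover_window(ttl=20, loop_wait=5))   # tighter settings: (21.0, 26.0)
```

Lock expiry dominates the total, which is why ttl is the first knob to look at when the recovery time objective is tighter than the defaults allow.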
Integrating with HAProxy for a Stable Endpoint
Applications should connect to a single stable endpoint rather than directly to PostgreSQL nodes, so they survive failovers without reconfiguration. HAProxy, running on a separate host or as a container sidecar, uses Patroni's REST API health checks to route traffic dynamically.
# haproxy.cfg — route writes to Patroni leader, reads to replicas
global
maxconn 1000
defaults
mode tcp
timeout connect 5s
timeout client 30s
timeout server 30s
retries 3
# Write endpoint — port 5432 — always goes to the current Patroni leader
frontend pg_write
bind *:5432
default_backend pg_primary
backend pg_primary
option httpchk GET /master
http-check expect status 200
default-server inter 2s fall 3 rise 2 on-marked-down shutdown-sessions
server pg-node-1 10.0.0.1:5432 check port 8008
server pg-node-2 10.0.0.2:5432 check port 8008
server pg-node-3 10.0.0.3:5432 check port 8008
# Read endpoint — port 5433 — distributes reads across healthy replicas
frontend pg_read
bind *:5433
default_backend pg_replicas
backend pg_replicas
balance roundrobin
option httpchk GET /replica
http-check expect status 200
default-server inter 2s fall 3 rise 2
server pg-node-1 10.0.0.1:5432 check port 8008
server pg-node-2 10.0.0.2:5432 check port 8008
server pg-node-3 10.0.0.3:5432 check port 8008
HAProxy polls each node's Patroni REST API every 2 seconds (inter 2s). When a failover occurs and a replica is promoted to leader, its /master endpoint begins returning 200 and its /replica endpoint returns 503. Because rise 2 requires two consecutive passing checks, HAProxy adds the new primary to the write pool roughly 2–4 seconds after the Patroni promotion completes.
The on-marked-down shutdown-sessions directive in the pg_primary backend forces HAProxy to immediately terminate existing connections to a node that fails its health check. Without this, long-running transactions may remain connected to a demoted node until they time out on their own.
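How fall 3 and rise 2 translate into detection time is easier to see in a simulation. This is a simplified model of consecutive-check counting, not HAProxy's actual implementation, and track_server is a hypothetical name.

```python
from typing import Iterable, List


def track_server(check_results: Iterable[bool], fall: int = 3, rise: int = 2,
                 initially_up: bool = True) -> List[bool]:
    """Replay health-check outcomes and return the up/down state after each check."""
    up = initially_up
    streak = 0  # consecutive results that contradict the current state
    states: List[bool] = []
    for passed in check_results:
        if passed == up:
            streak = 0              # result agrees with the current state
        else:
            streak += 1
            if streak >= (fall if up else rise):
                up = not up         # enough consecutive contradictions: flip
                streak = 0
        states.append(up)
    return states


# Old primary: marked down on the 3rd failed check (about 6 s at inter 2s).
print(track_server([True, False, False, False, False]))
# New primary: marked up on the 2nd passing check (about 4 s at inter 2s).
print(track_server([False, True, True, True], initially_up=False))
```

Lowering fall shortens the time a dead primary stays in the pool, at the cost of ejecting a node after a shorter run of failed checks.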
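# Verify HAProxy sees the correct state after a failover
# From the HAProxy host, print proxy name, server name, and status for the write backend
# (requires a "stats socket /var/run/haproxy/admin.sock" line in the global section)
echo "show stat" | socat stdio /var/run/haproxy/admin.sock | \
grep "^pg_primary," | cut -d, -f1,2,18
# Direct REST API checks against each Patroni node (prints only the HTTP status code)
curl -s -o /dev/null -w '%{http_code}\n' http://10.0.0.1:8008/master # 200 = current leader
curl -s -o /dev/null -w '%{http_code}\n' http://10.0.0.2:8008/master # 503 = not leader
curl -s -o /dev/null -w '%{http_code}\n' http://10.0.0.3:8008/replica # 200 = healthy replica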
# Get full cluster status from any node's REST API
curl -s http://10.0.0.1:8008/patroni | python3 -m json.tool
Common Patroni Pitfalls
Split-Brain Prevention and DCS Quorum
The most dangerous failure mode in any HA system is split-brain: two nodes both believe they are the primary and both accept writes. Patroni prevents this by making the DCS the single source of truth for leader status. A node that cannot reach the DCS cannot confirm it still holds the leader lock, so it steps down. This is why the DCS cluster must itself have an odd number of nodes with quorum: a three-node etcd cluster can tolerate one node failure; a five-node cluster can tolerate two. Never run a single-node etcd cluster in production — a DCS failure would make your Patroni cluster leaderless.
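The quorum arithmetic behind those numbers is short enough to write out. The function names are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest number of DCS members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

An even-sized cluster tolerates no more failures than the odd size below it (four members still tolerate only one), which is why odd sizes are the usual recommendation.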
For additional protection, Patroni supports a hardware or software watchdog device (/dev/watchdog). While it holds the leader lock, Patroni keeps resetting the watchdog timer. If Patroni hangs, or cannot demote PostgreSQL before the lock expires, the timer runs out and the watchdog reboots the machine. This guarantees the old primary cannot accidentally continue accepting writes even if the Patroni process itself stops responding.
TTL and loop_wait Tuning
The relationship between ttl and loop_wait determines your failover window and your false-positive rate. The default values (ttl=30, loop_wait=10) are conservative and appropriate for most environments. Lowering ttl to 15 s and loop_wait to 5 s cuts the maximum failover window roughly in half, but increases the risk of spurious failovers caused by momentary network hiccups or a busy host that cannot refresh the DCS lock in time. Patroni's documentation also requires loop_wait + 2 * retry_timeout <= ttl, so retry_timeout has to come down as well (to 5 s or less in this example). Recent releases may also enforce a minimum ttl, so check the limits for your version before going this low. Tune these values based on measured DCS round-trip times in your environment, not guesswork.
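One constraint worth checking before changing these values is the timing rule from Patroni's documentation, loop_wait + 2 * retry_timeout <= ttl. The checker below is a small sketch with a hypothetical function name. It validates only that one rule, not any version-specific minimums.

```python
from typing import List


def timing_problems(ttl: int, loop_wait: int, retry_timeout: int) -> List[str]:
    """Check the documented rule: loop_wait + 2 * retry_timeout <= ttl."""
    problems: List[str] = []
    needed = loop_wait + 2 * retry_timeout
    if needed > ttl:
        problems.append(
            f"ttl={ttl} is too small: loop_wait + 2*retry_timeout = {needed}"
        )
    return problems


print(timing_problems(ttl=30, loop_wait=10, retry_timeout=10))  # defaults satisfy the rule
print(timing_problems(ttl=15, loop_wait=5, retry_timeout=10))   # lowering only two knobs breaks it
print(timing_problems(ttl=15, loop_wait=5, retry_timeout=5))    # retry_timeout lowered as well
```

A check like this fits naturally in a review step before applying changes with patronictl edit-config.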
-- After a failover, verify the new primary accepted writes correctly
-- Connect through the HAProxy write endpoint (port 5432)
SELECT pg_is_in_recovery(); -- Should return false on the new primary
-- Check the current timeline ID — it should have incremented after promotion
SELECT timeline_id FROM pg_control_checkpoint();
-- Verify replication is streaming from the new primary to the replicas
SELECT
application_name,
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
(sent_lsn - replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;
pg_rewind Requirements
When a former primary rejoins the cluster as a replica after a failover, its WAL history has diverged from the new primary's since the moment it was isolated. Without pg_rewind, Patroni must wipe the data directory and run a full pg_basebackup from the new primary, which can take hours on a large database. With pg_rewind, Patroni rewinds the former primary's data directory to the divergence point by copying only the blocks that changed, then replays WAL from the new primary, a process that typically completes in seconds to minutes. The prerequisites are wal_log_hints = on or data checksums (wal_log_hints is set in patroni.yml under bootstrap.dcs.postgresql.parameters) and a user that is allowed to run the rewind, configured in the postgresql.authentication.rewind section. Both must be in place before a failover occurs; enabling them afterwards does not help a node that has already diverged.
The rewind user configured in patroni.yml does not need to be a superuser on PostgreSQL 11 and later, but it must have EXECUTE permission on the functions pg_rewind calls (pg_ls_dir, pg_stat_file, and pg_read_binary_file). If the rewind cannot run, the failed primary has to be rebuilt with a full pg_basebackup, significantly extending the time before the cluster returns to full three-node redundancy.
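Whether a node satisfies PostgreSQL's own pg_rewind preconditions can be checked from three settings. Per the PostgreSQL documentation, the target needs either wal_log_hints or data checksums, and full_page_writes must be on. The helper below is a sketch with a hypothetical name, and it assumes the settings dict was filled from SHOW output on the node in question.

```python
from typing import Dict


def rewind_ready(settings: Dict[str, str]) -> bool:
    """True if the settings meet pg_rewind's documented preconditions."""
    hints = settings.get("wal_log_hints") == "on"
    checksums = settings.get("data_checksums") == "on"
    # full_page_writes defaults to on in PostgreSQL, so a missing key counts as on.
    full_page_writes = settings.get("full_page_writes", "on") == "on"
    return (hints or checksums) and full_page_writes


print(rewind_ready({"wal_log_hints": "on", "data_checksums": "off"}))   # True
print(rewind_ready({"wal_log_hints": "off", "data_checksums": "on"}))   # True
print(rewind_ready({"wal_log_hints": "off", "data_checksums": "off"}))  # False
```

Running the check on every node, not just the current primary, matters because any replica may become the primary that later needs rewinding.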
- Patroni automates PostgreSQL failover by delegating leader authority to an external DCS (etcd), eliminating the need for any human action during a primary failure in the common case.
- A three-node Patroni cluster with a three-node etcd quorum is the minimum for a split-brain-free production deployment; neither the Patroni nodes nor the etcd nodes should run as single instances.
- Default failover time with ttl=30 and loop_wait=10 is approximately 30–35 seconds end-to-end; lowering these values reduces RTO but requires a stable, low-latency DCS connection.
- HAProxy integrates with Patroni via the /master and /replica REST endpoints on port 8008, providing a stable connection string that survives failovers without any application reconfiguration.
- Enable pg_rewind (wal_log_hints = on or data-checksums, plus a rewind user with the required function privileges) before your first failover; it allows a demoted primary to rejoin in minutes rather than hours.
- Use patronictl switchover for planned maintenance and patronictl failover only in genuine emergencies; pausing Patroni with patronictl pause during OS patching prevents accidental automated failovers.
Working with JusDB on PostgreSQL High Availability
JusDB designs and operates Patroni clusters for PostgreSQL deployments that cannot afford unplanned downtime. We configure etcd quorum, tune TTL and loop_wait for your network latency, integrate HAProxy, and provide 24/7 failover monitoring.