High Availability

PostgreSQL High Availability with Patroni and HAProxy

Build a production-grade PostgreSQL HA stack using Patroni for failover and HAProxy for connection routing

JusDB Team
January 24, 2023
13 min read
177 views

Running PostgreSQL in production without a high-availability strategy is a single point of failure waiting to become an incident. When your primary node crashes at 2am, the difference between a 30-second automated failover and a 45-minute manual recovery is the difference between a minor blip and a major outage on your SLA report. Patroni, combined with HAProxy, gives you exactly that automated failover — but the configuration surface area is large enough that getting it wrong is easy. This guide walks through every critical piece of a production-grade PostgreSQL HA stack, from Distributed Configuration Store (DCS) selection to HAProxy health checks to watchdog configuration, so you can build it correctly the first time.

TL;DR
  • Patroni manages leader election and streaming replication using a DCS (etcd, Consul, or ZooKeeper) as a distributed lock.
  • HAProxy routes read/write traffic by polling the Patroni REST API (/master and /replica endpoints) rather than checking PostgreSQL directly.
  • The watchdog kernel module is essential in production — without it, a network partition can cause a split-brain where two nodes both believe they are the primary.
  • Failover typically completes in 10–30 seconds depending on ttl, loop_wait, and retry_timeout settings.
  • Patroni supersedes repmgr for most greenfield deployments due to its active management model and REST API.

Architecture Overview

A Patroni cluster has three logical layers: the PostgreSQL instances themselves, the Patroni agents running on each database node, and the Distributed Configuration Store (DCS) that acts as the source of truth for cluster state.

Each Patroni agent runs as a daemon alongside PostgreSQL. The primary node holds a leader key in the DCS with a TTL (time-to-live). Every loop_wait seconds, the primary refreshes that key. If the primary dies and fails to refresh within ttl seconds, the key expires and a replica race begins: each standby attempts to acquire the key, and the winner promotes itself to primary and updates the cluster topology in the DCS.

HAProxy sits in front of all PostgreSQL nodes. Rather than tracking cluster state itself, HAProxy uses HTTP health checks against the Patroni REST API to determine which backend is the current primary and which are replicas. This decouples connection routing entirely from PostgreSQL-level logic.

The three supported DCS options each have tradeoffs:

  • etcd — The most common choice. Lightweight, purpose-built for distributed coordination, and well-documented with Patroni. Recommended for most teams.
  • Consul — Better choice if your organization already runs Consul for service discovery, since you avoid running a separate etcd cluster.
  • ZooKeeper — Mature and battle-tested, but operationally heavier (requires JVM). Only choose this if you already have ZooKeeper expertise on your team.
Warning

Your DCS cluster must be highly available itself. Running a single etcd node defeats the entire purpose — if etcd goes down, Patroni cannot refresh the leader key and your cluster will enter a read-only degraded mode. Run etcd as a 3-node or 5-node cluster on separate hosts from your PostgreSQL nodes.

Installing Patroni

Patroni is a Python package. Install it on every PostgreSQL node using pip, ideally inside a virtual environment to avoid conflicts with system Python packages. You also need the appropriate DCS client library.

bash
# Install Patroni with etcd support
python3 -m venv /opt/patroni
/opt/patroni/bin/pip install patroni[etcd]

# Create the patroni user (do not use postgres user)
useradd --system --home /etc/patroni --shell /bin/false patroni

# Create directories
mkdir -p /etc/patroni /var/log/patroni /data/patroni
chown patroni:patroni /etc/patroni /var/log/patroni /data/patroni

Create a systemd unit file at /etc/systemd/system/patroni.service:

ini
[Unit]
Description=Patroni HA PostgreSQL Agent
After=network.target

[Service]
Type=simple
User=patroni
Group=patroni
ExecStart=/opt/patroni/bin/patroni /etc/patroni/patroni.yml
KillMode=process
TimeoutSec=30
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Patroni Configuration

The patroni.yml file is the heart of your setup. The configuration below covers a 3-node cluster (pg-node-1, pg-node-2, pg-node-3) using etcd. Customize per node by changing name and connect_address.

yaml
scope: pg-production
namespace: /db/
name: pg-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.11:8008

etcd:
  hosts:
    - 10.0.0.1:2379
    - 10.0.0.2:2379
    - 10.0.0.3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1 MB
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_connections: 200
        max_wal_senders: 10
        max_replication_slots: 10
        wal_keep_size: 512
        synchronous_commit: "on"

  initdb:
    - encoding: UTF8
    - data-checksums

  pg_hba:
    - host replication replicator 10.0.1.0/24 scram-sha-256
    - host all all 10.0.0.0/8 scram-sha-256

  users:
    admin:
      password: "changeme_in_vault"
      options:
        - createrole
        - createdb

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.11:5432
  data_dir: /data/patroni/pg_data
  bin_dir: /usr/lib/postgresql/16/bin
  pgpass: /tmp/pgpass

  authentication:
    replication:
      username: replicator
      password: "changeme_in_vault"
    superuser:
      username: postgres
      password: "changeme_in_vault"

  parameters:
    unix_socket_directories: "/var/run/postgresql"

watchdog:
  mode: required
  device: /dev/watchdog
  safety_margin: 5

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
Tip

Set maximum_lag_on_failover carefully. A value of 1 MB means Patroni will refuse to promote a replica that is more than 1 MB behind the primary's WAL position. This prevents data loss from promoting a badly lagging standby, but can cause a failover to stall if all replicas are lagging. Monitor replica lag continuously and alert before it approaches this threshold.

The pg_hba.conf replication entry is critical. Without it, replicas cannot connect to stream WAL from the primary:

text
# In bootstrap.pg_hba — applied automatically by Patroni on cluster init
host replication replicator 10.0.1.0/24 scram-sha-256

If you are adding Patroni to an existing cluster, add this line to your existing pg_hba.conf manually and reload PostgreSQL before starting Patroni.

HAProxy Integration

HAProxy uses Patroni's REST API to health-check each node. The primary node returns HTTP 200 on GET /master; replicas return HTTP 503. This means HAProxy's health check drives connection routing without any PostgreSQL-level involvement.

A standard HAProxy configuration for a 3-node Patroni cluster, with separate frontends for read/write and read-only traffic:

haproxy
global
    maxconn 100
    log /dev/log local0
    log /dev/log local1 notice

defaults
    log global
    mode tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

#---------------------------------------------------------------------
# Primary — read/write connections (port 5000)
#---------------------------------------------------------------------
listen pg_primary
    bind *:5000
    option httpchk GET /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node-1 10.0.1.11:5432 check port 8008
    server pg-node-2 10.0.1.12:5432 check port 8008
    server pg-node-3 10.0.1.13:5432 check port 8008

#---------------------------------------------------------------------
# Replicas — read-only connections (port 5001)
#---------------------------------------------------------------------
listen pg_replicas
    bind *:5001
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node-1 10.0.1.11:5432 check port 8008
    server pg-node-2 10.0.1.12:5432 check port 8008
    server pg-node-3 10.0.1.13:5432 check port 8008

#---------------------------------------------------------------------
# HAProxy stats (optional but recommended)
#---------------------------------------------------------------------
listen stats
    bind *:7000
    stats enable
    stats uri /
    stats refresh 5s

Applications connect to port 5000 for read/write workloads and port 5001 for read-only queries. After a failover, HAProxy's health checks detect the new primary within one check interval (3 seconds in this config) and reroute traffic automatically — no application changes required.

Warning

The on-marked-down shutdown-sessions directive terminates existing connections to a backend that fails its health check. This is intentional — you do not want application connections silently writing to a node that is no longer the primary. Make sure your application connection pool handles reconnects gracefully.

Failover and Switchover

Patroni provides the patronictl CLI for cluster management. Always use this tool rather than interacting with PostgreSQL or the DCS directly.

To view current cluster state:

bash
patronictl -c /etc/patroni/patroni.yml list

# Example output:
# + Cluster: pg-production (7891234567890) ---+----+-----------+
# | Member     | Host          | Role    | State   | TL | Lag in MB |
# +------------+---------------+---------+---------+----+-----------+
# | pg-node-1  | 10.0.1.11:5432| Leader  | running |  3 |           |
# | pg-node-2  | 10.0.1.12:5432| Replica | running |  3 |         0 |
# | pg-node-3  | 10.0.1.13:5432| Replica | running |  3 |         0 |
# +------------+---------------+---------+---------+----+-----------+

A planned switchover (graceful, no data loss) — used for maintenance:

bash
patronictl -c /etc/patroni/patroni.yml switchover pg-production \
  --master pg-node-1 \
  --candidate pg-node-2 \
  --scheduled now

A forced failover (emergency, when the primary is unreachable):

bash
patronictl -c /etc/patroni/patroni.yml failover pg-production \
  --master pg-node-1 \
  --candidate pg-node-2 \
  --force

You can also query and trigger failover via the REST API directly, which is useful for automation and integration with external orchestration systems:

bash
# Check which node is the current primary
curl -s http://10.0.1.11:8008/master | jq .

# Check replica status on a standby node
curl -s http://10.0.1.12:8008/replica | jq .

# Trigger switchover via REST API
curl -s -XPOST -H "Content-Type: application/json" \
  http://10.0.1.11:8008/switchover \
  -d '{"leader": "pg-node-1", "candidate": "pg-node-2"}'

Failover timing is controlled by three parameters in your DCS config section. With the values in this guide (ttl: 30, loop_wait: 10, retry_timeout: 10), the worst-case failover time is approximately ttl + retry_timeout = 40 seconds. In practice, it is usually faster because Patroni detects failure before the TTL fully expires. Reducing TTL speeds up failover but increases sensitivity to transient network hiccups causing false positives.

Watchdog Configuration

The watchdog is what separates a production Patroni deployment from a development one. Without a watchdog, a network partition between the primary and the DCS can result in split-brain: the primary cannot refresh its leader key, a replica promotes itself, and now two nodes accept writes simultaneously.

The watchdog is a hardware or software timer. Patroni must actively reset it every few seconds. If Patroni loses contact with the DCS and cannot safely determine cluster state, it stops resetting the watchdog, the timer expires, and the kernel forces a hard reboot of the node — fencing it out of the cluster before the replica promotes.

bash
# Load the software watchdog kernel module (for VMs without hardware watchdog)
modprobe softdog

# Make it persistent across reboots
echo "softdog" >> /etc/modules-load.d/softdog.conf

# Grant the patroni user access to the watchdog device
echo 'KERNEL=="watchdog", OWNER="patroni", GROUP="patroni"' \
  > /etc/udev/rules.d/99-patroni-watchdog.rules

udevadm trigger
ls -la /dev/watchdog  # Verify ownership

Set watchdog.mode: required in patroni.yml (as shown in the configuration section). With required, Patroni refuses to start if it cannot open the watchdog device — preventing a misconfigured node from running without this safety mechanism.

Tip

On cloud instances (AWS EC2, GCP Compute Engine), use the softdog kernel module since hardware watchdog devices are not exposed. On bare metal with a hardware watchdog (e.g., /dev/watchdog0), prefer the hardware device — it persists through kernel panics that softdog cannot survive.

Monitoring

Patroni exposes cluster health through its REST API, which makes monitoring straightforward to integrate with any HTTP-capable monitoring system.

bash
# Health check endpoints — use these in your monitoring system
# Returns 200 if node is the primary, 503 otherwise
GET http://:8008/master

# Returns 200 if node is a healthy replica, 503 otherwise
GET http://:8008/replica

# Full cluster state as JSON
GET http://:8008/cluster

# Node-level status (role, state, lag, timeline)
GET http://:8008/patroni

Key metrics to alert on:

  • Replica lag — available from /patroni response as replication_lag. Alert at 50% of maximum_lag_on_failover.
  • Timeline mismatch — all nodes should share the same timeline ID (timeline field). A mismatch indicates a replica that has not rejoined after a previous failover.
  • DCS connectivity — if Patroni cannot reach the DCS, it logs DCS is not accessible. Alert immediately; this is a pre-condition for split-brain.
  • Cluster member countGET /cluster returns all members. Alert if the expected number of members drops.

For Prometheus users, the patroni_exporter community project scrapes these endpoints and exposes them as Prometheus metrics, making it easy to build Grafana dashboards for replication lag, failover events, and DCS health.

Patroni vs repmgr

repmgr is the other common PostgreSQL HA tool and is worth understanding in contrast to Patroni:

  • Management model — repmgr is passive by default; it monitors and alerts but requires manual or scripted intervention for promotion. Patroni is active — it automatically promotes and reconfigures replicas without human input.
  • Fencing — Patroni has built-in watchdog integration. repmgr relies on external fencing scripts that teams must write and maintain themselves.
  • Configuration management — Patroni stores and distributes PostgreSQL configuration via the DCS, ensuring all nodes share the same settings. With repmgr, you manage postgresql.conf across nodes yourself.
  • REST API — Patroni's REST API is first-class and enables HAProxy integration, external automation, and health monitoring with no additional tooling. repmgr has no equivalent.
  • Operational complexity — Patroni requires a DCS cluster, which is an additional component to operate. repmgr has no external dependencies beyond PostgreSQL itself. If you cannot run etcd or Consul, repmgr may be the more practical choice.

For greenfield deployments where you can run etcd, Patroni is the stronger choice for production due to its automated failover, watchdog integration, and HAProxy-compatible health API.

Key Takeaways
  • Patroni uses a DCS (etcd, Consul, or ZooKeeper) as a distributed lock for leader election — your DCS cluster must itself be highly available on separate nodes.
  • HAProxy polls the Patroni REST API (/master for read/write, /replica for read-only) to route traffic without any PostgreSQL-level awareness.
  • The watchdog (set to mode: required) is non-negotiable in production — it fences the old primary during network partitions and prevents split-brain writes.
  • Failover timing is ttl + retry_timeout in the worst case. Tune these values based on your RTO requirements and tolerance for false-positive failovers.
  • use_pg_rewind: true allows a former primary to rejoin the cluster as a replica after a failover without requiring a full base backup — a significant operational time-saver.
  • The Patroni REST API is the integration surface for HAProxy, monitoring systems, and automation — use it instead of querying PostgreSQL directly for cluster state.
  • For greenfield HA deployments, Patroni is preferred over repmgr due to its active management model, built-in fencing, and first-class REST API.

Managed PostgreSQL Without the Operational Overhead

Building and operating a Patroni cluster — etcd, watchdog configuration, HAProxy health checks, monitoring, and ongoing DCS maintenance — is a significant ongoing investment. If your team's time is better spent on product work than database operations, JusDB provides fully managed PostgreSQL with automated high availability, point-in-time recovery, and connection pooling built in.

You get production-grade HA without running your own Patroni agents, etcd cluster, or HAProxy fleet. Talk to the JusDB team to see how managed PostgreSQL fits your architecture.

Share this article