Database Performance

PostgreSQL Checkpoint and WAL Tuning for High Throughput

Tune PostgreSQL checkpoint frequency, WAL segment size, and bgwriter to maximize write throughput

JusDB Team
March 14, 2023
11 min read
159 views
PostgreSQL Checkpoint and WAL Tuning for High Throughput | JusDB

A PostgreSQL database that performs beautifully under light load can grind to a halt the moment checkpoint I/O competes with your application writes. At 50,000 transactions per minute, the default checkpoint settings ship PostgreSQL into a wall of disk latency that no amount of RAM or faster CPUs will fix. The culprit is almost always misconfigured checkpoints and WAL parameters that were never revisited after initial deployment. Getting these numbers right is one of the highest-leverage tuning exercises available to a PostgreSQL DBA, and the changes are almost always non-destructive.

TL;DR
  • Set checkpoint_completion_target = 0.9 to spread checkpoint I/O across 90% of the checkpoint interval instead of spiking at the end.
  • Raise max_wal_size (try 4 GB–16 GB) to reduce checkpoint frequency under heavy write loads.
  • Set min_wal_size to a meaningful value (e.g., 1 GB) to prevent WAL segment recycling churn.
  • Size wal_buffers to 64 MB for most workloads; the default (auto-tuned to 1/32 of shared_buffers) is often too small.
  • Enable wal_compression to reduce WAL volume at the cost of a small CPU overhead.
  • Monitor pg_stat_bgwriter to distinguish timed checkpoints from demand checkpoints.
  • Watch checkpoints_req — a rising count means WAL is filling faster than your interval allows.

Background: Why Checkpoints and WAL Matter

PostgreSQL uses a Write-Ahead Log (WAL) to guarantee durability. Every change to a data page is first written sequentially to WAL before it is applied to the actual heap files. This design means crash recovery only needs to replay WAL from the last known-good checkpoint, rather than verify every data page on disk.

A checkpoint is the process of flushing all dirty shared buffer pages to disk and recording a consistent point in WAL from which recovery can start. After a successful checkpoint, PostgreSQL can reclaim the WAL segments that preceded it. This sounds routine, but the flush operation is inherently expensive: every dirty page in shared_buffers must be written to disk. On a write-heavy OLTP system with hundreds of gigabytes of shared buffers, a naive checkpoint can saturate disk I/O for seconds at a time, causing application latency spikes that are nearly impossible to diagnose without knowing where to look.

Two forces trigger a checkpoint:

  • Time-based (timed): checkpoint_timeout elapses (default 5 minutes).
  • Demand-based (required): WAL grows past max_wal_size (default 1 GB).

Demand checkpoints are dangerous. They happen outside the scheduled interval, often at the worst possible moment — peak traffic — and they cannot be spread out gracefully by checkpoint_completion_target. The entire goal of WAL and checkpoint tuning is to eliminate demand checkpoints and make timed checkpoints as smooth as possible.

How Checkpoints Work Internally

When a checkpoint begins, PostgreSQL identifies every dirty page in shared_buffers. Rather than writing them all at once (which would cause a large I/O spike), it spreads the writes over a target window defined by checkpoint_completion_target. The checkpointer process aims to finish writing all dirty pages within checkpoint_completion_target × checkpoint_timeout seconds — for example, 0.9 × 300s = 270s out of a 300-second interval.

Full-page writes amplify WAL volume significantly. After a checkpoint, the first time PostgreSQL modifies any data page it writes the entire 8 KB page into WAL (not just the changed bytes). This protects against partial-page writes on crash, but it means WAL volume can be 2–5x larger than the raw change volume in the period immediately following a checkpoint. Tuning checkpoint_timeout to allow longer intervals directly reduces the fraction of time spent in the post-checkpoint full-page-write window, which in turn reduces total WAL volume and the I/O tax on your storage subsystem.

Warning

Do not disable full_page_writes unless you are running on storage that guarantees atomic 8 KB writes (very rare). Disabling it risks silent data corruption after a crash. The correct lever for reducing full-page-write overhead is lengthening the checkpoint interval, not disabling the protection.

WAL Configuration Parameters

These parameters live in postgresql.conf and require a reload (or restart, where noted):

ini
# Require restart
wal_buffers = 64MB          # default: -1 (auto, usually ~4 MB for most configs)

# Require reload
max_wal_size = 8GB          # default: 1GB  — max WAL before a forced checkpoint
min_wal_size = 1GB          # default: 80MB — keep at least this much WAL pre-allocated
checkpoint_completion_target = 0.9   # default: 0.9 in PG 14+, 0.5 in older versions
checkpoint_timeout = 10min  # default: 5min — how often to checkpoint on time
wal_compression = on        # default: off  — compress full-page images in WAL

wal_buffers

WAL buffers are a small shared-memory ring buffer where WAL records are staged before being flushed to disk. The auto-tuned default is 1/32 of shared_buffers, capped at 64 MB. If your shared_buffers is 8 GB, the auto value is 256 MB — which is excessive. If it is 512 MB, the auto value is only 16 MB — which can cause contention on high-concurrency workloads. Setting wal_buffers = 64MB explicitly is a safe, well-tested value for most OLTP deployments. This parameter requires a server restart.

max_wal_size and min_wal_size

max_wal_size is the soft limit on how much WAL PostgreSQL will accumulate before forcing a demand checkpoint. The default of 1 GB is far too low for write-heavy workloads — it can force dozens of demand checkpoints per hour on a busy system. Raising it to 4 GB–16 GB gives the checkpointer room to operate on schedule.

min_wal_size controls how many WAL segments are kept pre-allocated on disk. When PostgreSQL recycles a WAL segment, it renames it rather than deleting and recreating it, which avoids filesystem allocation overhead. Setting min_wal_size = 1GB ensures there is always a pool of pre-allocated segments available, preventing recycling churn during write bursts.

wal_compression

wal_compression = on compresses full-page images inside WAL records using LZ4 (or zstd in PostgreSQL 15+). Since full-page images dominate WAL volume after a checkpoint, this can reduce WAL size by 40–70% on typical data, directly lowering both WAL write I/O and WAL archiving bandwidth. The CPU cost is measurable but rarely a bottleneck — most systems see less than 1% overhead.

Tip

If you are running PostgreSQL 15 or later, prefer wal_compression = zstd over lz4. zstd achieves better compression ratios at comparable speed, which matters most during WAL archiving to object storage where bandwidth costs are real.

Checkpoint Tuning in Practice

Start by diagnosing the current state. Run this query to see how your checkpoints are distributed:

sql
SELECT
  checkpoints_timed,
  checkpoints_req,
  checkpoint_write_time,
  checkpoint_sync_time,
  buffers_checkpoint,
  buffers_clean,
  buffers_backend,
  maxwritten_clean,
  stats_reset
FROM pg_stat_bgwriter;

Key columns to examine:

  • checkpoints_timed: Checkpoints triggered by checkpoint_timeout. You want this to be the dominant number.
  • checkpoints_req: Checkpoints triggered by WAL size or explicit CHECKPOINT commands. A non-zero and rising value is a red flag — your max_wal_size is too small.
  • checkpoint_write_time / checkpoint_sync_time: Milliseconds spent writing dirty buffers vs. syncing them. High sync_time indicates storage I/O pressure.
  • buffers_backend: Pages written directly by backend processes (not the checkpointer or bgwriter). A high value means backends are being forced to flush their own dirty pages — a sign of bgwriter under-configuration or memory pressure.
  • maxwritten_clean: How many times bgwriter hit its bgwriter_lru_maxpages limit and stopped cleaning. A non-zero value suggests bgwriter is not keeping up with the dirty page rate.

Setting checkpoint_completion_target

PostgreSQL 14 changed the default from 0.5 to 0.9, which was a meaningful improvement. If you are on PostgreSQL 13 or earlier, verifying this parameter is the single most impactful checkpoint change you can make. At 0.5, the checkpointer attempts to finish in the first half of the interval, creating an I/O spike in the second half. At 0.9, the I/O is spread across nearly the entire interval.

ini
checkpoint_completion_target = 0.9

Watching for checkpoint_warning

PostgreSQL logs a warning whenever a checkpoint takes longer than checkpoint_warning seconds (default: 30 seconds). Watch your logs for lines like:

text
LOG:  checkpoint request at ...; write: 45.823 s, sync: 2.109 s, total: 47.932 s;
      distance: 4096 MB, estimate: 3840 MB
WARNING:  checkpoints are occurring too frequently (23 seconds apart)

"Occurring too frequently" means demand checkpoints are firing before the timed interval. The fix is to raise max_wal_size until the warning disappears and checkpoints_req in pg_stat_bgwriter stops growing.

bgwriter_lru_maxpages

The background writer (bgwriter) proactively cleans dirty pages from shared buffers so that when a backend needs a free buffer, one is available without an on-demand flush. The parameter bgwriter_lru_maxpages (default: 100) caps how many pages bgwriter can clean per cycle. On write-heavy systems, increasing this to 200–500 can reduce buffers_backend writes significantly:

ini
bgwriter_lru_maxpages = 200
bgwriter_delay = 50ms       # default: 200ms — how often bgwriter wakes up
Warning

Do not set bgwriter_lru_maxpages excessively high (e.g., 1000+). An overactive bgwriter competes with the checkpointer and application I/O, potentially causing more harm than good. Increase it incrementally and measure buffers_backend and maxwritten_clean after each change.

Monitoring Checkpoint I/O with iostat

pg_stat_bgwriter gives cumulative counters, but for real-time visibility into checkpoint I/O impact, iostat is invaluable:

bash
# Report device utilization every 2 seconds, extended statistics
iostat -xz 2

Focus on the device hosting your PostgreSQL data directory. Key metrics:

  • %util: Device saturation. Values above 80–90% during checkpoints confirm I/O is the bottleneck.
  • await (or r_await / w_await): Average I/O latency. Spikes here correspond to checkpoint I/O waves.
  • wMB/s: Write throughput. A distinctive pattern of writes that peak at regular intervals (corresponding to your checkpoint_timeout) confirms you are seeing checkpoint I/O.

Correlate iostat output with your application latency metrics and PostgreSQL log timestamps to confirm that I/O spikes align with checkpoint activity. If they do, the configuration changes in this post are the right levers.

WAL Archiving Interaction

If you are running WAL archiving for point-in-time recovery (PITR), checkpoint and WAL tuning interact in a critical way: WAL segments are not removed after a checkpoint until the archiver confirms they have been successfully archived. Under high WAL generation rates, a slow or overloaded archiver can cause WAL to accumulate on the primary past max_wal_size, triggering demand checkpoints even with a well-tuned configuration.

Monitor the archiver with:

sql
SELECT
  archived_count,
  last_archived_wal,
  last_archived_time,
  failed_count,
  last_failed_wal,
  last_failed_time
FROM pg_stat_archiver;

A rising failed_count or a last_archived_time that lags the current time by more than a few minutes under load indicates an archiving bottleneck. Common fixes include using a faster archive destination, increasing archive parallelism (via archive_library in PostgreSQL 15+ or a custom archive_command with parallel tools like pgbackrest), or simply ensuring the archiver host has adequate network bandwidth.

Tip

When using pgbackrest for WAL archiving, set process-max in your pgbackrest configuration to match the number of WAL segments being generated per second. A single-threaded archiver is almost always the bottleneck on high-throughput primaries writing 100+ MB/s of WAL.

A Practical Starting Configuration

For a write-heavy OLTP system with 32 GB RAM and NVMe storage, a reasonable starting point:

ini
# postgresql.conf — checkpoint and WAL tuning for high throughput

# WAL buffers (restart required)
wal_buffers = 64MB

# WAL size limits
min_wal_size = 1GB
max_wal_size = 8GB

# Checkpoint timing
checkpoint_timeout = 10min
checkpoint_completion_target = 0.9
checkpoint_warning = 30s

# WAL compression (PostgreSQL 15+: consider zstd)
wal_compression = on

# Background writer
bgwriter_lru_maxpages = 200
bgwriter_delay = 50ms

After applying these changes (reload for most, restart for wal_buffers), allow the system to run under production load for at least one hour before evaluating. Reset pg_stat_bgwriter stats first so you have a clean baseline:

sql
SELECT pg_stat_reset();
-- Wait 1 hour under load, then:
SELECT
  checkpoints_timed,
  checkpoints_req,
  round(checkpoints_req::numeric / nullif(checkpoints_timed + checkpoints_req, 0) * 100, 1) AS pct_req,
  round(checkpoint_write_time / 1000 / nullif(checkpoints_timed + checkpoints_req, 0), 2) AS avg_write_s,
  round(checkpoint_sync_time  / 1000 / nullif(checkpoints_timed + checkpoints_req, 0), 2) AS avg_sync_s
FROM pg_stat_bgwriter;

Target: pct_req below 5%, avg_write_s well within your checkpoint_timeout window, and avg_sync_s under 2 seconds on NVMe (under 10 seconds on spinning disk).

Key Takeaways
  • Demand checkpoints (checkpoints_req) are the primary cause of I/O-induced latency spikes — eliminate them by raising max_wal_size.
  • checkpoint_completion_target = 0.9 spreads checkpoint I/O smoothly across the interval; verify it is set correctly, especially on PostgreSQL 13 and earlier where the default is 0.5.
  • Full-page writes inflate WAL volume after every checkpoint; longer checkpoint_timeout intervals reduce this overhead without sacrificing durability.
  • wal_compression = on reduces WAL I/O and archiving bandwidth with minimal CPU cost — enable it unless you are CPU-bound.
  • Size wal_buffers = 64MB explicitly rather than relying on the auto-tuned default, which can be too small for high-concurrency workloads.
  • Tune bgwriter_lru_maxpages to keep buffers_backend writes low — backends forced to write their own dirty pages are a latency path you do not want.
  • Use iostat -xz 2 correlated with PostgreSQL log timestamps to confirm checkpoint I/O is the root cause before tuning.
  • WAL archiving bottlenecks can cause demand checkpoints even with correct WAL size settings — monitor pg_stat_archiver alongside pg_stat_bgwriter.

Manage Your PostgreSQL Infrastructure with JusDB

Tuning checkpoint and WAL parameters correctly requires understanding your specific workload patterns, storage characteristics, and recovery requirements — and the right values shift as your data and traffic grow. JusDB provides managed PostgreSQL with intelligent defaults, continuous performance monitoring, and expert tuning recommendations built in. Whether you are running a high-throughput OLTP system or a mixed analytics workload, JusDB surfaces the metrics that matter — checkpoints_req, bgwriter efficiency, WAL generation rate, and archiver lag — so you can act before latency spikes reach your users.

Explore JusDB to see how managed PostgreSQL with proactive performance tuning can simplify your database operations.

Share this article