A PostgreSQL database that performs beautifully under light load can grind to a halt the moment checkpoint I/O competes with your application writes. At 50,000 transactions per minute, the default checkpoint settings ship PostgreSQL into a wall of disk latency that no amount of RAM or faster CPUs will fix. The culprit is almost always misconfigured checkpoints and WAL parameters that were never revisited after initial deployment. Getting these numbers right is one of the highest-leverage tuning exercises available to a PostgreSQL DBA, and the changes are almost always non-destructive.
- Set
checkpoint_completion_target = 0.9to spread checkpoint I/O across 90% of the checkpoint interval instead of spiking at the end. - Raise
max_wal_size(try 4 GB–16 GB) to reduce checkpoint frequency under heavy write loads. - Set
min_wal_sizeto a meaningful value (e.g., 1 GB) to prevent WAL segment recycling churn. - Size
wal_buffersto 64 MB for most workloads; the default (auto-tuned to 1/32 ofshared_buffers) is often too small. - Enable
wal_compressionto reduce WAL volume at the cost of a small CPU overhead. - Monitor
pg_stat_bgwriterto distinguish timed checkpoints from demand checkpoints. - Watch
checkpoints_req— a rising count means WAL is filling faster than your interval allows.
Background: Why Checkpoints and WAL Matter
PostgreSQL uses a Write-Ahead Log (WAL) to guarantee durability. Every change to a data page is first written sequentially to WAL before it is applied to the actual heap files. This design means crash recovery only needs to replay WAL from the last known-good checkpoint, rather than verify every data page on disk.
A checkpoint is the process of flushing all dirty shared buffer pages to disk and
recording a consistent point in WAL from which recovery can start. After a successful checkpoint,
PostgreSQL can reclaim the WAL segments that preceded it. This sounds routine, but the flush
operation is inherently expensive: every dirty page in shared_buffers must be written
to disk. On a write-heavy OLTP system with hundreds of gigabytes of shared buffers, a naive
checkpoint can saturate disk I/O for seconds at a time, causing application latency spikes that
are nearly impossible to diagnose without knowing where to look.
Two forces trigger a checkpoint:
- Time-based (timed):
checkpoint_timeoutelapses (default 5 minutes). - Demand-based (required): WAL grows past
max_wal_size(default 1 GB).
Demand checkpoints are dangerous. They happen outside the scheduled interval, often at the worst
possible moment — peak traffic — and they cannot be spread out gracefully by
checkpoint_completion_target. The entire goal of WAL and checkpoint tuning is to
eliminate demand checkpoints and make timed checkpoints as smooth as possible.
How Checkpoints Work Internally
When a checkpoint begins, PostgreSQL identifies every dirty page in shared_buffers.
Rather than writing them all at once (which would cause a large I/O spike), it spreads the writes
over a target window defined by checkpoint_completion_target. The checkpointer
process aims to finish writing all dirty pages within
checkpoint_completion_target × checkpoint_timeout seconds — for example,
0.9 × 300s = 270s out of a 300-second interval.
Full-page writes amplify WAL volume significantly. After a checkpoint, the first
time PostgreSQL modifies any data page it writes the entire 8 KB page into WAL (not just the
changed bytes). This protects against partial-page writes on crash, but it means WAL volume can
be 2–5x larger than the raw change volume in the period immediately following a checkpoint. Tuning
checkpoint_timeout to allow longer intervals directly reduces the fraction of time
spent in the post-checkpoint full-page-write window, which in turn reduces total WAL volume and
the I/O tax on your storage subsystem.
Do not disable full_page_writes unless you are running on storage that guarantees
atomic 8 KB writes (very rare). Disabling it risks silent data corruption after a crash. The
correct lever for reducing full-page-write overhead is lengthening the checkpoint interval, not
disabling the protection.
WAL Configuration Parameters
These parameters live in postgresql.conf and require a reload (or restart, where
noted):
# Require restart
wal_buffers = 64MB # default: -1 (auto, usually ~4 MB for most configs)
# Require reload
max_wal_size = 8GB # default: 1GB — max WAL before a forced checkpoint
min_wal_size = 1GB # default: 80MB — keep at least this much WAL pre-allocated
checkpoint_completion_target = 0.9 # default: 0.9 in PG 14+, 0.5 in older versions
checkpoint_timeout = 10min # default: 5min — how often to checkpoint on time
wal_compression = on # default: off — compress full-page images in WALwal_buffers
WAL buffers are a small shared-memory ring buffer where WAL records are staged before being
flushed to disk. The auto-tuned default is 1/32 of shared_buffers, capped at
64 MB. If your shared_buffers is 8 GB, the auto value is 256 MB — which is
excessive. If it is 512 MB, the auto value is only 16 MB — which can cause contention on
high-concurrency workloads. Setting wal_buffers = 64MB explicitly is a safe,
well-tested value for most OLTP deployments. This parameter requires a server restart.
max_wal_size and min_wal_size
max_wal_size is the soft limit on how much WAL PostgreSQL will accumulate before
forcing a demand checkpoint. The default of 1 GB is far too low for write-heavy workloads — it
can force dozens of demand checkpoints per hour on a busy system. Raising it to 4 GB–16 GB gives
the checkpointer room to operate on schedule.
min_wal_size controls how many WAL segments are kept pre-allocated on disk. When
PostgreSQL recycles a WAL segment, it renames it rather than deleting and recreating it, which
avoids filesystem allocation overhead. Setting min_wal_size = 1GB ensures there is
always a pool of pre-allocated segments available, preventing recycling churn during write bursts.
wal_compression
wal_compression = on compresses full-page images inside WAL records using LZ4 (or
zstd in PostgreSQL 15+). Since full-page images dominate WAL volume after a checkpoint, this can
reduce WAL size by 40–70% on typical data, directly lowering both WAL write I/O and WAL archiving
bandwidth. The CPU cost is measurable but rarely a bottleneck — most systems see less than 1%
overhead.
If you are running PostgreSQL 15 or later, prefer wal_compression = zstd over
lz4. zstd achieves better compression ratios at comparable speed, which matters
most during WAL archiving to object storage where bandwidth costs are real.
Checkpoint Tuning in Practice
Start by diagnosing the current state. Run this query to see how your checkpoints are distributed:
SELECT
checkpoints_timed,
checkpoints_req,
checkpoint_write_time,
checkpoint_sync_time,
buffers_checkpoint,
buffers_clean,
buffers_backend,
maxwritten_clean,
stats_reset
FROM pg_stat_bgwriter;Key columns to examine:
checkpoints_timed: Checkpoints triggered bycheckpoint_timeout. You want this to be the dominant number.checkpoints_req: Checkpoints triggered by WAL size or explicitCHECKPOINTcommands. A non-zero and rising value is a red flag — yourmax_wal_sizeis too small.checkpoint_write_time/checkpoint_sync_time: Milliseconds spent writing dirty buffers vs. syncing them. Highsync_timeindicates storage I/O pressure.buffers_backend: Pages written directly by backend processes (not the checkpointer or bgwriter). A high value means backends are being forced to flush their own dirty pages — a sign of bgwriter under-configuration or memory pressure.maxwritten_clean: How many times bgwriter hit itsbgwriter_lru_maxpageslimit and stopped cleaning. A non-zero value suggests bgwriter is not keeping up with the dirty page rate.
Setting checkpoint_completion_target
PostgreSQL 14 changed the default from 0.5 to 0.9, which was a
meaningful improvement. If you are on PostgreSQL 13 or earlier, verifying this parameter is the
single most impactful checkpoint change you can make. At 0.5, the checkpointer
attempts to finish in the first half of the interval, creating an I/O spike in the second half.
At 0.9, the I/O is spread across nearly the entire interval.
checkpoint_completion_target = 0.9Watching for checkpoint_warning
PostgreSQL logs a warning whenever a checkpoint takes longer than checkpoint_warning
seconds (default: 30 seconds). Watch your logs for lines like:
LOG: checkpoint request at ...; write: 45.823 s, sync: 2.109 s, total: 47.932 s;
distance: 4096 MB, estimate: 3840 MB
WARNING: checkpoints are occurring too frequently (23 seconds apart)
"Occurring too frequently" means demand checkpoints are firing before the timed interval. The fix
is to raise max_wal_size until the warning disappears and
checkpoints_req in pg_stat_bgwriter stops growing.
bgwriter_lru_maxpages
The background writer (bgwriter) proactively cleans dirty pages from shared buffers so that when
a backend needs a free buffer, one is available without an on-demand flush. The parameter
bgwriter_lru_maxpages (default: 100) caps how many pages bgwriter can clean per
cycle. On write-heavy systems, increasing this to 200–500 can reduce buffers_backend
writes significantly:
bgwriter_lru_maxpages = 200
bgwriter_delay = 50ms # default: 200ms — how often bgwriter wakes up
Do not set bgwriter_lru_maxpages excessively high (e.g., 1000+). An overactive
bgwriter competes with the checkpointer and application I/O, potentially causing more harm than
good. Increase it incrementally and measure buffers_backend and
maxwritten_clean after each change.
Monitoring Checkpoint I/O with iostat
pg_stat_bgwriter gives cumulative counters, but for real-time visibility into
checkpoint I/O impact, iostat is invaluable:
# Report device utilization every 2 seconds, extended statistics
iostat -xz 2Focus on the device hosting your PostgreSQL data directory. Key metrics:
%util: Device saturation. Values above 80–90% during checkpoints confirm I/O is the bottleneck.await(orr_await/w_await): Average I/O latency. Spikes here correspond to checkpoint I/O waves.wMB/s: Write throughput. A distinctive pattern of writes that peak at regular intervals (corresponding to yourcheckpoint_timeout) confirms you are seeing checkpoint I/O.
Correlate iostat output with your application latency metrics and PostgreSQL log
timestamps to confirm that I/O spikes align with checkpoint activity. If they do, the
configuration changes in this post are the right levers.
WAL Archiving Interaction
If you are running WAL archiving for point-in-time recovery (PITR), checkpoint and WAL tuning
interact in a critical way: WAL segments are not removed after a checkpoint until the archiver
confirms they have been successfully archived. Under high WAL generation rates, a slow or
overloaded archiver can cause WAL to accumulate on the primary past max_wal_size,
triggering demand checkpoints even with a well-tuned configuration.
Monitor the archiver with:
SELECT
archived_count,
last_archived_wal,
last_archived_time,
failed_count,
last_failed_wal,
last_failed_time
FROM pg_stat_archiver;
A rising failed_count or a last_archived_time that lags the current
time by more than a few minutes under load indicates an archiving bottleneck. Common fixes include
using a faster archive destination, increasing archive parallelism (via archive_library
in PostgreSQL 15+ or a custom archive_command with parallel tools like
pgbackrest), or simply ensuring the archiver host has adequate network bandwidth.
When using pgbackrest for WAL archiving, set process-max in your pgbackrest
configuration to match the number of WAL segments being generated per second. A single-threaded
archiver is almost always the bottleneck on high-throughput primaries writing 100+ MB/s of WAL.
A Practical Starting Configuration
For a write-heavy OLTP system with 32 GB RAM and NVMe storage, a reasonable starting point:
# postgresql.conf — checkpoint and WAL tuning for high throughput
# WAL buffers (restart required)
wal_buffers = 64MB
# WAL size limits
min_wal_size = 1GB
max_wal_size = 8GB
# Checkpoint timing
checkpoint_timeout = 10min
checkpoint_completion_target = 0.9
checkpoint_warning = 30s
# WAL compression (PostgreSQL 15+: consider zstd)
wal_compression = on
# Background writer
bgwriter_lru_maxpages = 200
bgwriter_delay = 50ms
After applying these changes (reload for most, restart for wal_buffers), allow the
system to run under production load for at least one hour before evaluating. Reset
pg_stat_bgwriter stats first so you have a clean baseline:
SELECT pg_stat_reset();
-- Wait 1 hour under load, then:
SELECT
checkpoints_timed,
checkpoints_req,
round(checkpoints_req::numeric / nullif(checkpoints_timed + checkpoints_req, 0) * 100, 1) AS pct_req,
round(checkpoint_write_time / 1000 / nullif(checkpoints_timed + checkpoints_req, 0), 2) AS avg_write_s,
round(checkpoint_sync_time / 1000 / nullif(checkpoints_timed + checkpoints_req, 0), 2) AS avg_sync_s
FROM pg_stat_bgwriter;
Target: pct_req below 5%, avg_write_s well within your
checkpoint_timeout window, and avg_sync_s under 2 seconds on NVMe
(under 10 seconds on spinning disk).
- Demand checkpoints (
checkpoints_req) are the primary cause of I/O-induced latency spikes — eliminate them by raisingmax_wal_size. checkpoint_completion_target = 0.9spreads checkpoint I/O smoothly across the interval; verify it is set correctly, especially on PostgreSQL 13 and earlier where the default is 0.5.- Full-page writes inflate WAL volume after every checkpoint; longer
checkpoint_timeoutintervals reduce this overhead without sacrificing durability. wal_compression = onreduces WAL I/O and archiving bandwidth with minimal CPU cost — enable it unless you are CPU-bound.- Size
wal_buffers = 64MBexplicitly rather than relying on the auto-tuned default, which can be too small for high-concurrency workloads. - Tune
bgwriter_lru_maxpagesto keepbuffers_backendwrites low — backends forced to write their own dirty pages are a latency path you do not want. - Use
iostat -xz 2correlated with PostgreSQL log timestamps to confirm checkpoint I/O is the root cause before tuning. - WAL archiving bottlenecks can cause demand checkpoints even with correct WAL size settings — monitor
pg_stat_archiveralongsidepg_stat_bgwriter.
Manage Your PostgreSQL Infrastructure with JusDB
Tuning checkpoint and WAL parameters correctly requires understanding your specific workload
patterns, storage characteristics, and recovery requirements — and the right values shift as your
data and traffic grow. JusDB provides managed PostgreSQL with intelligent defaults, continuous
performance monitoring, and expert tuning recommendations built in. Whether you are running a
high-throughput OLTP system or a mixed analytics workload, JusDB surfaces the metrics that matter
— checkpoints_req, bgwriter efficiency, WAL generation rate, and archiver lag — so
you can act before latency spikes reach your users.
Explore JusDB to see how managed PostgreSQL with proactive performance tuning can simplify your database operations.