PostgreSQL

PostgreSQL Architecture Deep Dive: Process Model, MVCC, WAL & Replication Explained

Walk through PostgreSQL's multi-process architecture, shared/local memory layout, page-organized storage, MVCC tuple versioning, the WAL write path, the query execution pipeline, and physical + logical replication — all with ASCII flow diagrams that show how data and control actually move through the system.

JusDB Team
May 31, 2026
18 min read
1 views

PostgreSQL is often described as "the most advanced open-source relational database," but its architecture is the reason that description has held up across two decades of cloud, NoSQL, and analytics disruption. The same multi-process, MVCC-backed, WAL-anchored design that POSTGRES95 shipped is what runs Instagram, Reddit, GitLab, and the entire AWS RDS PostgreSQL fleet today. Understanding that architecture is the difference between a DBA who tunes parameters by superstition and one who can reason about what a running cluster is actually doing under load. This guide walks through every major subsystem — process model, memory layout, storage engine, MVCC, WAL, the query execution pipeline, and replication — with ASCII flow diagrams that show how data and control move through the system.

TL;DR
  • PostgreSQL uses a multi-process (not multi-threaded) architecture: a postmaster forks one backend process per client connection plus auxiliary processes for background work.
  • Memory is split into shared memory (buffer pool, WAL buffers, CLOG, lock tables) and per-backend local memory (work_mem, temp_buffers, catalog cache).
  • Storage is page-organized at 8 KB blocks; tables and indexes live in heap files with a tuple-per-row layout that enables MVCC without locks.
  • MVCC (Multi-Version Concurrency Control) gives readers a snapshot view without blocking writers — at the cost of bloat that VACUUM has to reclaim.
  • The WAL (Write-Ahead Log) is the durability backbone: every change hits WAL before data pages flush, enabling crash recovery, streaming replication, and PITR.
  • A query travels through parser → analyzer → rewriter → planner → executor, with the planner doing cost-based optimization over candidate plan trees.
  • Replication is built on the same WAL stream: physical replication ships raw WAL, logical replication decodes WAL into row-level change events.

The Process Model: Postmaster, Backends, and Background Workers

PostgreSQL does not use threads. Every client connection gets its own OS process, forked from a single supervisor called the postmaster. This choice is older than threading libraries were mature on Unix, but it has stuck around because process isolation makes debugging easier, a crashing backend cannot corrupt sibling backend memory, and the kernel handles scheduling and memory protection for free.

When you start postgres, the postmaster is the first thing that runs. It opens the listening socket, allocates shared memory, launches a small fleet of auxiliary processes, and then sits in an accept() loop waiting for clients.

text
PostgreSQL Process Tree

                       ┌──────────────────────┐
                       │      postmaster      │  parent supervisor
                       │      (pid 1234)      │  listens on :5432
                       └──────────┬───────────┘
                                  │ fork()
       ┌──────────┬───────────────┼───────────────┬──────────┬──────────┐
       ▼          ▼               ▼               ▼          ▼          ▼
  ┌────────┐ ┌────────┐    ┌────────────┐  ┌────────┐  ┌────────┐ ┌──────────┐
  │ backend│ │ backend│    │ background │  │  WAL   │  │ check- │ │ autovac- │
  │ (app1) │ │ (app2) │ …  │   writer   │  │ writer │  │ pointer│ │  uum     │
  └───┬────┘ └───┬────┘    └────────────┘  └────────┘  └────────┘ └──────────┘
      │ shared memory + WAL writes
      ▼
  ┌──────────────────────────────────────────────────────────────────────┐
  │                    Shared Memory Region (mmap)                        │
  │  shared_buffers │ WAL buffers │ CLOG │ lock table │ proc array │ …    │
  └──────────────────────────────────────────────────────────────────────┘

The major auxiliary processes you should know:

  • background writer (bgwriter) — proactively flushes dirty buffers from the buffer pool to disk in the background, smoothing out I/O so checkpoints do less work.
  • WAL writer — flushes WAL buffers to the WAL files on disk on a timer or when a backend commits. The commit path waits for this.
  • checkpointer — performs full checkpoints, the operation that flushes every dirty buffer and marks a recovery starting point in the WAL.
  • autovacuum launcher + workers — scans pg_stat_all_tables for tables that have accumulated dead tuples and forks worker processes to VACUUM and ANALYZE them.
  • logical replication launcher — when logical replication is configured, this launches apply workers for each subscription.
  • stats collector / cumulative statistics — in PostgreSQL 15+ this moved into shared memory; in older versions it ran as a separate process collecting UDP packets from backends.
  • archiver — when archive_mode = on, this process runs archive_command to copy completed WAL segments to a long-term archive (S3, GCS, NFS).

Per-connection backends are forked on demand. Each backend is its own OS process, with its own private memory for sorting, hashing, the catalog cache, and the prepared statement plans for that session. This is why max_connections matters so much: every connection is real memory, and the kernel context-switch cost is non-zero. Production setups front PostgreSQL with a pooler like PgBouncer precisely because direct connections at scale do not pay back the memory cost.

Tip

Run SELECT pid, backend_type, application_name, state, query FROM pg_stat_activity; on any running cluster to see exactly which processes are alive — backends, autovacuum workers, walsender, walreceiver, and all the auxiliary singletons each show up with a recognizable backend_type.

Memory Layout: Shared Buffers and Per-Backend Workspace

PostgreSQL splits memory into two clearly demarcated regions. Shared memory is allocated once at startup by the postmaster, mapped into every backend's address space, and protected by lightweight locks (LWLocks) for concurrent access. Local memory is private to each backend and recycled when the connection ends.

text
Memory Layout (single instance)

  ┌────────────────────────────────────────────────────────────────────────┐
  │                     SHARED MEMORY (allocated once)                      │
  ├────────────────────────────────────────────────────────────────────────┤
  │  shared_buffers           │  buffer pool — cached 8 KB pages of tables  │
  │  (default 128 MB,         │  and indexes. Default sizing: 25% of RAM.   │
  │   typical prod: 25% RAM)  │                                              │
  ├───────────────────────────┼─────────────────────────────────────────────┤
  │  wal_buffers              │  WAL records staged here before flush       │
  │  (typical 16 MB)          │  to pg_wal/ on commit                       │
  ├───────────────────────────┼─────────────────────────────────────────────┤
  │  CLOG (commit log)        │  transaction status (in-progress, committed,│
  │                           │  aborted) for every XID — backs MVCC        │
  ├───────────────────────────┼─────────────────────────────────────────────┤
  │  proc array               │  PID/XID of every active backend            │
  ├───────────────────────────┼─────────────────────────────────────────────┤
  │  lock table               │  heavyweight locks (table, row, advisory)   │
  ├───────────────────────────┼─────────────────────────────────────────────┤
  │  predicate locks (SSI)    │  serializable isolation tracking            │
  ├───────────────────────────┼─────────────────────────────────────────────┤
  │  visibility map / FSM     │  per-relation page-status bitmaps           │
  └────────────────────────────────────────────────────────────────────────┘

  ┌────────────────────────────────────────────────────────────────────────┐
  │                    LOCAL MEMORY (per backend, private)                  │
  ├────────────────────────────────────────────────────────────────────────┤
  │  work_mem        │  sort, hash, and merge-join workspace per operation  │
  │  (default 4 MB)  │  — multiple per query!                                │
  ├──────────────────┼───────────────────────────────────────────────────────┤
  │  maintenance_    │  VACUUM, CREATE INDEX, ALTER TABLE ADD FK             │
  │  work_mem        │                                                       │
  │  (default 64 MB) │                                                       │
  ├──────────────────┼───────────────────────────────────────────────────────┤
  │  temp_buffers    │  cache of temporary table pages                      │
  │  (default 8 MB)  │                                                       │
  ├──────────────────┼───────────────────────────────────────────────────────┤
  │  catcache /      │  cached system catalog rows (per backend)            │
  │  relcache        │                                                       │
  └────────────────────────────────────────────────────────────────────────┘

The single most important parameter here is shared_buffers. On a 32 GB box with a write-heavy OLTP workload, sizing it to 8 GB (25%) gives PostgreSQL enough pool to cache the hot working set without crowding out the OS page cache. PostgreSQL actually relies on a double-buffer strategy: shared_buffers holds the InnoDB-equivalent buffer pool, but the kernel page cache also caches the underlying file data. This is why pushing shared_buffers above 40% of RAM often hurts rather than helps — you starve the OS cache that the rest of the system needs.

work_mem is the most misunderstood. It is the amount of memory per sort or hash operation, per backend. A complex query with three hash joins and a sort might allocate 4 × work_mem just for that one query. Setting it too high on a high-concurrency box invites OOM kills; setting it too low forces sorts to spill to base/pgsql_tmp on disk. The right value depends on workload shape, not RAM alone.

Storage Layout: Heaps, Pages, Tuples

PostgreSQL stores every table and index as a file on disk under $PGDATA/base/<db_oid>/<relfilenode>. Each file is divided into 8 KB pages (the default block size, set at initdb time and immutable afterward). Pages are read into shared_buffers in 8 KB units; this is the granularity of every buffered I/O operation.

text
Heap Page Layout (8 KB)

  ┌─────────────────────────────────────────────────────────────────────┐
  │ PageHeaderData (24 bytes)                                            │
  │   lsn │ checksum │ flags │ pd_lower │ pd_upper │ pd_special │ …      │
  ├─────────────────────────────────────────────────────────────────────┤
  │ ItemIdData[]  (line pointers, 4 bytes each, grows downward) →        │
  │   ┌──┬──┬──┬──┬──┬──┐                                                │
  │   │1 │2 │3 │4 │5 │6 │   each = (offset, length, flags)               │
  │   └──┴──┴──┴──┴──┴──┘                                                │
  │                                                                       │
  │                            ← free space →                            │
  │                                                                       │
  │                  ┌─────────────────────────────────┐                 │
  │                  │ Tuple 6 (heap row + header)     │                 │
  │                  ├─────────────────────────────────┤                 │
  │                  │ Tuple 5                         │                 │
  │                  ├─────────────────────────────────┤                 │
  │                  │ Tuple 4                         │                 │
  │                  ├─────────────────────────────────┤                 │
  │                  │ Tuple 3                         │  tuples grow    │
  │                  ├─────────────────────────────────┤  upward from    │
  │                  │ Tuple 2                         │  the end        │
  │                  ├─────────────────────────────────┤                 │
  │                  │ Tuple 1                         │                 │
  │                  └─────────────────────────────────┘                 │
  └─────────────────────────────────────────────────────────────────────┘

The clever piece is the line pointer array. Tuples are written from the bottom of the page upward; line pointers are written from the top down. They meet in the middle, and the page is full when they collide. Each line pointer encodes the offset and length of one tuple — when a tuple is updated in place under HOT (Heap-Only Tuple) updates, only the line pointer's target changes, not the index entries pointing to it. This is how PostgreSQL avoids index bloat on UPDATE-heavy workloads, when there is enough free space on the same page.

Each tuple has a tuple header (23 bytes minimum) carrying the MVCC fields: xmin (creating transaction ID), xmax (deleting/updating transaction ID), cmin/cmax (command IDs for same-transaction visibility), t_ctid (pointer to the next version of this row), and a bitmask of nullability and toast hints. This header is what makes MVCC work.

TOAST: when rows don't fit in 8 KB

PostgreSQL cannot store a row larger than 8 KB inline. When a row's variable-length columns push it past the threshold, the columns are compressed (pglz or lz4 since PG 14), and if still too large, sliced into chunks stored in a sidecar TOAST table named pg_toast_<relfilenode>. The main row keeps only a 19-byte pointer. This is transparent to queries but matters for tuple size accounting and VACUUM.

MVCC: How Readers Never Block Writers

Multi-Version Concurrency Control is the foundation of PostgreSQL's concurrency model. The promise is simple: readers never block writers, and writers never block readers. The way it's implemented is what produces the operational surprises (bloat, wraparound, long-running transactions holding up VACUUM).

text
MVCC Update Flow — UPDATE creates a new tuple version

      Time T1: tx 100 inserts row (id=5, name='Alice', balance=100)
      ┌──────────────────────────────────────────────────┐
      │ Page 42, line ptr 1 → Tuple V1                    │
      │   xmin=100, xmax=0, ctid=(42,1)                  │
      │   id=5, name='Alice', balance=100                │
      └──────────────────────────────────────────────────┘

      Time T2: tx 200 issues UPDATE balance=150 WHERE id=5
              and tx 150 is concurrently reading id=5
      ┌──────────────────────────────────────────────────┐
      │ Page 42, line ptr 1 → Tuple V1                    │
      │   xmin=100, xmax=200, ctid=(42,2)  ← marked dead │
      │   id=5, name='Alice', balance=100                │
      ├──────────────────────────────────────────────────┤
      │ Page 42, line ptr 2 → Tuple V2                    │
      │   xmin=200, xmax=0, ctid=(42,2)                  │
      │   id=5, name='Alice', balance=150                │
      └──────────────────────────────────────────────────┘
      tx 150 reads V1 (sees balance=100); tx 200's COMMIT
      makes V2 visible to transactions starting after T2.

      Time T3: VACUUM runs. V1's xmin/xmax pair are both
      now older than every active snapshot, so V1 is
      reclaimed and the page free space recovered.

Three things to internalize from this flow:

  1. An UPDATE is logically a DELETE+INSERT. The old tuple version is marked dead (xmax set) but kept on disk until VACUUM reclaims it. This is why update-heavy tables grow even when the row count is stable.
  2. Visibility is determined per snapshot. Every transaction takes a snapshot at start (or per statement, depending on isolation level) recording which XIDs are committed, aborted, or in-progress. A tuple is visible if its xmin is in the snapshot's committed set and its xmax is either zero or not in the committed set.
  3. A long-running transaction blocks VACUUM. If tx 1000 starts and sits idle for an hour, VACUUM cannot reclaim any tuple whose xmax is greater than the oldest active XID — even if those tuples are dead from every other transaction's view. This is the source of "idle in transaction" being the most operationally dangerous state in PostgreSQL.

VACUUM (and the autovacuum subsystem) walks heap pages looking for tuples whose xmax is older than the cluster-wide horizon. It marks those tuples' line pointers as reusable, updates the visibility map, and updates the free space map so subsequent INSERTs can reuse the space. VACUUM does not return space to the OS; that requires VACUUM FULL (which rewrites the table and takes an ACCESS EXCLUSIVE lock) or pg_repack (online table rewrite with minimal locking).

Warning

PostgreSQL's transaction ID space is 32 bits, which gives ~2 billion XIDs before wraparound. The autovacuum system runs a special "freeze" operation that rewrites old tuples with a sentinel FrozenTransactionId so they are visible to everyone forever. If autovacuum cannot keep up — because long-running transactions, lock contention, or sheer write volume block it — the cluster will shut down to prevent data corruption when it gets within 1 million XIDs of wraparound. Monitor pg_stat_database.datfrozenxid against autovacuum_freeze_max_age on every production cluster.

The Write-Ahead Log: Durability and the Replication Backbone

Before any 8 KB data page is flushed to its heap file, PostgreSQL writes a record describing the change to the Write-Ahead Log. The WAL is a sequential, append-only stream of 16 MB segment files in $PGDATA/pg_wal. On a crash, PostgreSQL replays WAL from the last checkpoint forward to bring the data files back to a consistent state. The same WAL stream feeds streaming replication, point-in-time recovery, and logical decoding.

text
The WAL Write Path on COMMIT

  ┌────────────┐  1. INSERT/UPDATE/DELETE issued
  │  backend   │
  │ (your tx)  │  2. Modify in-memory page in shared_buffers
  └─────┬──────┘     (page marked DIRTY)
        │
        │ 3. Generate WAL record describing the change
        ▼
  ┌──────────────────────────────────────────────────────┐
  │           wal_buffers (shared memory)                 │
  │  ┌──────┬──────┬──────┬──────┬──────┐                │
  │  │ rec1 │ rec2 │ rec3 │ rec4 │ rec5 │  in-memory     │
  │  └──────┴──────┴──────┴──────┴──────┘  ring buffer   │
  └────────────────────┬─────────────────────────────────┘
                       │ 4. COMMIT flushes wal_buffers
                       │    via pg_fsync() — backend BLOCKS
                       │    here until fsync returns
                       ▼
  ┌──────────────────────────────────────────────────────┐
  │            pg_wal/ (on disk, 16 MB segments)          │
  │   000000010000001A000000B3                            │
  │   000000010000001A000000B4  ← currently writing       │
  │   000000010000001A000000B5                            │
  └────────────┬──────────────────────────┬──────────────┘
               │                          │
               ▼ 5a. fed to streaming      ▼ 5b. fed to archiver
                    replicas via                (archive_command
                    walsender →                  copies to S3)
                    walreceiver

This is what makes PostgreSQL durable. The data page itself does not need to be on disk at commit time — only its WAL record does. The kernel can lazily flush the dirty data page at any later time (or the checkpointer flushes it on the next checkpoint). If the server crashes between COMMIT and the dirty page flush, recovery replays the WAL record on startup and reconstructs the change.

Three knobs control the durability/performance trade-off:

  • synchronous_commit — when on (default), COMMIT waits for the WAL flush. When off, COMMIT returns immediately and a background flush happens within wal_writer_delay ms. The trade-off is up to ~600 ms of committed work that disappears on crash.
  • fsync — turning this off lies to the application about durability but can double write throughput. Never off in production, ever.
  • wal_compression — compresses full-page writes (the first time a page is touched after a checkpoint, the entire 8 KB page is written to WAL to protect against torn writes). Saves disk and replication bandwidth on write-heavy workloads.

Query Execution: From SQL String to Result Set

When a backend receives a SQL statement, it runs through a five-stage pipeline before any rows come back. Understanding the stages explains why EXPLAIN (ANALYZE, BUFFERS) output looks the way it does and why "plan stability" matters in production.

text
Query Execution Pipeline

  ┌──────────────────────────────────────────────────────────────────────┐
  │  1. PARSER                                                            │
  │     SQL text → parse tree (raw AST, no semantic checks yet)           │
  │     Catches syntax errors. Output: RawStmt list.                     │
  └────────────────────────────┬─────────────────────────────────────────┘
                               │
                               ▼
  ┌──────────────────────────────────────────────────────────────────────┐
  │  2. ANALYZER                                                          │
  │     parse tree → Query tree (semantic AST)                            │
  │     Resolves table/column names against catalog, type-checks every   │
  │     expression, expands * into column lists, processes views.        │
  │     Catches "relation does not exist", "column ambiguous", type      │
  │     mismatches.                                                       │
  └────────────────────────────┬─────────────────────────────────────────┘
                               │
                               ▼
  ┌──────────────────────────────────────────────────────────────────────┐
  │  3. REWRITER                                                          │
  │     Query tree → rewritten Query tree                                 │
  │     Applies rules from pg_rewrite (view expansion, RULE-based logic).│
  │     A SELECT against a view becomes a SELECT against the underlying  │
  │     tables.                                                           │
  └────────────────────────────┬─────────────────────────────────────────┘
                               │
                               ▼
  ┌──────────────────────────────────────────────────────────────────────┐
  │  4. PLANNER / OPTIMIZER                                               │
  │     Query tree → PlanTree                                             │
  │     Generates candidate plans (join orders, index choices, scan      │
  │     methods). Uses pg_statistic + pg_class.reltuples to estimate     │
  │     row counts and cost. Picks the lowest-cost plan.                 │
  │     This is where seq_scan vs index_scan vs bitmap_scan is decided. │
  └────────────────────────────┬─────────────────────────────────────────┘
                               │
                               ▼
  ┌──────────────────────────────────────────────────────────────────────┐
  │  5. EXECUTOR                                                          │
  │     PlanTree → tuples streamed back to client                         │
  │     Implements the iterator model (volcano-style): each node pulls   │
  │     tuples from its children. Joins, sorts, hashes, aggregates run  │
  │     here. work_mem is consumed per memory-hungry node.               │
  └──────────────────────────────────────────────────────────────────────┘

The planner is where most query performance lives or dies. It uses statistics from pg_statistic (collected by ANALYZE) to estimate the row count at each step of the plan tree, multiplies by per-operation cost constants (seq_page_cost, random_page_cost, cpu_tuple_cost, etc.), and picks the plan with the lowest total cost. When statistics are stale or histograms are wrong, the planner picks bad plans — this is why auto_explain, pg_stat_statements, and regular ANALYZE on volatile tables matter.

Tip

The single most useful diagnostic command is EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) <query>. BUFFERS reveals whether the plan is hitting shared_buffers or going to disk (look for shared hit vs shared read), which often explains a 100× latency difference between two runs of the same query.

Replication: Physical (Streaming) and Logical Decoding

Replication in PostgreSQL is built directly on the WAL stream. There are two flavors that share the same plumbing:

Physical / Streaming Replication

The primary's WAL writer ships the raw byte stream of pg_wal to a standby's WAL receiver. The standby applies WAL records exactly the same way crash recovery does, producing a byte-for-byte identical copy of every data file. This is the fastest, most efficient form of replication — but the standby is a clone, not a transformer. You cannot replicate a subset of tables, you cannot apply DDL filtering, and you cannot upgrade major versions across replicas.

text
Physical Streaming Replication Flow

     PRIMARY                                       STANDBY
  ┌──────────────────┐                          ┌──────────────────┐
  │   client tx      │                          │   read query     │
  │   COMMIT         │                          │   (hot standby)  │
  └────────┬─────────┘                          └────────┬─────────┘
           │ WAL record                                  │
           ▼                                             ▼
  ┌──────────────────┐                          ┌──────────────────┐
  │   wal_buffers    │                          │   startup proc   │
  └────────┬─────────┘                          │   replays WAL    │
           │ flush                              └────────▲─────────┘
           ▼                                             │
  ┌──────────────────┐                          ┌──────────────────┐
  │   pg_wal/        │                          │   pg_wal/        │
  │  (current seg)   │                          │  (received seg)  │
  └────────┬─────────┘                          └────────▲─────────┘
           │                                             │
           ▼                                             │
  ┌──────────────────┐    TCP/9999 (WAL stream)  ┌──────┴───────────┐
  │   walsender      │ ────────────────────────► │   walreceiver    │
  │   (per replica)  │    or replication slot   │   (per primary)  │
  └──────────────────┘                          └──────────────────┘

  Replication slot on primary tracks restart_lsn so WAL is retained
  until standby has confirmed receipt — prevents WAL recycling.

Standbys can be set up in two modes:

  • Asynchronous (default) — primary commits, replica catches up when it can. Replica lag can be seconds or hours depending on load. If the primary dies before the replica catches up, those committed transactions are lost.
  • Synchronous — primary's COMMIT waits for at least one standby (or N of M, depending on synchronous_standby_names) to confirm receipt before returning to the client. Zero-data-loss failover at the cost of commit latency tied to network round-trip.

Logical Replication / Logical Decoding

Logical replication uses the same WAL stream but parses it through a logical decoding output plugin (the built-in pgoutput, or third-party plugins like wal2json for Debezium and decoderbufs for Kafka Connect). The output is row-level change events (INSERT, UPDATE, DELETE with before/after values), not raw 8 KB pages. The subscriber applies those events as normal SQL on its own copy of the schema.

This unlocks use cases physical replication cannot handle:

  • Replicating a subset of tables (publish only what you want).
  • Major version upgrades — replicate from PG 13 to PG 16, then cut over.
  • Cross-cluster replication (different instance types, different cloud providers).
  • Feeding CDC pipelines into Kafka, Kinesis, or data warehouses (Debezium, Maxwell-equivalent for PG).

The trade-off is throughput and operational complexity. Logical replication uses replication slots that retain WAL until the subscriber confirms receipt — an orphaned slot is the single most common cause of pg_wal filling up and taking an instance read-only.

Warning

Every logical replication slot on a production primary needs monitoring. If the subscriber dies or falls behind, the slot's restart_lsn stops advancing and WAL accumulates indefinitely. Set max_slot_wal_keep_size as a safety valve, and alert when pg_replication_slots.active = false with non-trivial pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn).

Putting It Together: The End-to-End Path of a Single UPDATE

To make the architecture concrete, here is what happens — in order — when a client issues a single UPDATE accounts SET balance = balance - 100 WHERE id = 42:

text
End-to-End UPDATE Lifecycle

  1. Client sends SQL over libpq protocol to postmaster:5432
  2. postmaster fork()s a backend (or reuses pooled connection)
  3. Backend parses → analyzes → rewrites → plans the statement
  4. Planner picks index scan on accounts_pkey
  5. Executor opens index, finds tuple at ctid (1024, 7)
  6. Acquires ROW EXCLUSIVE lock on table, row-level lock on tuple
  7. Reads page 1024 into shared_buffers (if not already cached)
  8. Generates new tuple version with updated balance
  9. Writes WAL record: heap_update at lsn 0/3A2B1C8
 10. Inserts new tuple into page 1024 (HOT update if fits, else new page)
 11. Marks old tuple's xmax = current_xid (logical delete)
 12. Page marked dirty in shared_buffers
 13. Client issues COMMIT
 14. Backend writes commit WAL record
 15. Backend issues fsync on WAL — BLOCKS HERE
 16. Backend returns OK to client (synchronous_commit=on)
 17. wal_writer eventually flushes any remaining wal_buffers
 18. checkpointer eventually flushes dirty page 1024 to base/16384/2837
 19. autovacuum eventually reclaims the dead old tuple version
 20. WAL segment containing the change is recycled / archived
 21. Replication slot on standby ships the WAL record
 22. Standby's walreceiver writes to its pg_wal/
 23. Standby's startup process replays heap_update, updating its copy

Every line in that flow is a tunable subsystem. The 100 ms p99 latency on UPDATE comes from step 15 (WAL fsync). The 4 GB nightly disk growth comes from step 11 (dead tuples) blocked by an idle-in-transaction session. The replication lag spike at 2 AM comes from step 8 generating a full-page write because step 18 just ran a checkpoint. Understanding which step is the bottleneck is what makes PostgreSQL tuning a craft rather than a guess.

What This Architecture Buys You

The same design choices that made POSTGRES95 the slowest database in benchmarks have, decades later, made PostgreSQL the most operationally predictable open-source database in production:

  • Multi-process isolation means a single bad backend cannot corrupt the whole instance. Crashes are isolated; OOM kills take down one query, not the cluster.
  • MVCC without locks for readers means analytics queries do not block transactional load. The trade-off is bloat that VACUUM has to chase, which is well-understood after twenty years of operational experience.
  • WAL-anchored durability means a clean snapshot of "everything committed up to this LSN" is a first-class concept. PITR, streaming replication, and logical CDC all fall out of this single design choice.
  • Pluggable storage and extension model means TimescaleDB, Citus, pgvector, PostGIS, and TIDB-style sharded clusters can extend PostgreSQL without forking it. The same architecture serves OLTP, time-series, vector search, and geospatial workloads.

None of this is magic. It is a coherent set of choices that work together: process per connection, shared memory for the buffer pool, MVCC tuples, append-only WAL, cost-based planning, slot-based replication. Once you can trace a write from libpq packet through wal_buffers to the standby's startup process, every monitoring metric and every tuning parameter starts to fit into a mental model that explains what the cluster is actually doing — not just what its symptoms look like.

Where to go next
  • Read PostgreSQL's own documentation on physical storage — chapter 65 of the official docs.
  • Run pg_stat_statements for a week on a production cluster and find your top 10 queries by total time; those are where your tuning leverage lives.
  • Set up a streaming replica with a replication slot, then deliberately stop the replica for an hour and watch pg_wal/ grow. This is the single best way to internalize what slots actually do.
  • Use the JusDB PostgreSQL Config Optimizer to generate a postgresql.conf tuned to your RAM and workload — every parameter is explained, so you can connect the values back to the subsystems above.

Share this article

JusDB Team

Official JusDB content team