Analytics & OLAP

Real-Time Analytics Pipeline: PostgreSQL to ClickHouse via CDC

Build a real-time analytics pipeline from PostgreSQL to ClickHouse using CDC — architecture, latency, and schema mapping

JusDB Team
February 2, 2026
12 min read
196 views

Running analytical queries directly against your PostgreSQL production database is a fast path to degraded performance, frustrated users, and an on-call page at 2 AM. The standard solution — periodically dumping data into a data warehouse — introduces latency that makes real-time dashboards impossible. Change Data Capture (CDC) solves both problems by streaming every insert, update, and delete from PostgreSQL to ClickHouse in near real-time, giving you analytical power without touching production. This post walks through building that pipeline end-to-end, covering every component from logical replication slots to ClickHouse materialized views.

TL;DR
  • Enable PostgreSQL logical replication and create a replication slot for Debezium
  • Deploy Debezium PostgreSQL connector to capture row-level changes as structured events
  • Route events through Kafka topics for buffering, replay, and fan-out
  • Use ClickHouse Kafka engine tables + materialized views to consume and land data
  • Use ReplacingMergeTree with a version column to handle upserts from CDC
  • End-to-end latency is typically under 5 seconds under normal load
  • Monitor consumer group lag in Kafka and use system.replication_queue in ClickHouse

Architecture Overview

The pipeline follows a five-stage linear flow:

  1. PostgreSQL — source of truth, with logical replication enabled
  2. Debezium — reads the WAL (Write-Ahead Log) and converts changes to structured JSON or Avro events
  3. Apache Kafka — durable, ordered message queue; decouples producer throughput from consumer processing speed
  4. ClickHouse Kafka engine — built-in connector that reads directly from Kafka topics
  5. ClickHouse MergeTree tables — the analytical landing zone, queryable by your BI tools

Each stage is independently scalable. Kafka absorbs bursts from PostgreSQL write spikes without overwhelming ClickHouse. ClickHouse Kafka engine tables can fan out to multiple materialized views for different analytical use cases. Debezium handles schema evolution and reconnection logic so you do not have to.

The managed alternative to self-hosting this stack is ClickHousePipe (covered at the end), which wraps most of this complexity into a single configuration step.


PostgreSQL Setup: Logical Replication

Debezium reads PostgreSQL's Write-Ahead Log using the logical replication protocol. You need to enable this at the server level and create a publication that scopes which tables are replicated.

In postgresql.conf, set:

ini
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4

Restart PostgreSQL after changing wal_level. Then create a dedicated replication user and a publication:

sql
-- Create replication user
CREATE USER debezium_user REPLICATION LOGIN PASSWORD 'strong_password_here';

-- Grant SELECT on the tables you want to replicate
GRANT SELECT ON TABLE public.orders, public.order_items, public.customers TO debezium_user;

-- Create a publication scoped to specific tables
CREATE PUBLICATION debezium_pub FOR TABLE public.orders, public.order_items, public.customers;

-- Verify
SELECT * FROM pg_publication;
Warning

Do not use FOR ALL TABLES in production unless you intend to replicate every table, including internal ones. Scope your publication explicitly to avoid replicating system tables or tables with sensitive data you do not want in your analytics cluster.

Debezium will create its own replication slot automatically on first connect. You can also pre-create it:

sql
-- Pre-create the replication slot (Debezium will use this name from config)
SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

-- Check existing slots and their retained WAL
SELECT slot_name, plugin, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
Warning

Replication slots prevent PostgreSQL from discarding WAL segments until the slot consumer has confirmed them. If Debezium goes offline for an extended period, retained WAL will grow unbounded and can fill your disk. Set max_slot_wal_keep_size (PostgreSQL 13+) to impose a hard limit, accepting that the slot may be invalidated if that limit is hit.


Debezium Connector Configuration

Deploy Debezium as a Kafka Connect connector. The connector configuration is submitted as a JSON payload to the Kafka Connect REST API:

json
{
  "name": "postgres-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",

    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium_user",
    "database.password": "strong_password_here",
    "database.dbname": "production",
    "database.server.name": "prod_pg",

    "plugin.name": "pgoutput",
    "publication.name": "debezium_pub",
    "slot.name": "debezium_slot",

    "table.include.list": "public.orders,public.order_items,public.customers",

    "topic.prefix": "cdc",
    "topic.creation.enable": "true",
    "topic.creation.default.replication.factor": "3",
    "topic.creation.default.partitions": "6",

    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "true",
    "value.converter.schemas.enable": "true",

    "heartbeat.interval.ms": "10000",
    "snapshot.mode": "initial"
  }
}

Submit it with:

bash
curl -X POST http://kafka-connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d @connector-config.json

With topic.prefix = cdc and a table named public.orders, Debezium publishes to the Kafka topic cdc.public.orders. Each message payload includes the full before and after row state, the operation type (c for create, u for update, d for delete), and a source timestamp.

Tip

Set snapshot.mode to initial on first deploy to backfill existing rows, then change to never if you need to re-deploy the connector without re-snapshotting. For very large tables, consider initial_only followed by switching to never after the snapshot completes.


Kafka in the Middle

Kafka's role here is durable buffering. Key configuration decisions:

  • Partitions: Use 6+ partitions per topic to allow ClickHouse to parallelize consumption across multiple threads.
  • Retention: Set topic retention to at least 7 days (retention.ms = 604800000). This gives you a replay window if ClickHouse ingestion fails or needs to be rebuilt.
  • Replication factor: Use 3 for production. A factor of 1 means a single broker failure causes data loss.
  • Compression: Enable lz4 or snappy on the topic. CDC events are highly compressible JSON and compression significantly reduces broker storage and network I/O.
bash
# Update retention on an existing topic
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics \
  --entity-name cdc.public.orders \
  --alter \
  --add-config retention.ms=604800000,compression.type=lz4

ClickHouse Kafka Engine Table

ClickHouse has a built-in Kafka table engine that acts as a consumer. You create a "virtual" Kafka table that reads from the topic, then pipe it into a real MergeTree table via a materialized view.

First, create the Kafka engine table:

sql
CREATE TABLE kafka_orders_raw
(
    before       Nullable(String),
    after        String,
    op           String,   -- 'c', 'u', 'd', 'r' (read/snapshot)
    ts_ms        Int64,
    source_table String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka-1:9092,kafka-2:9092,kafka-3:9092',
    kafka_topic_list = 'cdc.public.orders',
    kafka_group_name = 'clickhouse_orders_consumer',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 3,
    kafka_skip_broken_messages = 10;
Tip

Set kafka_num_consumers to match the number of Kafka topic partitions you want ClickHouse to consume in parallel. Each consumer thread handles one or more partitions. Going above the partition count wastes resources without improving throughput.

Next, create the destination table using ReplacingMergeTree:

sql
CREATE TABLE orders
(
    order_id         UInt64,
    customer_id      UInt64,
    status           LowCardinality(String),
    total_amount     Decimal(18, 4),
    created_at       DateTime64(3, 'UTC'),
    updated_at       DateTime64(3, 'UTC'),
    is_deleted       UInt8 DEFAULT 0,
    _cdc_version     Int64   -- sourced from Debezium ts_ms
)
ENGINE = ReplacingMergeTree(_cdc_version)
PARTITION BY toYYYYMM(created_at)
ORDER BY (customer_id, order_id);

Finally, create the materialized view that bridges the two:

sql
CREATE MATERIALIZED VIEW mv_orders_from_kafka TO orders AS
SELECT
    JSONExtractUInt(after, 'order_id')                                 AS order_id,
    JSONExtractUInt(after, 'customer_id')                              AS customer_id,
    JSONExtractString(after, 'status')                                 AS status,
    JSONExtractFloat(after, 'total_amount')                            AS total_amount,
    fromUnixTimestamp64Milli(JSONExtractInt64(after, 'created_at'))    AS created_at,
    fromUnixTimestamp64Milli(JSONExtractInt64(after, 'updated_at'))    AS updated_at,
    if(op = 'd', 1, 0)                                                 AS is_deleted,
    ts_ms                                                              AS _cdc_version
FROM kafka_orders_raw
WHERE op IN ('c', 'u', 'r', 'd');
Warning

ReplacingMergeTree deduplicates asynchronously during merges, not immediately on insert. Queries against recently inserted data may return duplicate rows until ClickHouse completes a merge. Use FINAL in queries (SELECT ... FROM orders FINAL WHERE ...) to force deduplication at query time, accepting the performance trade-off, or schedule periodic OPTIMIZE TABLE orders FINAL runs during low-traffic windows.


PostgreSQL to ClickHouse Type Mapping

Debezium serializes PostgreSQL types into JSON or Avro. When consuming JSON in ClickHouse, you parse these with JSONExtract* functions. The type mapping to be aware of:

PostgreSQL Type Debezium JSON Representation ClickHouse Type
integer, bigint JSON number Int32, Int64
numeric, decimal JSON string (to preserve precision) Decimal(P, S)
timestamp without time zone Microseconds since epoch (Int64) DateTime64(6, 'UTC')
timestamp with time zone Microseconds since epoch (Int64) DateTime64(6, 'UTC')
date Days since epoch (Int32) Date
boolean JSON boolean UInt8
uuid JSON string UUID or String
jsonb, json JSON string (escaped) String (parse with JSONExtract*)
text[], int[] JSON array string Array(String), Array(Int64)
enum JSON string LowCardinality(String)
Warning

PostgreSQL numeric/decimal values are serialized as strings by Debezium to avoid floating-point precision loss. In your ClickHouse materialized view, use toDecimal128(JSONExtractString(after, 'price'), 4) rather than JSONExtractFloat, which will silently lose precision on financial data.


Monitoring the Pipeline

A CDC pipeline has three distinct lag metrics you need to track. Missing any one of them means you can have an outage without knowing it.

1. Kafka consumer group lag (Debezium falling behind PostgreSQL WAL):

bash
kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --describe \
  --group clickhouse_orders_consumer

Alert when lag exceeds your SLA threshold — typically a few thousand messages for low-latency pipelines.

2. Replication slot retained WAL (PostgreSQL side):

sql
SELECT
    slot_name,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS unflushed_wal
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';

Alert when retained WAL exceeds 5 GB or when active = false.

3. ClickHouse Kafka engine consumption errors:

sql
-- Check for parse errors in Kafka engine consumption
SELECT *
FROM system.kafka_consumers
WHERE database = 'analytics'
  AND table = 'kafka_orders_raw';

-- Check replication queue for any stuck parts
SELECT *
FROM system.replication_queue
WHERE table = 'orders'
ORDER BY create_time DESC
LIMIT 20;

End-to-end latency in this architecture is typically under 5 seconds under normal PostgreSQL write load. The dominant contributors are Debezium's polling interval (configurable, default 500ms), Kafka producer batching (tunable), and ClickHouse's Kafka engine poll interval (kafka_poll_timeout_ms, default 500ms). All three can be tuned lower at the cost of increased overhead.


ClickHousePipe: The Managed Alternative

Self-hosting Debezium, Kafka Connect, and ClickHouse together requires operational expertise across three different systems. ClickHousePipe is ClickHouse Cloud's managed CDC service that handles the connector infrastructure for you. You configure the source PostgreSQL connection, the target ClickHouse table, and the service manages replication slots, Kafka internals, schema mapping, and failover.

The trade-off is less flexibility — you have fewer knobs to tune schema mapping edge cases or custom transformations — but the operational overhead is dramatically lower for teams that do not want to run Kafka infrastructure. For greenfield projects or teams with limited DevOps capacity, ClickHousePipe is worth evaluating before committing to the self-hosted stack.

Tip

Even if you use ClickHousePipe, understanding the underlying architecture described in this post is valuable. When you hit schema evolution issues, latency spikes, or data gaps, the debugging mental model is identical — you are still reasoning about WAL positions, consumer lag, and MergeTree merge behavior.


Key Takeaways

Key Takeaways
  • PostgreSQL logical replication with pgoutput plugin is the standard, stable foundation — enable it at the server level and scope your publication to specific tables.
  • Debezium handles WAL reading complexity, reconnection, and schema evolution; configure it with heartbeats and cap retained WAL with max_slot_wal_keep_size.
  • Kafka's role is decoupling and durability — set retention to at least 7 days so you have a replay window for ClickHouse failures.
  • ClickHouse Kafka engine + materialized views is a zero-dependency way to consume topics; no separate connector process needed on the ClickHouse side.
  • ReplacingMergeTree with a CDC timestamp as the version column handles upserts; use FINAL at query time or schedule OPTIMIZE for deduplication.
  • Map PostgreSQL numeric/decimal via string extraction, not float, to preserve financial precision.
  • Monitor three things: Kafka consumer lag, PostgreSQL replication slot retained WAL size, and ClickHouse system.kafka_consumers error counts.
  • End-to-end latency under 5 seconds is achievable without aggressive tuning; sub-second latency is possible with reduced poll intervals at higher infrastructure cost.
  • ClickHousePipe is a managed alternative that eliminates Kafka and Debezium operations for teams that prefer less infrastructure ownership.

Need Help Designing Your Analytics Pipeline?

Building a CDC pipeline involves decisions that compound over time: partition count, schema evolution strategy, delete handling, and backfill procedures all interact in ways that are hard to predict upfront. At JusDB, we help data engineering teams design and implement PostgreSQL-to-ClickHouse pipelines that handle production edge cases — schema changes, high-volume delete workloads, multi-region replication, and BI tool integration.

Whether you are evaluating the self-hosted Debezium/Kafka stack or a managed service like ClickHousePipe, we can help you make the right architectural trade-offs for your scale and team. Get in touch to discuss your use case, or explore our database consulting services to see how we work with engineering teams on analytics infrastructure.

Share this article