ClickHouse Explained 2026: MergeTree, Distributed & Real-Time OLAP

Q: What is ClickHouse and what is it best for?

ClickHouse is an open-source columnar OLAP database built for high-throughput analytics over very large datasets. It excels at append-mostly workloads (logs, events, metrics, time-series) and sub-second aggregations over billions of rows. Best for real-time dashboards, log analytics, customer-facing analytics, and ad-hoc OLAP. Avoid for OLTP / transactional workloads with frequent updates.

Q: How does MergeTree work in ClickHouse?

MergeTree is the foundational table engine. Data is sorted by the ORDER BY key and written to small immutable parts on disk. A background process asynchronously merges these parts into larger ones. Queries skip entire data parts via primary-key index granules (default 8192 rows per granule), so even a 10TB table scans a tiny fraction. Variants: ReplicatedMergeTree (HA), ReplacingMergeTree (idempotent inserts), AggregatingMergeTree (pre-aggregated rollups), CollapsingMergeTree (logical deletes).

Q: ClickHouse vs PostgreSQL — when to use which?

ClickHouse wins on read-heavy analytics: 10-100× faster on aggregations, 5-10× better compression. PostgreSQL wins on transactional workloads, complex JOINs, frequent updates, and rich SQL features. Pattern: PostgreSQL for OLTP, ClickHouse for the analytics layer; replicate via Debezium or Materialize.

Q: How does ClickHouse handle updates and deletes?

ClickHouse is append-optimized; updates are expensive. Two options: (1) ALTER TABLE ... UPDATE/DELETE mutations — async, expensive, rewrite entire parts. (2) ReplacingMergeTree + FINAL — insert new versions, merge collapses duplicates by sort key. ClickHouse 23.3+ added lightweight deletes ( DELETE FROM ... WHERE ... ) which mark rows for deletion without rewriting parts.

Q: What's the difference between ClickHouse, StarRocks, and Druid?

ClickHouse : best single-table aggregations, weakest on JOINs. StarRocks : 5-10× better JOIN performance via cost-based optimizer + vectorized engine — better for star/snowflake schemas. Druid : real-time ingestion + time-series specialization; weaker general SQL. Use ClickHouse for flat-event analytics, StarRocks for JOIN-heavy lakehouse queries, Druid for streaming dashboards with sub-second SLAs.

Q: Does ClickHouse support vector search?

Yes, as of ClickHouse 23.5+ with the Annoy index and 24.x's USearch index . Vector columns are Array(Float32) with L2Distance / cosineDistance functions and ANN index types ( annoy , usearch ). Production scale: ~10M vectors with sub-100ms p99. For larger scale, dedicated vector DBs (Qdrant, Weaviate) still win, but for hybrid analytics-+-vector workloads ClickHouse is uniquely well-suited.

Q: How do I deploy ClickHouse in production?

Three paths: (1) ClickHouse Cloud — managed by ClickHouse Inc, autoscaling, $0.36/CPU-hr. (2) Self-host with ClickHouse Keeper (Raft-based, replaces ZooKeeper since 23.3) on Kubernetes via the Altinity Operator or clickhouse-operator . (3) Bare-metal with manual ReplicatedMergeTree setup. Production rule: 3-node minimum for HA, 16+ CPU cores per node, 10G NVMe storage, separate hot/cold storage tiers via storage_policy .

Q: What are common ClickHouse pitfalls in production?

(1) Wrong sort key — picks affect compression + query speed by 10×. Choose by query predicate frequency. (2) Too many small inserts — batch to ≥10k rows or use async_insert. (3) Cardinality explosions in dictionaries / Nullable columns — eats RAM. (4) JOIN-heavy queries — ClickHouse's hash JOIN is single-threaded by default; use SET join_algorithm = 'partial_merge' or migrate JOIN-heavy workloads to StarRocks. (5) Schema migrations on multi-TB tables — mutations are expensive; design schema once, evolve via views.

TL;DR — ClickHouse in 60 seconds: ClickHouse is a columnar OLAP database designed for sub-second analytics on billions of rows. Its core is the MergeTree family of table engines — sort keys, partitioning, and async merges in the background give it 10-100× the throughput of Postgres or MySQL for aggregations. The default ReplicatedMergeTree handles HA via ZooKeeper / ClickHouse Keeper; Distributed engine shards across nodes; ReplacingMergeTree handles updates; AggregatingMergeTree pre-computes rollups. ClickHouse 24.x adds vector search (cosine, L2), parallel replicas, and lakehouse access (Iceberg, Delta). Use it for: real-time dashboards, log analytics, customer-facing analytics, ad-hoc OLAP. Skip it for: OLTP, frequent row updates, JOIN-heavy star schemas (use StarRocks).

ClickHouse MergeTree storage architecture diagram showing columnar storage and merge process

A fintech client came to us last year with a straightforward problem: their PostgreSQL-based reporting pipeline was taking 45 minutes to run a daily aggregation across 800 million transaction rows. They'd already thrown hardware at it — r6g.8xlarge, 256GB RAM, optimized indexes. Still 45 minutes.

We moved the analytics workload to ClickHouse. The same query: 4.2 seconds.

That's not a cherry-picked edge case. That's what columnar storage + vectorized execution does when matched to the right workload. This guide explains when ClickHouse is the right call, how to architect it correctly, and — equally important — when not to use it.

TL;DR

ClickHouse is a columnar OLAP database — not a replacement for PostgreSQL or MySQL
Best for: analytics on billions of rows, log aggregation, dashboards, ad-tech, IoT telemetry
Wrong choice for: transactional workloads, frequent single-row updates, complex JOINs across many tables
MergeTree engine selection is the most important architectural decision — get it wrong and you'll fight performance forever
Materialized Views are the secret weapon — most teams underuse them

What Makes ClickHouse Different

Traditional databases like PostgreSQL and MySQL store data row by row. When you run SELECT sum(revenue) FROM orders WHERE date > '2024-01-01', the database reads every column of every matching row, even if you only need two columns. For OLAP queries on wide tables, this is enormously wasteful.

ClickHouse stores data column by column. That same query reads only the revenue and date columns — skipping everything else. On a 200-column table, that's the difference between scanning 200× the data you need versus exactly the data you need.

Combine that with:

Vectorized execution: Processes 1,024 values at a time using SIMD CPU instructions — dramatically higher throughput per clock cycle than row-at-a-time processing
LZ4/ZSTD compression: Columnar data compresses 5–10× better than row data, meaning less I/O on every query
Data skipping indexes: Each part stores min/max values per block, allowing the query engine to skip entire blocks of data without reading them
Distributed execution: Queries parallelize across all shards and replicas automatically

The result is a database that can aggregate billions of rows in seconds on commodity hardware.

Architecture Deep Dive

The MergeTree Engine Family

MergeTree is the core storage engine. Everything else is built on it. Choosing the wrong variant is the single most common mistake we see in ClickHouse deployments:

Engine	Use Case	Key Behavior
MergeTree	Immutable event data (logs, clickstream)	Baseline. Keeps all rows, sorts by ORDER BY key
ReplacingMergeTree	Latest-state records (user profiles, inventory)	Deduplicates on primary key during background merge
SummingMergeTree	Pre-aggregated counters	Merges rows by summing numeric columns
AggregatingMergeTree	Materialized View backing	Stores partial aggregation states, merges incrementally
CollapsingMergeTree	Mutable rows with explicit delete markers	Sign column (+1/-1) controls row visibility
ReplicatedMergeTree	Any of the above with HA	Adds ZooKeeper/ClickHouse Keeper replication

Our recommendation for most teams: Start with ReplicatedMergeTree for all production tables. The replication overhead is minimal and you avoid a painful migration later if you need HA.

How Inserts and Merges Work

This is the part that catches teams off-guard: ClickHouse doesn't write data directly to its final location. Inserts create small parts on disk. Background merge threads continuously consolidate these parts into larger, sorted, compressed parts.

text

-- Check current part count and size
SELECT
    table,
    count() AS parts,
    formatReadableSize(sum(bytes_on_disk)) AS disk_size,
    sum(rows) AS total_rows
FROM system.parts
WHERE active AND database = 'analytics'
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;

If you insert in tiny batches (one row at a time), you'll create thousands of parts and hit "Too many parts" errors. Minimum batch size for production: 10,000 rows per INSERT. Ideal: 100,000–1,000,000 rows per batch.

text

-- WRONG: one row at a time (destroys performance)
INSERT INTO events VALUES (now(), 1001, '/home', 30);

-- RIGHT: batch inserts via HTTP interface
echo "2025-01-15 10:00:00,1001,/home,30
2025-01-15 10:00:01,1002,/pricing,15" | \
  clickhouse-client --query="INSERT INTO events FORMAT CSV"

-- Or batch from application
client.insert('events', rows_list, column_names=['event_time','user_id','url','duration'])

Schema Design: Getting the ORDER BY Right

The ORDER BY (primary key) in ClickHouse is not just for uniqueness — it determines the physical sort order of data on disk. Queries that filter on the ORDER BY columns can use the sparse primary index to skip entire granules (8,192 rows by default). Queries that filter on other columns do full scans.

text

-- Log analytics table — good ORDER BY
CREATE TABLE access_logs (
    event_date  Date,
    event_time  DateTime,
    user_id     UInt64,
    url         String,
    status_code UInt16,
    duration_ms UInt32,
    bytes_sent  UInt64
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/access_logs', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id, event_time)  -- most common filter columns first
SETTINGS index_granularity = 8192;

Rules for ORDER BY selection:

Put the highest-cardinality filter column last, lowest-cardinality first (e.g., date, tenant_id, user_id)
Time is almost always the first or second column for time-series data
Once set, ORDER BY cannot be changed without rewriting the table — get it right upfront

Partition Design

text

-- Monthly partitions for log data (good default)
PARTITION BY toYYYYMM(event_date)

-- Daily partitions for high-volume data (> 1TB/day)
PARTITION BY toDate(event_time)

-- Drop old partitions instantly (no DELETE overhead)
ALTER TABLE access_logs DROP PARTITION '202401';

Partitions are the unit of data management in ClickHouse. Dropping a partition is instantaneous — it just removes a directory. This makes TTL management trivial for time-series data.

Materialized Views: The Secret Weapon

Most teams use ClickHouse for raw event storage and run aggregation queries at query time. Teams that really squeeze performance use Materialized Views to pre-aggregate data at insert time.

text

-- Raw events table
CREATE TABLE events (
    event_time DateTime,
    user_id    UInt64,
    url        String,
    duration   UInt32
) ENGINE = ReplicatedMergeTree(...)
PARTITION BY toDate(event_time)
ORDER BY (toDate(event_time), url);

-- Pre-aggregated view: hourly URL stats
CREATE MATERIALIZED VIEW url_stats_hourly
ENGINE = AggregatingMergeTree()
PARTITION BY toDate(hour)
ORDER BY (hour, url)
AS SELECT
    toStartOfHour(event_time) AS hour,
    url,
    countState() AS visit_count,
    avgState(duration) AS avg_duration
FROM events
GROUP BY hour, url;

-- Query the materialized view — sub-millisecond for billions of source rows
SELECT
    hour,
    url,
    countMerge(visit_count) AS visits,
    round(avgMerge(avg_duration), 0) AS avg_ms
FROM url_stats_hourly
WHERE hour >= toStartOfDay(today())
GROUP BY hour, url
ORDER BY visits DESC
LIMIT 20;

With this pattern, your dashboard query hits pre-aggregated data that's updated incrementally as events arrive. We've used this to power dashboards showing real-time metrics over 50 billion events with sub-100ms query times.

Query Optimization

Use EXPLAIN to understand query plans

text

EXPLAIN indexes = 1
SELECT url, count() AS views
FROM access_logs
WHERE event_date >= '2025-01-01'
  AND user_id = 1001
GROUP BY url
ORDER BY views DESC;

-- Look for:
-- ReadFromMergeTree (parts read vs skipped)
-- Keys condition: shows which index conditions apply
-- Initial parts: X, Selected parts: Y  (Y << X = good)

Data Skipping Indexes for non-ORDER BY columns

text

-- Add a bloom filter index for string equality lookups
ALTER TABLE access_logs ADD INDEX url_bloom_idx url
TYPE bloom_filter(0.01) GRANULARITY 1;

-- Minmax index for range queries on numeric columns
ALTER TABLE access_logs ADD INDEX duration_minmax duration
TYPE minmax GRANULARITY 4;

-- Materialize the index on existing data
ALTER TABLE access_logs MATERIALIZE INDEX url_bloom_idx;

Avoid these patterns

text

-- BAD: wildcard at start of LIKE (full scan, no index)
SELECT * FROM access_logs WHERE url LIKE '%checkout%';

-- BETTER: use startsWith or position()
SELECT * FROM access_logs WHERE startsWith(url, '/checkout');

-- BAD: high-cardinality GROUP BY without ORDER BY alignment
SELECT user_id, count() FROM events GROUP BY user_id;  -- scans everything

-- BETTER: filter by date first to limit scan range
SELECT user_id, count() FROM events
WHERE event_date = today()
GROUP BY user_id;

ClickHouse vs PostgreSQL vs StarRocks

Dimension	ClickHouse	PostgreSQL	StarRocks
Storage model	Columnar	Row (+ columnar via extensions)	Columnar
OLAP query speed	Excellent	Moderate	Excellent
OLTP (transactions)	Not suitable	Excellent	Limited
Concurrency (many users)	Moderate	Good	Excellent (CN nodes)
Lakehouse (Iceberg/Parquet)	Via table engines	Via foreign data wrappers	Native
UPDATE/DELETE	Heavy (use CollapsingMergeTree)	Native MVCC	Primary key updates supported
Best for	Log analytics, observability, ad-tech	Transactional apps, mixed workloads	Real-time dashboards, API serving

Many of our clients run ClickHouse alongside PostgreSQL — Postgres handles writes and OLTP, ClickHouse handles analytics and reporting. The two complement each other well, especially with a CDC pipeline (via Flink CDC) streaming changes from Postgres into ClickHouse.

Deployment Patterns

Single Node (Development / Small Scale)

text

# Docker quickstart
docker run -d \
  --name clickhouse-server \
  -p 8123:8123 \
  -p 9000:9000 \
  -v $(pwd)/data:/var/lib/clickhouse \
  clickhouse/clickhouse-server:latest

# Connect
clickhouse-client --host localhost --port 9000

Replicated Cluster (Production)

text

-- config.xml: define cluster topology
<remote_servers>
  <analytics_cluster>
    <shard>
      <replica>
        <host>ch-node-01</host>
        <port>9000</port>
      </replica>
      <replica>
        <host>ch-node-02</host>
        <port>9000</port>
      </replica>
    </shard>
    <shard>
      <replica>
        <host>ch-node-03</host>
        <port>9000</port>
      </replica>
      <replica>
        <host>ch-node-04</host>
        <port>9000</port>
      </replica>
    </shard>
  </analytics_cluster>
</remote_servers>

-- Create distributed table on top of local tables
CREATE TABLE events_dist AS events
ENGINE = Distributed('analytics_cluster', 'default', 'events', rand());

ClickHouse Cloud

ClickHouse Cloud handles sharding, replication, and upgrades automatically. For teams that don't want to operate the cluster themselves, it's the fastest path to production. Pricing is consumption-based — you pay for queries run, not idle capacity.

Monitoring ClickHouse

text

-- Active queries (who's running what)
SELECT query_id, user, elapsed, read_rows, memory_usage, query
FROM system.processes
ORDER BY elapsed DESC;

-- Part health (watch for too many parts)
SELECT table, count() AS parts, sum(rows) AS rows
FROM system.parts
WHERE active AND database = currentDatabase()
GROUP BY table
ORDER BY parts DESC;

-- Replication lag
SELECT table, is_leader, absolute_delay
FROM system.replicas
WHERE absolute_delay > 0;

-- Merge queue depth
SELECT table, count() AS pending_merges
FROM system.merges
GROUP BY table;

A healthy ClickHouse cluster has:

Parts per table: < 300 (alert above 1,000)
Replication lag: < 10 seconds
Merge queue: actively draining, not growing indefinitely
Memory per query: set max_memory_usage to avoid OOM on large queries

When NOT to Use ClickHouse

This is the part most guides skip. ClickHouse is the wrong tool when:

You need row-level updates frequently: Every UPDATE in ClickHouse rewrites entire parts. If you're updating individual records, use PostgreSQL.
You have complex multi-table JOINs: ClickHouse prefers wide, denormalized tables. Joining 10 normalized tables is painful compared to a star schema or pre-joined flat table.
Your query concurrency is high (100s of concurrent users): ClickHouse excels at a few heavy queries, not thousands of light ones. For high-concurrency serving, StarRocks or a materialized PostgreSQL cache may serve better.
Your insert volume is low (< 1M events/day): The operational overhead of ClickHouse is only worth it at scale. For smaller datasets, TimescaleDB on Postgres handles analytics workloads with less complexity.

Working with JusDB on ClickHouse

We've architected and operated ClickHouse clusters ranging from single-node setups handling 10M events/day to distributed clusters ingesting 500M events/day for ad-tech and observability platforms. Common engagements include:

Schema and MergeTree engine review: We see ORDER BY and engine selection mistakes in almost every self-built cluster we audit. A 2-hour review often unlocks 5–10× query speedups.
Materialized View design: Designing the right pre-aggregation strategy for your dashboard queries.
Migration from PostgreSQL or Cassandra: Moving analytics workloads to ClickHouse with zero data loss.
Production cluster operations: Monitoring, alerting, merge health, and incident response.

If your ClickHouse cluster is slow, parts are accumulating, or you're evaluating ClickHouse for a new analytics platform, talk to us. We're happy to do a free 30-minute architecture review.

Frequently Asked Questions

What is ClickHouse and what is it best for?

ClickHouse is an open-source columnar OLAP database built for high-throughput analytics over very large datasets. It excels at append-mostly workloads (logs, events, metrics, time-series) and sub-second aggregations over billions of rows. Best for real-time dashboards, log analytics, customer-facing analytics, and ad-hoc OLAP. Avoid for OLTP / transactional workloads with frequent updates.

How does MergeTree work in ClickHouse?

MergeTree is the foundational table engine. Data is sorted by the ORDER BY key and written to small immutable parts on disk. A background process asynchronously merges these parts into larger ones. Queries skip entire data parts via primary-key index granules (default 8192 rows per granule), so even a 10TB table scans a tiny fraction. Variants: ReplicatedMergeTree (HA), ReplacingMergeTree (idempotent inserts), AggregatingMergeTree (pre-aggregated rollups), CollapsingMergeTree (logical deletes).

ClickHouse vs PostgreSQL — when to use which?

ClickHouse wins on read-heavy analytics: 10-100× faster on aggregations, 5-10× better compression. PostgreSQL wins on transactional workloads, complex JOINs, frequent updates, and rich SQL features. Pattern: PostgreSQL for OLTP, ClickHouse for the analytics layer; replicate via Debezium or Materialize.

How does ClickHouse handle updates and deletes?

ClickHouse is append-optimized; updates are expensive. Two options: (1) ALTER TABLE ... UPDATE/DELETE mutations — async, expensive, rewrite entire parts. (2) ReplacingMergeTree + FINAL — insert new versions, merge collapses duplicates by sort key. ClickHouse 23.3+ added lightweight deletes (DELETE FROM ... WHERE ...) which mark rows for deletion without rewriting parts.

What's the difference between ClickHouse, StarRocks, and Druid?

ClickHouse: best single-table aggregations, weakest on JOINs. StarRocks: 5-10× better JOIN performance via cost-based optimizer + vectorized engine — better for star/snowflake schemas. Druid: real-time ingestion + time-series specialization; weaker general SQL. Use ClickHouse for flat-event analytics, StarRocks for JOIN-heavy lakehouse queries, Druid for streaming dashboards with sub-second SLAs.

Does ClickHouse support vector search?

Yes, as of ClickHouse 23.5+ with the Annoy index and 24.x's USearch index. Vector columns are Array(Float32) with L2Distance / cosineDistance functions and ANN index types (annoy, usearch). Production scale: ~10M vectors with sub-100ms p99. For larger scale, dedicated vector DBs (Qdrant, Weaviate) still win, but for hybrid analytics-+-vector workloads ClickHouse is uniquely well-suited.

How do I deploy ClickHouse in production?

Three paths: (1) ClickHouse Cloud — managed by ClickHouse Inc, autoscaling, $0.36/CPU-hr. (2) Self-host with ClickHouse Keeper (Raft-based, replaces ZooKeeper since 23.3) on Kubernetes via the Altinity Operator or clickhouse-operator. (3) Bare-metal with manual ReplicatedMergeTree setup. Production rule: 3-node minimum for HA, 16+ CPU cores per node, 10G NVMe storage, separate hot/cold storage tiers via storage_policy.

What are common ClickHouse pitfalls in production?

(1) Wrong sort key — picks affect compression + query speed by 10×. Choose by query predicate frequency. (2) Too many small inserts — batch to ≥10k rows or use async_insert. (3) Cardinality explosions in dictionaries / Nullable columns — eats RAM. (4) JOIN-heavy queries — ClickHouse's hash JOIN is single-threaded by default; use SET join_algorithm = 'partial_merge' or migrate JOIN-heavy workloads to StarRocks. (5) Schema migrations on multi-TB tables — mutations are expensive; design schema once, evolve via views.

ClickHouse Explained (2026): MergeTree, Distributed Engine & Real-Time OLAP

What Makes ClickHouse Different

Architecture Deep Dive

The MergeTree Engine Family

How Inserts and Merges Work

Schema Design: Getting the ORDER BY Right

Partition Design

Materialized Views: The Secret Weapon

Query Optimization

Use EXPLAIN to understand query plans

Data Skipping Indexes for non-ORDER BY columns

Avoid these patterns

ClickHouse vs PostgreSQL vs StarRocks

Deployment Patterns

Single Node (Development / Small Scale)

Replicated Cluster (Production)

ClickHouse Cloud

Monitoring ClickHouse

When NOT to Use ClickHouse

Working with JusDB on ClickHouse

Frequently Asked Questions

Share this article

JusDB Team

Keep reading

TimescaleDB Hypertables, Continuous Aggregates & Compression (2026 Production Guide)

StarRocks vs ClickHouse: Architecture, Table Models & When to Choose Each (2026)

StarRocks Monitoring & Alerting: The Complete Production Guide

Need Expert Help?

Database Performance Optimization

Apache Pinot

ClickHouse Services