Analytics & OLAP

ClickHouse MergeTree Engine: Storage Architecture and Query Optimization

ClickHouse MergeTree is why ClickHouse queries 1 billion rows in 0.3 seconds. Learn how parts, merges, and sparse indexes work, and how to choose the ORDER BY key, secondary indexes, and MergeTree variants.

JusDB Team
September 2, 2025
10 min read
183 views

ClickHouse's MergeTree engine is why ClickHouse queries 1 billion rows in 0.3 seconds. Understanding how MergeTree stores, merges, and queries data isn't optional if you're running ClickHouse in production — the wrong table design multiplies your storage costs and kills query performance simultaneously.

TL;DR
  • MergeTree stores data in sorted parts that merge asynchronously — writes are always fast, queries use sparse indexes
  • The PRIMARY KEY controls data sorting and pruning; ORDER BY and PRIMARY KEY are often the same but can differ
  • ReplicatedMergeTree adds automatic replication without a separate replication layer
  • Choose the right MergeTree variant: SummingMergeTree for pre-aggregation, ReplacingMergeTree for deduplication

How MergeTree Storage Works

Parts and Merges

Every INSERT to a MergeTree table creates an immutable data part directory. ClickHouse merges these parts in the background, sorting rows by the ORDER BY key. The sparse primary index marks granule boundaries (every 8192 rows by default), enabling range queries to skip irrelevant granules entirely.

sql
-- Basic MergeTree table
CREATE TABLE events
(
    event_date   Date,
    event_ts     DateTime64(3),
    user_id      UInt64,
    event_type   LowCardinality(String),
    revenue_usd  Decimal(10, 2),
    properties   Map(String, String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id, event_ts)
SETTINGS index_granularity = 8192;

Choosing the ORDER BY Key

The ORDER BY key is the most critical design decision in ClickHouse. Queries that filter on high-cardinality columns in the ORDER BY prefix can skip most of the data. Queries that filter on columns NOT in the ORDER BY require full scans or secondary indexes.

sql
-- GOOD: queries filter on event_date and user_id (first two ORDER BY columns)
SELECT count(), sum(revenue_usd)
FROM events
WHERE event_date BETWEEN '2025-01-01' AND '2025-01-31'
  AND user_id = 12345;
-- ClickHouse skips all granules outside this date+user range

-- BAD: filtering on a column not in ORDER BY prefix
SELECT count() FROM events
WHERE event_type = 'purchase';  -- full scan required
-- Fix: add event_type to ORDER BY, or use a secondary index

Secondary Indexes (Skipping Indexes)

sql
-- Add a bloom filter index for low-selectivity columns
ALTER TABLE events
ADD INDEX idx_event_type (event_type)
TYPE bloom_filter(0.01)
GRANULARITY 4;

-- minmax index for range queries on non-key columns
ALTER TABLE events
ADD INDEX idx_revenue (revenue_usd)
TYPE minmax
GRANULARITY 1;

-- Materialize the index on existing data
ALTER TABLE events MATERIALIZE INDEX idx_event_type;

MergeTree Variants for Common Patterns

ReplicatedMergeTree

sql
-- Replicated table (uses ZooKeeper/ClickHouse Keeper for coordination)
CREATE TABLE events ON CLUSTER my_cluster
(
    event_date   Date,
    user_id      UInt64,
    event_type   String,
    revenue_usd  Decimal(10,2)
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',  -- ZooKeeper path
    '{replica}'                            -- Replica name (from macros)
)
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

SummingMergeTree for Pre-Aggregation

sql
-- Pre-aggregate revenue by user and day
CREATE TABLE user_daily_revenue
(
    event_date   Date,
    user_id      UInt64,
    revenue_usd  Decimal(10,2),  -- summed during merge
    event_count  UInt64          -- summed during merge
)
ENGINE = SummingMergeTree()
ORDER BY (event_date, user_id);

-- INSERT raw data; ClickHouse sums during background merge
INSERT INTO user_daily_revenue VALUES ('2025-01-15', 42, 19.99, 1);
INSERT INTO user_daily_revenue VALUES ('2025-01-15', 42, 49.99, 1);
-- After merge: single row with revenue=69.98, count=2

-- Query must use sum() to handle pre-merge state
SELECT user_id, sum(revenue_usd), sum(event_count)
FROM user_daily_revenue
WHERE event_date = '2025-01-15'
GROUP BY user_id;
Warning

Merges in ClickHouse happen asynchronously. Until parts merge, SummingMergeTree can have duplicate rows. Always use sum() aggregation in queries against SummingMergeTree — never SELECT *.

Partition Management

sql
-- Drop a partition (instant, no row-by-row delete)
ALTER TABLE events DROP PARTITION 202412;

-- Detach (move to separate storage, reattach later)
ALTER TABLE events DETACH PARTITION 202411;

-- Move partition to cheaper storage tier
ALTER TABLE events MOVE PARTITION 202409 TO VOLUME 'cold_storage';

-- Check partition sizes
SELECT
    partition,
    count() AS parts,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'events' AND active = 1
GROUP BY partition
ORDER BY partition DESC;
Key Takeaways
  • The ORDER BY key controls data sorting and query pruning — it is the single most important design decision for a ClickHouse table.
  • Secondary (skipping) indexes with bloom filters accelerate queries on columns not in the ORDER BY prefix.
  • SummingMergeTree and ReplacingMergeTree eliminate the need for external aggregation and deduplication pipelines.
  • Partition management with DROP/DETACH PARTITION is instant — use daily or monthly partitions to enable cheap data retention policies.

Working with JusDB on ClickHouse

JusDB designs ClickHouse table schemas, tunes MergeTree configurations, and optimizes ORDER BY key selection for engineering teams building analytics pipelines. We ensure your ClickHouse deployment delivers sub-second query performance at TB scale without over-provisioning hardware.

Explore JusDB Analytics Services →  |  Talk to a DBA

Related reading:

Share this article

JusDB Team

Official JusDB content team