ClickHouse's MergeTree engine is why ClickHouse queries 1 billion rows in 0.3 seconds. Understanding how MergeTree stores, merges, and queries data isn't optional if you're running ClickHouse in production — the wrong table design multiplies your storage costs and kills query performance simultaneously.
- MergeTree stores data in sorted parts that merge asynchronously — writes are always fast, queries use sparse indexes
- The PRIMARY KEY controls data sorting and pruning; ORDER BY and PRIMARY KEY are often the same but can differ
- ReplicatedMergeTree adds automatic replication without a separate replication layer
- Choose the right MergeTree variant: SummingMergeTree for pre-aggregation, ReplacingMergeTree for deduplication
How MergeTree Storage Works
Parts and Merges
Every INSERT to a MergeTree table creates an immutable data part directory. ClickHouse merges these parts in the background, sorting rows by the ORDER BY key. The sparse primary index marks granule boundaries (every 8192 rows by default), enabling range queries to skip irrelevant granules entirely.
-- Basic MergeTree table
CREATE TABLE events
(
event_date Date,
event_ts DateTime64(3),
user_id UInt64,
event_type LowCardinality(String),
revenue_usd Decimal(10, 2),
properties Map(String, String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id, event_ts)
SETTINGS index_granularity = 8192;Choosing the ORDER BY Key
The ORDER BY key is the most critical design decision in ClickHouse. Queries that filter on high-cardinality columns in the ORDER BY prefix can skip most of the data. Queries that filter on columns NOT in the ORDER BY require full scans or secondary indexes.
-- GOOD: queries filter on event_date and user_id (first two ORDER BY columns)
SELECT count(), sum(revenue_usd)
FROM events
WHERE event_date BETWEEN '2025-01-01' AND '2025-01-31'
AND user_id = 12345;
-- ClickHouse skips all granules outside this date+user range
-- BAD: filtering on a column not in ORDER BY prefix
SELECT count() FROM events
WHERE event_type = 'purchase'; -- full scan required
-- Fix: add event_type to ORDER BY, or use a secondary indexSecondary Indexes (Skipping Indexes)
-- Add a bloom filter index for low-selectivity columns
ALTER TABLE events
ADD INDEX idx_event_type (event_type)
TYPE bloom_filter(0.01)
GRANULARITY 4;
-- minmax index for range queries on non-key columns
ALTER TABLE events
ADD INDEX idx_revenue (revenue_usd)
TYPE minmax
GRANULARITY 1;
-- Materialize the index on existing data
ALTER TABLE events MATERIALIZE INDEX idx_event_type;MergeTree Variants for Common Patterns
ReplicatedMergeTree
-- Replicated table (uses ZooKeeper/ClickHouse Keeper for coordination)
CREATE TABLE events ON CLUSTER my_cluster
(
event_date Date,
user_id UInt64,
event_type String,
revenue_usd Decimal(10,2)
)
ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/events', -- ZooKeeper path
'{replica}' -- Replica name (from macros)
)
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);SummingMergeTree for Pre-Aggregation
-- Pre-aggregate revenue by user and day
CREATE TABLE user_daily_revenue
(
event_date Date,
user_id UInt64,
revenue_usd Decimal(10,2), -- summed during merge
event_count UInt64 -- summed during merge
)
ENGINE = SummingMergeTree()
ORDER BY (event_date, user_id);
-- INSERT raw data; ClickHouse sums during background merge
INSERT INTO user_daily_revenue VALUES ('2025-01-15', 42, 19.99, 1);
INSERT INTO user_daily_revenue VALUES ('2025-01-15', 42, 49.99, 1);
-- After merge: single row with revenue=69.98, count=2
-- Query must use sum() to handle pre-merge state
SELECT user_id, sum(revenue_usd), sum(event_count)
FROM user_daily_revenue
WHERE event_date = '2025-01-15'
GROUP BY user_id;Merges in ClickHouse happen asynchronously. Until parts merge, SummingMergeTree can have duplicate rows. Always use sum() aggregation in queries against SummingMergeTree — never SELECT *.
Partition Management
-- Drop a partition (instant, no row-by-row delete)
ALTER TABLE events DROP PARTITION 202412;
-- Detach (move to separate storage, reattach later)
ALTER TABLE events DETACH PARTITION 202411;
-- Move partition to cheaper storage tier
ALTER TABLE events MOVE PARTITION 202409 TO VOLUME 'cold_storage';
-- Check partition sizes
SELECT
partition,
count() AS parts,
sum(rows) AS total_rows,
formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE table = 'events' AND active = 1
GROUP BY partition
ORDER BY partition DESC;- The ORDER BY key controls data sorting and query pruning — it is the single most important design decision for a ClickHouse table.
- Secondary (skipping) indexes with bloom filters accelerate queries on columns not in the ORDER BY prefix.
- SummingMergeTree and ReplacingMergeTree eliminate the need for external aggregation and deduplication pipelines.
- Partition management with DROP/DETACH PARTITION is instant — use daily or monthly partitions to enable cheap data retention policies.
Working with JusDB on ClickHouse
JusDB designs ClickHouse table schemas, tunes MergeTree configurations, and optimizes ORDER BY key selection for engineering teams building analytics pipelines. We ensure your ClickHouse deployment delivers sub-second query performance at TB scale without over-provisioning hardware.
Explore JusDB Analytics Services → | Talk to a DBA
Related reading: